libav.git - [no description]

	Commit message (Collapse)	Author	Age
*	dca_core: convert to lavu/tx	Lynne	2022-11-06
\| \| \| \| \|	Thanks to Martin Storsjö <martin@martin.st> for fixing and testing the arm32 and aarch64 changes.
*	lavc/aarch64: add hevc horizontal qpel/uni/bi	J. Dekker	2022-10-25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	checkasm --benchmark on Ampere Altra (Neoverse N1): put_hevc_qpel_bi_h4_8_c: 170.7 put_hevc_qpel_bi_h4_8_neon: 64.5 put_hevc_qpel_bi_h6_8_c: 373.7 put_hevc_qpel_bi_h6_8_neon: 130.2 put_hevc_qpel_bi_h8_8_c: 662.0 put_hevc_qpel_bi_h8_8_neon: 138.5 put_hevc_qpel_bi_h12_8_c: 1529.5 put_hevc_qpel_bi_h12_8_neon: 422.0 put_hevc_qpel_bi_h16_8_c: 2735.5 put_hevc_qpel_bi_h16_8_neon: 560.5 put_hevc_qpel_bi_h24_8_c: 6015.7 put_hevc_qpel_bi_h24_8_neon: 1636.0 put_hevc_qpel_bi_h32_8_c: 10779.0 put_hevc_qpel_bi_h32_8_neon: 2204.5 put_hevc_qpel_bi_h48_8_c: 24375.0 put_hevc_qpel_bi_h48_8_neon: 4984.0 put_hevc_qpel_bi_h64_8_c: 42768.0 put_hevc_qpel_bi_h64_8_neon: 8795.7 put_hevc_qpel_h4_8_c: 149.0 put_hevc_qpel_h4_8_neon: 55.7 put_hevc_qpel_h6_8_c: 321.2 put_hevc_qpel_h6_8_neon: 106.0 put_hevc_qpel_h8_8_c: 578.7 put_hevc_qpel_h8_8_neon: 133.2 put_hevc_qpel_h12_8_c: 1279.0 put_hevc_qpel_h12_8_neon: 391.7 put_hevc_qpel_h16_8_c: 2286.2 put_hevc_qpel_h16_8_neon: 519.7 put_hevc_qpel_h24_8_c: 5100.7 put_hevc_qpel_h24_8_neon: 1546.2 put_hevc_qpel_h32_8_c: 9022.0 put_hevc_qpel_h32_8_neon: 2060.2 put_hevc_qpel_h48_8_c: 20293.5 put_hevc_qpel_h48_8_neon: 4656.7 put_hevc_qpel_h64_8_c: 36037.0 put_hevc_qpel_h64_8_neon: 8262.7 put_hevc_qpel_uni_h4_8_c: 162.2 put_hevc_qpel_uni_h4_8_neon: 61.7 put_hevc_qpel_uni_h6_8_c: 355.2 put_hevc_qpel_uni_h6_8_neon: 114.2 put_hevc_qpel_uni_h8_8_c: 651.0 put_hevc_qpel_uni_h8_8_neon: 135.7 put_hevc_qpel_uni_h12_8_c: 1412.5 put_hevc_qpel_uni_h12_8_neon: 402.7 put_hevc_qpel_uni_h16_8_c: 2551.0 put_hevc_qpel_uni_h16_8_neon: 533.5 put_hevc_qpel_uni_h24_8_c: 5782.2 put_hevc_qpel_uni_h24_8_neon: 1578.7 put_hevc_qpel_uni_h32_8_c: 10586.5 put_hevc_qpel_uni_h32_8_neon: 2102.2 put_hevc_qpel_uni_h48_8_c: 23812.0 put_hevc_qpel_uni_h48_8_neon: 4739.5 put_hevc_qpel_uni_h64_8_c: 42958.7 put_hevc_qpel_uni_h64_8_neon: 8366.5 Signed-off-by: J. Dekker <jdek@itanimul.li>
*	aarch64: Implement stack spilling in a consistent way.	Reimar Döffinger	2022-10-11
\| \| \| \| \| \| \| \| \|	Currently it is done in several different ways, which might cause needless dependencies or in case of tx_float_neon.S is incorrect. Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
*	lavc/aarch64: Add neon implementation for vsse_intra8	Grzegorz Bernacki	2022-10-04
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation for vsse_intra8 for arm64. Performance tests are shown below. - vsse_5_c: 87.7 - vsse_5_neon: 26.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Provide optimized implementation of vsse8 for arm64.	Grzegorz Bernacki	2022-10-04
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation of vsse8 for arm64. Performance comparison tests are shown below. - vsse_1_c: 141.5 - vsse_1_neon: 32.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Provide neon implementation of nsse8	Grzegorz Bernacki	2022-10-04
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add vectorized implementation of nsse8 function. Performance comparison tests are shown below. - nsse_1_c: 256.0 - nsse_1_neon: 82.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for pix_abs8 functions.	Grzegorz Bernacki	2022-10-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below: pix_abs_1_1_c: 162.5 pix_abs_1_1_neon: 27.0 pix_abs_1_2_c: 174.0 pix_abs_1_2_neon: 23.5 pix_abs_1_3_c: 203.2 pix_abs_1_3_neon: 34.7 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: me_cmp: Avoid using the non-unrolled codepath for the minimum ↵	Martin Storsjö	2022-09-29
\| \| \| \| \| \|	unroll size Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: me_cmp: Avoid redundant loads in ff_pix_abs16_y2_neon	Martin Storsjö	2022-09-29
\| \| \| \| \| \| \| \| \| \| \| \|	This avoids one redundant load per row; pix3 from the previous iteration can be used as pix2 in the next one. Before: Cortex A53 A72 A73 pix_abs_0_2_neon: 138.0 59.7 48.0 After: pix_abs_0_2_neon: 109.7 50.2 39.5 Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/fmtconvert: Remove unused AVCodecContext parameter	Andreas Rheinhardt	2022-09-21
\| \| \| \| \| \| \|	Unused since d74a8cb7e42f703be5796eeb485f06af710ae8ca. Reviewed-by: Rémi Denis-Courmont <remi@remlab.net> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	lavc/aarch64: Add neon implementation for pix_median_abs8	Hubert Mazur	2022-09-21
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation for pix_median_abs8 function. Performance comparison tests are shown below. - median_sad_1_c: 277.0 - median_sad_1_neon: 82.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for vsad8_intra	Hubert Mazur	2022-09-21
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation for vsad8_intra function. Performance comparison tests are shown below. - vsad_5_c: 94.7 - vsad_5_neon: 20.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for pix_median_abs16	Hubert Mazur	2022-09-21
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation for pix_median_abs16 function. Performance comparison tests are shown below. - median_sad_0_c: 720.5 - median_sad_0_neon: 127.2 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/vorbisdsp: use ptrdiff_t rather than intptr_t	Rémi Denis-Courmont	2022-09-19
\| \| \| \|	... for a difference between pointers.
*	avcodec/vp8dsp: Constify src in vp8_mc_func	Andreas Rheinhardt	2022-09-11
\| \| \| \| \| \|	Reviewed-by: Peter Ross <pross@xvid.org> Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	lavc/aarch64: Provide neon implementation of nsse16	Hubert Mazur	2022-09-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add vectorized implementation of nsse16 function. Performance comparison tests are shown below. - nsse_0_c: 682.2 - nsse_0_neon: 116.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for vsse_intra16	Hubert Mazur	2022-09-09
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation for vsse_intra16 for arm64. Performance tests are shown below. - vsse_4_c: 155.2 - vsse_4_neon: 36.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for vsad_intra16	Hubert Mazur	2022-09-09
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation for vsad_intra16 function for arm64. Performance comparison tests are shown below. - vsad_4_c: 177.5 - vsad_4_neon: 23.5 Benchmarks and tests are run with checkasm tool on AWS Gravtion 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation of vsse16	Hubert Mazur	2022-09-09
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation of vsse16 for arm64. Performance comparison tests are shown below. - vsse_0_c: 257.7 - vsse_0_neon: 59.2 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for vsad16	Hubert Mazur	2022-09-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation of vsad16 function for arm64. Performance comparison tests are shown below. - vsad_0_c: 285.2 - vsad_0_neon: 39.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö <martin@martin.st> Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	arm/fft: disable NEON optimizations for 131072pt transforms	Lynne	2022-08-29
\| \| \| \| \| \| \| \| \|	This has been broken since the start, and it was only discovered when I started testing my replacement for the FFT. Disable it, since there's no point in fixing slower code that's about to be removed anyway. The vfp version is not affected.
*	lavc/aarch64: hevc_add_res add 12bit variants	J. Dekker	2022-08-18
\| \| \| \| \| \| \| \| \| \| \| \| \|	hevc_add_res_4x4_12_c: 46.0 hevc_add_res_4x4_12_neon: 18.7 hevc_add_res_8x8_12_c: 194.7 hevc_add_res_8x8_12_neon: 25.2 hevc_add_res_16x16_12_c: 716.0 hevc_add_res_16x16_12_neon: 69.7 hevc_add_res_32x32_12_c: 3820.7 hevc_add_res_32x32_12_neon: 261.0 Signed-off-by: J. Dekker <jdek@itanimul.li>
*	aarch64: me_cmp: Remove a leftover unnecessary instruction	Martin Storsjö	2022-08-18
\| \| \| \| \| \|	This was missed in a2e45ad407c526cd5ce2f3a361fb98084228cd6e. Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for pix_abs8	Hubert Mazur	2022-08-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation of pix_abs8 function for arm64. Performance comparison tests are shown below. - pix_abs_1_0_c: 101.2 - pix_abs_1_0_neon: 22.5 - sad_1_c: 101.2 - sad_1_neon: 22.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for sse8	Hubert Mazur	2022-08-18
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation of sse8 function for arm64. Performance comparison tests are shown below. - sse_1_c: 130.7 - sse_1_neon: 29.7 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for pix_abs16_y2	Hubert Mazur	2022-08-18
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide optimized implementation of pix_abs16_y2 function for arm64. Performance comparison tests are shown below. pix_abs_0_2_c: 317.2 pix_abs_0_2_neon: 37.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for sse4	Hubert Mazur	2022-08-18
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide neon implementation for sse4 function. Performance comparison tests are shown below. - sse_2_c: 80.7 - sse_2_neon: 31.0 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add neon implementation for sse16	Hubert Mazur	2022-08-18
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide neon implementation for sse16 function. Performance comparison tests are shown below. - sse_0_c: 268.2 - sse_0_neon: 43.5 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: me_cmp: Fix the indentation of function declarations	Martin Storsjö	2022-08-18
\| \| \| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: reformat add_res funcs	J. Dekker	2022-08-16
\| \| \| \|	Signed-off-by: J. Dekker <jdek@itanimul.li>
*	avcodec/h264chroma: Constify src in h264_chroma_mc_func	Andreas Rheinhardt	2022-08-05
\| \| \| \|	Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	avcodec/hevcdsp: Constify src pointers	Andreas Rheinhardt	2022-08-05
\| \| \| \|	Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	avcodec/me_cmp: Constify me_cmp_func buffer parameters	Andreas Rheinhardt	2022-07-31
\| \| \| \|	Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	avcodec/videodsp: Constify buf in VideoDSPContext.prefetch	Andreas Rheinhardt	2022-07-31
\| \| \| \|	Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	aarch64: me_cmp: Don't do uaddlv once per iteration	Martin Storsjö	2022-07-16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The max height is currently documented as 16; the max difference per pixel is 255, and a .8h element can easily contain 16*255, thus keep accumulating in two .8h vectors, and just do the final accumulationat the end. This should work for heights up to 256. This requires a minor register renumbering in ff_pix_abs16_xy2_neon. Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_0_neon: 97.7 47.0 37.5 22.7 pix_abs_0_1_neon: 154.0 59.0 52.0 25.0 pix_abs_0_3_neon: 179.7 96.7 87.5 41.2 After: pix_abs_0_0_neon: 96.0 39.2 31.2 22.0 pix_abs_0_1_neon: 150.7 59.7 46.2 23.7 pix_abs_0_3_neon: 175.7 83.7 81.7 38.2 Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon	Martin Storsjö	2022-07-16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Using absolute-difference-accumulate does use twice the amount of absolute-difference instructions, but avoids the need for the uaddl and add instructions, reducing the total number of instructions by 3. These can be interleaved in the rest of the calculation, to avoid tight dependencies at the end. Unfortunately, this is marginally slower on Cortex A53, but faster on A72 and A73. Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_3_neon: 175.7 109.2 92.0 41.2 After: pix_abs_0_3_neon: 179.7 96.7 87.5 41.2 Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon	Martin Storsjö	2022-07-16
\| \| \| \| \| \| \| \| \|	Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_3_neon: 183.7 112.7 97.5 41.2 After: pix_abs_0_3_neon: 175.7 109.2 92.0 41.2 Signed-off-by: Martin Storsjö <martin@martin.st>
*	libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon	Martin Storsjö	2022-07-16
\| \| \| \| \| \|	Checkasm doesn't currently test this codepath. Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add pix_abs16_x2 neon implementation	Hubert Mazur	2022-07-13
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide neon implementation for pix_abs16_x2 function. Performance tests of implementation are below. - pix_abs_0_1_c: 283.5 - pix_abs_0_1_neon: 39.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Hook up the existing ff_pix_abs16_neon to the sad[0] function ↵	Hubert Mazur	2022-07-11
\| \| \| \| \| \| \|	pointer Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: motion estimation functions in neon	Swinney, Jonathan	2022-06-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- ff_pix_abs16_neon - ff_pix_abs16_xy2_neon In direct micro benchmarks of these ff functions verses their C implementations, these functions performed as follows on AWS Graviton 3. ff_pix_abs16_neon: pix_abs_0_0_c: 141.1 pix_abs_0_0_neon: 19.6 ff_pix_abs16_xy2_neon: pix_abs_0_3_c: 269.1 pix_abs_0_3_neon: 39.3 Tested with: ./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: hevc_sao reschedule slightly	J. Dekker	2022-05-26
\| \| \| \|	Signed-off-by: J. Dekker <jdek@itanimul.li>
*	lavc/aarch64: add hevc sao edge 8x8	J. Dekker	2022-05-25
\| \| \| \| \| \| \| \| \|	bench on AWS Graviton: hevc_sao_edge_8x8_8_c: 516.0 hevc_sao_edge_8x8_8_neon: 81.0 Signed-off-by: J. Dekker <jdek@itanimul.li>
*	lavc/aarch64: add hevc sao edge 16x16	J. Dekker	2022-05-25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bench on AWS Graviton: hevc_sao_edge_16x16_8_c: 1857.0 hevc_sao_edge_16x16_8_neon: 211.0 hevc_sao_edge_32x32_8_c: 7802.2 hevc_sao_edge_32x32_8_neon: 808.2 hevc_sao_edge_48x48_8_c: 16764.2 hevc_sao_edge_48x48_8_neon: 1796.5 hevc_sao_edge_64x64_8_c: 32647.5 hevc_sao_edge_64x64_8_neon: 3118.5 Signed-off-by: J. Dekker <jdek@itanimul.li>
*	lavc/aarch64: fix hevc sao band filter	J. Dekker	2022-05-25
\| \| \| \| \| \| \|	The SAO band filter can be called with non-multiples of 8, we round up to the nearest multiple of 8 to account for this. Signed-off-by: J. Dekker <jdek@itanimul.li>
*	arm64: Fix wrong BTI landing pad	Andre Kempe	2022-04-26
\| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes a wrong type of BTI landing pad when branching to functions instantiated via the fft*_neon macro. Although the previously employed paciasp instruction serves as a landing pad, for the ways that this function is invoked it is the wrong type, resulting in an unexpected termination of the running process. Signed-off-by: André Kempe <andre.kempe@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/vc1: Arm 64-bit NEON unescape fast path	Ben Avison	2022-04-01
\| \| \| \| \| \| \| \| \| \|	checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_unescape_buffer_c: 655617.7 vc1dsp.vc1_unescape_buffer_neon: 118237.0 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths	Ben Avison	2022-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. idctdsp.add_pixels_clamped_c: 313.3 idctdsp.add_pixels_clamped_neon: 24.3 idctdsp.put_pixels_clamped_c: 220.3 idctdsp.put_pixels_clamped_neon: 15.5 idctdsp.put_signed_pixels_clamped_c: 210.5 idctdsp.put_signed_pixels_clamped_neon: 19.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/vc1: Arm 64-bit NEON inverse transform fast paths	Ben Avison	2022-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_inv_trans_4x4_c: 158.2 vc1dsp.vc1_inv_trans_4x4_neon: 65.7 vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5 vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5 vc1dsp.vc1_inv_trans_4x8_c: 335.2 vc1dsp.vc1_inv_trans_4x8_neon: 106.2 vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2 vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5 vc1dsp.vc1_inv_trans_8x4_c: 365.7 vc1dsp.vc1_inv_trans_8x4_neon: 97.2 vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7 vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5 vc1dsp.vc1_inv_trans_8x8_c: 547.7 vc1dsp.vc1_inv_trans_8x8_neon: 137.0 vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2 vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths	Ben Avison	2022-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C version can still outperform the NEON version in specific cases. The balance between different code paths is stream-dependent, but in practice the best case happens about 5% of the time, the worst case happens about 40% of the time, and the complexity of the remaining cases fall somewhere in between. Therefore, taking the average of the best and worst case timings is probably a conservative estimate of the degree by which the NEON code improves performance. vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7 vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5 vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5 vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7 vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2 vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2 vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2 vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2 vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0 vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7 vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7 vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5 vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7 vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0 vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7 vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0 vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2 vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7 vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0 vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2 vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0 vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0 vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2 vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>