libav.git - [no description]

	Commit message	Author	Age
*	avcodec/me_cmp: Constify me_cmp_func buffer parameters	Andreas Rheinhardt	2022-07-31
\| \| \| \|	Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	avcodec/videodsp: Constify buf in VideoDSPContext.prefetch	Andreas Rheinhardt	2022-07-31
\| \| \| \|	Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	aarch64: me_cmp: Don't do uaddlv once per iteration	Martin Storsjö	2022-07-16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The max height is currently documented as 16; the max difference per pixel is 255, and a .8h element can easily contain 16*255, thus keep accumulating in two .8h vectors, and just do the final accumulation at the end. This should work for heights up to 256. This requires a minor register renumbering in ff_pix_abs16_xy2_neon. Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_0_neon: 97.7 47.0 37.5 22.7 pix_abs_0_1_neon: 154.0 59.0 52.0 25.0 pix_abs_0_3_neon: 179.7 96.7 87.5 41.2 After: pix_abs_0_0_neon: 96.0 39.2 31.2 22.0 pix_abs_0_1_neon: 150.7 59.7 46.2 23.7 pix_abs_0_3_neon: 175.7 83.7 81.7 38.2 Signed-off-by: Martin Storsjö <martin@martin.st>
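The headroom argument in the commit message above can be checked with a little arithmetic. What follows is a scalar C sketch, not the NEON code itself; both function names are hypothetical, and the model assumes that each 16-bit lane receives exactly one absolute difference per row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: number of 8-bit absolute differences (each at most
 * 255) that one unsigned 16-bit lane can sum before it could wrap around. */
static int max_rows_per_u16_lane(void)
{
    return UINT16_MAX / 255;
}

/* Hypothetical scalar model of a single .8h lane: one absolute difference
 * is added per row, and the running total is kept in 16 bits only. */
static unsigned sad_lane_u16(const uint8_t *a, const uint8_t *b,
                             ptrdiff_t stride, int h)
{
    uint16_t acc = 0;

    assert(h >= 0 && h <= max_rows_per_u16_lane());
    for (int y = 0; y < h; y++)
        acc = (uint16_t)(acc + abs(a[y * stride] - b[y * stride]));
    return acc;
}
```

Under this model the bound works out to 257 rows per lane, so the "heights up to 256" figure quoted in the commit message sits just inside it, and the documented maximum of 16 rows (16*255 = 4080) leaves ample room.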
*	aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon	Martin Storsjö	2022-07-16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Using absolute-difference-accumulate does use twice the amount of absolute-difference instructions, but avoids the need for the uaddl and add instructions, reducing the total number of instructions by 3. These can be interleaved in the rest of the calculation, to avoid tight dependencies at the end. Unfortunately, this is marginally slower on Cortex A53, but faster on A72 and A73. Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_3_neon: 175.7 109.2 92.0 41.2 After: pix_abs_0_3_neon: 179.7 96.7 87.5 41.2 Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon	Martin Storsjö	2022-07-16
\| \| \| \| \| \| \| \| \|	Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_3_neon: 183.7 112.7 97.5 41.2 After: pix_abs_0_3_neon: 175.7 109.2 92.0 41.2 Signed-off-by: Martin Storsjö <martin@martin.st>
*	libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon	Martin Storsjö	2022-07-16
\| \| \| \| \| \|	Checkasm doesn't currently test this codepath. Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Add pix_abs16_x2 neon implementation	Hubert Mazur	2022-07-13
\| \| \| \| \| \| \| \| \| \| \| \| \|	Provide neon implementation for pix_abs16_x2 function. Performance tests of implementation are below. - pix_abs_0_1_c: 283.5 - pix_abs_0_1_neon: 39.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: Hook up the existing ff_pix_abs16_neon to the sad[0] function pointer	Hubert Mazur	2022-07-11
\| \| \| \| \| \| \|	Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: motion estimation functions in neon	Swinney, Jonathan	2022-06-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- ff_pix_abs16_neon - ff_pix_abs16_xy2_neon In direct micro benchmarks of these ff functions versus their C implementations, these functions performed as follows on AWS Graviton 3. ff_pix_abs16_neon: pix_abs_0_0_c: 141.1 pix_abs_0_0_neon: 19.6 ff_pix_abs16_xy2_neon: pix_abs_0_3_c: 269.1 pix_abs_0_3_neon: 39.3 Tested with: ./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>
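For readers unfamiliar with what these motion estimation functions compute, here is a scalar C sketch of a sum of absolute differences (SAD) over a 16-pixel-wide block. It is an illustration only, not the project's C reference code: the names sad16_ref and sad16_xy2_ref are hypothetical, and the half-pel variant assumes that "xy2" means comparing against the rounded average of four neighbouring reference pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference: SAD over a block 16 pixels wide and
 * h rows tall, with both buffers sharing the same stride. */
static int sad16_ref(const uint8_t *cur, const uint8_t *ref,
                     ptrdiff_t stride, int h)
{
    int sum = 0;

    assert(h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(cur[x] - ref[x]);
        cur += stride;
        ref += stride;
    }
    return sum;
}

/* Assumed half-pel (x and y) variant: each reference sample is the rounded
 * average of a 2x2 neighbourhood, so this reads 17 columns and h + 1 rows
 * of the reference buffer. */
static int sad16_xy2_ref(const uint8_t *cur, const uint8_t *ref,
                         ptrdiff_t stride, int h)
{
    const uint8_t *below = ref + stride;
    int sum = 0;

    assert(h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int avg = (ref[x] + ref[x + 1] + below[x] + below[x + 1] + 2) >> 2;
            sum += abs(cur[x] - avg);
        }
        cur   += stride;
        ref   += stride;
        below += stride;
    }
    return sum;
}
```

The inner loop over 16 bytes is the part that maps naturally onto 128-bit vector registers, which is consistent with the large gap between the C and NEON timings quoted above.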
*	lavc/aarch64: hevc_sao reschedule slightly	J. Dekker	2022-05-26
\| \| \| \|	Signed-off-by: J. Dekker <jdek@itanimul.li>
*	lavc/aarch64: add hevc sao edge 8x8	J. Dekker	2022-05-25
\| \| \| \| \| \| \| \| \|	bench on AWS Graviton: hevc_sao_edge_8x8_8_c: 516.0 hevc_sao_edge_8x8_8_neon: 81.0 Signed-off-by: J. Dekker <jdek@itanimul.li>
*	lavc/aarch64: add hevc sao edge 16x16	J. Dekker	2022-05-25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bench on AWS Graviton: hevc_sao_edge_16x16_8_c: 1857.0 hevc_sao_edge_16x16_8_neon: 211.0 hevc_sao_edge_32x32_8_c: 7802.2 hevc_sao_edge_32x32_8_neon: 808.2 hevc_sao_edge_48x48_8_c: 16764.2 hevc_sao_edge_48x48_8_neon: 1796.5 hevc_sao_edge_64x64_8_c: 32647.5 hevc_sao_edge_64x64_8_neon: 3118.5 Signed-off-by: J. Dekker <jdek@itanimul.li>
*	lavc/aarch64: fix hevc sao band filter	J. Dekker	2022-05-25
\| \| \| \| \| \| \|	The SAO band filter can be called with non-multiples of 8, we round up to the nearest multiple of 8 to account for this. Signed-off-by: J. Dekker <jdek@itanimul.li>
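The rounding mentioned in the fix above is a common idiom. A minimal C sketch follows; the helper name is hypothetical and this is not the assembly from the commit.

```c
#include <assert.h>

/* Hypothetical helper: round a non-negative width up to the next
 * multiple of 8. Adding 7 and clearing the low three bits does this
 * without a division. */
static int round_up_to_8(int width)
{
    assert(width >= 0);
    return (width + 7) & ~7;
}
```

With this, a width of 12 would be processed as 16. Note that processing a rounded-up width touches up to 7 columns past the requested one; whether the buffers involved are padded for that is not stated in the commit message and is an assumption here.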
*	arm64: Fix wrong BTI landing pad	Andre Kempe	2022-04-26
\| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes a wrong type of BTI landing pad when branching to functions instantiated via the fft*_neon macro. Although the previously employed paciasp instruction serves as a landing pad, for the ways that this function is invoked it is the wrong type, resulting in an unexpected termination of the running process. Signed-off-by: André Kempe <andre.kempe@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/vc1: Arm 64-bit NEON unescape fast path	Ben Avison	2022-04-01
\| \| \| \| \| \| \| \| \| \|	checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_unescape_buffer_c: 655617.7 vc1dsp.vc1_unescape_buffer_neon: 118237.0 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths	Ben Avison	2022-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. idctdsp.add_pixels_clamped_c: 313.3 idctdsp.add_pixels_clamped_neon: 24.3 idctdsp.put_pixels_clamped_c: 220.3 idctdsp.put_pixels_clamped_neon: 15.5 idctdsp.put_signed_pixels_clamped_c: 210.5 idctdsp.put_signed_pixels_clamped_neon: 19.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/vc1: Arm 64-bit NEON inverse transform fast paths	Ben Avison	2022-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_inv_trans_4x4_c: 158.2 vc1dsp.vc1_inv_trans_4x4_neon: 65.7 vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5 vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5 vc1dsp.vc1_inv_trans_4x8_c: 335.2 vc1dsp.vc1_inv_trans_4x8_neon: 106.2 vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2 vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5 vc1dsp.vc1_inv_trans_8x4_c: 365.7 vc1dsp.vc1_inv_trans_8x4_neon: 97.2 vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7 vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5 vc1dsp.vc1_inv_trans_8x8_c: 547.7 vc1dsp.vc1_inv_trans_8x8_neon: 137.0 vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2 vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths	Ben Avison	2022-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C version can still outperform the NEON version in specific cases. The balance between different code paths is stream-dependent, but in practice the best case happens about 5% of the time, the worst case happens about 40% of the time, and the complexity of the remaining cases falls somewhere in between. Therefore, taking the average of the best and worst case timings is probably a conservative estimate of the degree by which the NEON code improves performance. vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7 vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5 vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5 vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7 vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2 vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2 vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2 vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2 vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0 vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7 vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7 vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5 vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7 vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0 vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7 vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0 vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2 vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7 vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0 vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2 vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0 vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0 vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2 vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
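The estimate described in the commit message above (average the best and worst case timings, then compare C against NEON) is simple enough to write down. A minimal C sketch, with a hypothetical helper name:

```c
#include <assert.h>

/* Hypothetical helper: speedup estimate obtained by averaging the best
 * and worst case timings of each implementation and taking the ratio
 * (C over NEON). Values above 1.0 mean the NEON code is faster. */
static double avg_case_speedup(double best_c, double worst_c,
                               double best_neon, double worst_neon)
{
    double avg_c    = (best_c + worst_c) / 2.0;
    double avg_neon = (best_neon + worst_neon) / 2.0;

    assert(avg_neon > 0.0);
    return avg_c / avg_neon;
}
```

Fed with the vc1_h_loop_filter4 figures quoted above, avg_case_speedup(10.7, 184.5, 43.5, 73.7) comes out at roughly 1.67, even though the C version wins the best case on its own. The commit message calls this estimate conservative because the worst case (40%) is reported to occur far more often than the best case (5%).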
*	configure: Use a separate config_components.h header for $ALL_COMPONENTS	Martin Storsjö	2022-03-16
\| \| \| \| \| \| \| \|	This avoids unnecessary rebuilds of most source files if only the list of enabled components has changed, but not the other properties of the build, set in config.h. Signed-off-by: Martin Storsjö <martin@martin.st>
*	arm64: Add Armv8.3-A PAC support to assembly files	Andre Kempe	2022-03-09
\| \| \| \| \| \| \| \| \| \| \|	This patch adds optional support for Arm Pointer Authentication Codes. PAC support is turned on or off at compile time using additional compiler flags. Unless any of these is enabled explicitly, no additional code will be emitted at all. Signed-off-by: André Kempe <andre.kempe@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/aarch64/idct: Add missing stddef	Andreas Rheinhardt	2022-02-21
\| \| \| \| \| \|	Fixes checkheaders on aarch64. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	aarch64: h264dsp: Fix incorrectly indented code	Martin Storsjö	2022-02-11
\| \| \| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: Disable ff_hevc_sao_band_filter_8x8_8_neon out of precaution	Martin Storsjö	2022-01-07
\| \| \| \| \| \| \| \| \| \|	While this function on its own passes all of fate-hevc, there are indications that the function might need to handle widths that aren't a multiple of 8 (noted in commit f63f9be37c799ddc835af358034630d31fb7db02, which later was reverted). Signed-off-by: Martin Storsjö <martin@martin.st>
*	Revert "lavc/aarch64: add hevc sao edge 16x16"	Martin Storsjö	2022-01-07
\| \| \| \| \| \| \|	This reverts commit a9214a2ca31c9d54f893c5ac4004a5ff30a08d10, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>
*	Revert "lavc/aarch64: add hevc sao edge 8x8"	Martin Storsjö	2022-01-07
\| \| \| \| \| \| \|	This reverts commit c97ffc1a77ccaf901e642bd21ed26aaf75557745, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>
*	Revert "lavc/aarch64: add hevc sao band 8x8 tiling"	Martin Storsjö	2022-01-07
\| \| \| \| \| \| \|	This reverts commit f63f9be37c799ddc835af358034630d31fb7db02, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: add hevc sao band 8x8 tiling	J. Dekker	2022-01-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bench on AWS Graviton: hevc_sao_band_8x8_8_c: 317.5 hevc_sao_band_8x8_8_neon: 97.5 hevc_sao_band_16x16_8_c: 1115.0 hevc_sao_band_16x16_8_neon: 322.7 hevc_sao_band_32x32_8_c: 4599.2 hevc_sao_band_32x32_8_neon: 1246.2 hevc_sao_band_48x48_8_c: 10021.7 hevc_sao_band_48x48_8_neon: 2740.5 hevc_sao_band_64x64_8_c: 17635.0 hevc_sao_band_64x64_8_neon: 4875.7 Signed-off-by: J. Dekker <jdek@itanimul.li>
*	lavc/aarch64: clean-up sao band 8x8 function formatting	J. Dekker	2022-01-04
\| \| \| \|	Signed-off-by: J. Dekker <jdek@itanimul.li>
*	lavc/aarch64: add hevc sao edge 8x8	J. Dekker	2022-01-04
\| \| \| \| \| \| \| \| \|	bench on AWS Graviton: hevc_sao_edge_8x8_8_c: 516.0 hevc_sao_edge_8x8_8_neon: 81.0 Signed-off-by: J. Dekker <jdek@itanimul.li>
*	lavc/aarch64: add hevc sao edge 16x16	J. Dekker	2022-01-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bench on AWS Graviton: hevc_sao_edge_16x16_8_c: 1857.0 hevc_sao_edge_16x16_8_neon: 211.0 hevc_sao_edge_32x32_8_c: 7802.2 hevc_sao_edge_32x32_8_neon: 808.2 hevc_sao_edge_48x48_8_c: 16764.2 hevc_sao_edge_48x48_8_neon: 1796.5 hevc_sao_edge_64x64_8_c: 32647.5 hevc_sao_edge_64x64_8_neon: 3118.5 Signed-off-by: J. Dekker <jdek@itanimul.li>
*	aarch64: Add Armv8.5-A BTI support	Jonathan Wright	2021-11-16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add Branch Target Identifiers (BTIs) to all functions defined in AArch64 assembly files. Most of the BTI landing pads are added automatically by the 'function' macro. BTI support is turned on or off at compile time based on the presence of the __ARM_FEATURE_BTI_DEFAULT feature macro. A binary compiled with BTI support can be executed on an Armv8-A processor without BTI support because the instructions are defined in NOP space. Signed-off-by: Jonathan Wright <jonathan.wright@arm.com> Signed-off-by: Elijah Ahmad <elijah.ahmad@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>
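The compile-time switch mentioned above can be observed from C as well as from assembly. The sketch below only tests the __ARM_FEATURE_BTI_DEFAULT macro named in the commit message; the function name is hypothetical, and this is not how the project's 'function' macro is implemented.

```c
#include <assert.h>

/* Hypothetical helper: reports whether this translation unit was compiled
 * with BTI enabled by default, judged solely by the feature macro the
 * compiler predefines in that case. */
static int built_with_bti(void)
{
#if defined(__ARM_FEATURE_BTI_DEFAULT) && __ARM_FEATURE_BTI_DEFAULT == 1
    return 1;
#else
    return 0;
#endif
}
```

On compilers that support it, the macro is typically enabled through a branch-protection option such as -mbranch-protection=bti; on other targets or without that option the helper simply returns 0.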
*	aarch64: Use ret x<n> instead of br x<n> where possible	Jonathan Wright	2021-11-16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Change AArch64 assembly code to use: ret x<n> instead of: br x<n> "ret x<n>" is already used in a lot of places so this patch makes it consistent across the code base. This does not change behavior or performance. In addition, this change reduces the number of landing pads needed in a subsequent patch to support the Armv8.5-A Branch Target Identification (BTI) security feature. Signed-off-by: Jonathan Wright <jonathan.wright@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: h264qpel: Do vertical filtering without transposing	Martin Storsjö	2021-10-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This gives rather big speedups on these functions: Before: put_h264_qpel_8_mc01_8_neon: 241.0 131.5 138.7 put_h264_qpel_8_mc02_8_neon: 214.7 121.2 127.5 put_h264_qpel_8_mc03_8_neon: 242.5 131.2 135.7 put_h264_qpel_8_mc11_8_neon: 421.2 218.7 251.0 put_h264_qpel_8_mc12_8_neon: 878.0 509.5 537.5 put_h264_qpel_8_mc13_8_neon: 423.7 217.0 252.0 put_h264_qpel_8_mc21_8_neon: 858.2 479.5 514.0 put_h264_qpel_8_mc22_8_neon: 649.7 385.2 403.0 put_h264_qpel_8_mc23_8_neon: 860.2 476.5 517.7 put_h264_qpel_8_mc31_8_neon: 437.2 219.5 252.5 put_h264_qpel_8_mc32_8_neon: 892.5 510.5 546.0 put_h264_qpel_8_mc33_8_neon: 438.2 218.5 257.0 put_h264_qpel_16_mc01_8_neon: 944.2 509.7 546.7 put_h264_qpel_16_mc02_8_neon: 878.7 469.5 509.7 put_h264_qpel_16_mc03_8_neon: 945.7 510.7 557.0 put_h264_qpel_16_mc11_8_neon: 1663.2 858.5 979.5 put_h264_qpel_16_mc12_8_neon: 3510.2 2027.7 2112.7 put_h264_qpel_16_mc13_8_neon: 1664.7 857.5 980.5 put_h264_qpel_16_mc21_8_neon: 3366.2 1928.5 2030.5 put_h264_qpel_16_mc22_8_neon: 2584.7 1514.7 1590.2 put_h264_qpel_16_mc23_8_neon: 3367.7 1927.7 2035.0 put_h264_qpel_16_mc31_8_neon: 1716.7 849.7 997.0 put_h264_qpel_16_mc32_8_neon: 3564.0 2044.2 3835.2 put_h264_qpel_16_mc33_8_neon: 1717.7 863.0 989.5 After: put_h264_qpel_8_mc01_8_neon: 136.0 73.7 76.0 put_h264_qpel_8_mc02_8_neon: 108.7 65.0 64.0 put_h264_qpel_8_mc03_8_neon: 137.5 72.7 73.0 put_h264_qpel_8_mc11_8_neon: 316.2 159.0 188.5 put_h264_qpel_8_mc12_8_neon: 653.0 375.5 384.7 put_h264_qpel_8_mc13_8_neon: 318.7 165.5 189.5 put_h264_qpel_8_mc21_8_neon: 739.2 385.7 432.5 put_h264_qpel_8_mc22_8_neon: 530.7 295.5 309.5 put_h264_qpel_8_mc23_8_neon: 741.2 393.7 421.0 put_h264_qpel_8_mc31_8_neon: 332.2 162.5 190.0 put_h264_qpel_8_mc32_8_neon: 667.5 378.2 390.5 put_h264_qpel_8_mc33_8_neon: 332.7 166.5 195.5 put_h264_qpel_16_mc01_8_neon: 524.2 285.2 294.0 put_h264_qpel_16_mc02_8_neon: 454.7 252.2 250.2 put_h264_qpel_16_mc03_8_neon: 525.7 286.0 283.0 put_h264_qpel_16_mc11_8_neon: 1243.2 630.7 726.7 put_h264_qpel_16_mc12_8_neon: 2610.2 1479.7 1481.2 put_h264_qpel_16_mc13_8_neon: 1250.5 631.7 727.7 put_h264_qpel_16_mc21_8_neon: 2890.2 1571.2 1679.7 put_h264_qpel_16_mc22_8_neon: 2108.7 1177.5 1223.5 put_h264_qpel_16_mc23_8_neon: 2891.7 1578.7 1667.7 put_h264_qpel_16_mc31_8_neon: 1296.7 630.5 752.5 put_h264_qpel_16_mc32_8_neon: 2664.0 1483.2 1503.5 put_h264_qpel_16_mc33_8_neon: 1297.7 632.5 747.2 I.e. overall a 20%-60% reduction in runtime of these functions. Signed-off-by: Martin Storsjö <martin@martin.st>
*	arm/aarch64: Improve scheduling in the avg form of h264_qpel	Martin Storsjö	2021-10-18
\| \| \| \| \| \| \|	Don't use the loaded registers directly, avoiding stalls on in-order cores. Use vrhadd.u8 with q registers where easily possible. Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: fix relocation out of range error	Zhao Zhili	2021-09-25
\| \| \| \| \| \|	Use a temporary label instead of global function symbol for b.gt. Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: add pred functions for 10-bit	Mikhail Nitenko	2021-08-21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Benchmarks: A53 A72 pred8x8_dc_10_c: 64.2 49.5 pred8x8_dc_10_neon: 62.0 53.7 pred8x8_dc_128_10_c: 26.0 14.0 pred8x8_dc_128_10_neon: 30.7 17.5 pred8x8_horizontal_10_c: 60.0 27.7 pred8x8_horizontal_10_neon: 38.0 34.0 pred8x8_left_dc_10_c: 42.5 27.5 pred8x8_left_dc_10_neon: 51.0 41.2 pred8x8_mad_cow_dc_0l0_10_c: 55.7 37.2 pred8x8_mad_cow_dc_0l0_10_neon: 50.2 35.2 pred8x8_mad_cow_dc_0lt_10_c: 89.2 67.0 pred8x8_mad_cow_dc_0lt_10_neon: 52.2 46.7 pred8x8_mad_cow_dc_l0t_10_c: 74.7 51.0 pred8x8_mad_cow_dc_l0t_10_neon: 50.5 45.2 pred8x8_mad_cow_dc_l00_10_c: 58.0 38.0 pred8x8_mad_cow_dc_l00_10_neon: 42.5 37.5 pred8x8_plane_10_c: 354.0 288.7 pred8x8_plane_10_neon: 141.0 101.2 pred8x8_top_dc_10_c: 44.5 30.5 pred8x8_top_dc_10_neon: 40.0 31.0 pred8x8_vertical_10_c: 27.5 14.5 pred8x8_vertical_10_neon: 21.0 17.5 pred16x16_plane_10_c: 1242.0 1070.5 pred16x16_plane_10_neon: 324.0 196.7 Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: h264, add chroma loop filters for 10bit	Mikhail Nitenko	2021-08-21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Benchmarks: A53 A72 h264_h_loop_filter_chroma422_10bpp_c: 282.7 114.2 h264_h_loop_filter_chroma422_10bpp_neon: 109.5 78.5 h264_h_loop_filter_chroma_10bpp_c: 165.0 81.5 h264_h_loop_filter_chroma_10bpp_neon: 120.0 76.7 h264_h_loop_filter_chroma_intra422_10bpp_c: 323.7 124.2 h264_h_loop_filter_chroma_intra422_10bpp_neon: 155.0 102.7 h264_h_loop_filter_chroma_intra_10bpp_c: 121.0 49.5 h264_h_loop_filter_chroma_intra_10bpp_neon: 79.7 53.7 h264_h_loop_filter_chroma_mbaff422_10bpp_c: 188.5 75.0 h264_h_loop_filter_chroma_mbaff422_10bpp_neon: 120.0 75.5 h264_h_loop_filter_chroma_mbaff_intra422_10bpp_c: 116.7 46.0 h264_h_loop_filter_chroma_mbaff_intra422_10bpp_neon: 79.7 53.7 h264_h_loop_filter_chroma_mbaff_intra_10bpp_c: 63.0 27.2 h264_h_loop_filter_chroma_mbaff_intra_10bpp_neon: 48.5 34.0 h264_v_loop_filter_chroma_10bpp_c: 258.7 135.5 h264_v_loop_filter_chroma_10bpp_neon: 71.2 51.0 h264_v_loop_filter_chroma_intra_10bpp_c: 158.0 70.7 h264_v_loop_filter_chroma_intra_10bpp_neon: 48.7 31.5 Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aarch64: move transpose_4x8H to neon.S	Mikhail Nitenko	2021-08-21
\| \| \| \| \| \| \| \|	transpose_4x8H was declared in vp9lpf_16bpp_neon, however this macro is not unique to vp9 and could be used elsewhere. Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: h264dsp: Fix indentation of some functions to match the rest	Martin Storsjö	2021-08-08
\| \| \| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
*	aarch64: h264dsp: Remove unnecessary sign extensions	Martin Storsjö	2021-08-08
\| \| \| \| \| \| \| \| \| \|	These became unnecessary when the stride arguments were changed from int to ptrdiff_t in bc26fe89275c267d169b468356c82ee59874407d (0576ef466d8a631326d1d0a5ec2e4c4c81d25353) and d5d699ab6e6f8a8290748d107416fd5c19757a1b (aa844dc46f93182a63ec0b53267d19e7342c79b9). Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec/h264dsp, h264idct: Fix lengths of array parameters	Andreas Rheinhardt	2021-08-08
\| \| \| \| \| \|	Fixes many -Warray-parameter warnings from GCC 11. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	aarch64: hevc_idct: Fix overflows in idct_dc	Martin Storsjö	2021-05-22
\| \| \| \| \| \| \| \|	This is marginally slower, but correct for all input values. The previous implementation failed with certain input seeds, e.g. "checkasm --test=hevc_idct 98". Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec: Remove deprecated old encode/decode APIs	Andreas Rheinhardt	2021-04-27
\| \| \| \| \| \| \| \|	Deprecated in commits 7fc329e2dd6226dfecaa4a1d7adf353bf2773726 and 31f6a4b4b83aca1d73f3cfc99ce2b39331970bf3. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com> Signed-off-by: James Almer <jamrial@gmail.com>
*	Include attributes.h directly	Andreas Rheinhardt	2021-04-19
\| \| \| \| \| \| \| \|	Some files currently rely on libavutil/cpu.h to include it for them; yet said file won't include it any more after the currently deprecated functions are removed, so include attributes.h directly. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
*	lavc/aarch64: add pred16x16 10-bit functions	Mikhail Nitenko	2021-04-19
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Benchmarks: A53 A72 pred16x16_dc_10_c: 136.0 124.0 pred16x16_dc_10_neon: 121.2 106.0 pred16x16_horizontal_10_c: 155.0 73.2 pred16x16_horizontal_10_neon: 82.2 67.7 pred16x16_top_dc_10_c: 106.0 93.7 pred16x16_top_dc_10_neon: 87.7 77.2 pred16x16_vertical_10_c: 83.0 67.7 pred16x16_vertical_10_neon: 54.2 61.7 Some functions work slower than C and are left commented out.
*	lavc/aarch64: change h264pred_init structure	Mikhail Nitenko	2021-04-19
\| \| \| \|	Change structure to allow the addition of other bit depths.
*	aarch64: h264pred: Optimize the inner loop of existing 8 bit functions	Martin Storsjö	2021-04-14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the loop counter decrement further from the branch instruction, this hides the latency of the decrement. In loops that first load, then store (the horizontal prediction cases), do the decrement after the load (where the next instruction would stall a bit anyway, waiting for the result of the load). In loops that store twice using the same destination register, also do the decrement between the two stores (as the second store would need to wait for the updated destination register from the first instruction). In loops that store twice to two different destination registers, do the decrement before both stores, to do it as soon before the branch as possible. This gives minor (1-2 cycle) speedups in most cases (modulo measurement noise), but the horizontal prediction functions get a rather notable speedup on the Cortex A53. 
Before: Cortex A53 A72 A73 pred8x8_dc_8_neon: 60.7 46.2 39.2 pred8x8_dc_128_8_neon: 30.7 18.0 14.0 pred8x8_horizontal_8_neon: 42.2 29.2 18.5 pred8x8_left_dc_8_neon: 52.7 36.2 32.2 pred8x8_mad_cow_dc_0l0_8_neon: 48.2 27.7 25.7 pred8x8_mad_cow_dc_0lt_8_neon: 52.5 33.2 34.7 pred8x8_mad_cow_dc_l0t_8_neon: 52.5 31.7 33.2 pred8x8_mad_cow_dc_l00_8_neon: 43.2 27.0 25.5 pred8x8_plane_8_neon: 112.2 86.2 88.2 pred8x8_top_dc_8_neon: 40.7 23.0 21.2 pred8x8_vertical_8_neon: 27.2 15.5 14.0 pred16x16_dc_8_neon: 91.0 73.2 70.5 pred16x16_dc_128_8_neon: 43.0 34.7 30.7 pred16x16_horizontal_8_neon: 86.0 49.7 44.7 pred16x16_left_dc_8_neon: 87.0 67.2 67.5 pred16x16_plane_8_neon: 236.0 175.7 173.0 pred16x16_top_dc_8_neon: 53.2 39.0 41.7 pred16x16_vertical_8_neon: 41.7 29.7 31.0 After: pred8x8_dc_8_neon: 59.0 46.7 42.5 pred8x8_dc_128_8_neon: 28.2 18.0 14.0 pred8x8_horizontal_8_neon: 34.2 29.2 18.5 pred8x8_left_dc_8_neon: 51.0 38.2 32.7 pred8x8_mad_cow_dc_0l0_8_neon: 46.7 28.2 26.2 pred8x8_mad_cow_dc_0lt_8_neon: 55.2 33.7 37.5 pred8x8_mad_cow_dc_l0t_8_neon: 51.2 31.7 37.2 pred8x8_mad_cow_dc_l00_8_neon: 41.7 27.5 26.0 pred8x8_plane_8_neon: 111.5 86.5 89.5 pred8x8_top_dc_8_neon: 39.0 23.2 21.0 pred8x8_vertical_8_neon: 27.2 16.0 14.0 pred16x16_dc_8_neon: 85.0 70.2 70.5 pred16x16_dc_128_8_neon: 42.0 30.0 30.7 pred16x16_horizontal_8_neon: 66.5 49.5 42.5 pred16x16_left_dc_8_neon: 81.0 66.5 67.5 pred16x16_plane_8_neon: 235.0 175.7 173.0 pred16x16_top_dc_8_neon: 52.0 39.0 41.7 pred16x16_vertical_8_neon: 40.2 33.2 31.0 Despite this, a number of these functions still are slower than what e.g. GCC 7 generates - this shows the relative speedup of the neon codepaths over the compiler generated ones: Cortex A53 A72 A73 pred8x8_dc_8_neon: 0.86 0.65 1.04 pred8x8_dc_128_8_neon: 0.59 0.44 0.62 pred8x8_horizontal_8_neon: 1.51 0.58 1.30 pred8x8_left_dc_8_neon: 0.72 0.56 0.89 pred8x8_mad_cow_dc_0l0_8_neon: 0.93 0.93 1.37 pred8x8_mad_cow_dc_0lt_8_neon: 1.37 1.41 1.68 pred8x8_mad_cow_dc_l0t_8_neon: 1.21 1.17 1.32 pred8x8_mad_cow_dc_l00_8_neon: 1.24 1.19 1.60 pred8x8_plane_8_neon: 3.36 3.58 3.76 pred8x8_top_dc_8_neon: 0.97 0.99 1.43 pred8x8_vertical_8_neon: 0.86 0.78 1.18 pred16x16_dc_8_neon: 1.20 1.06 1.49 pred16x16_dc_128_8_neon: 0.83 0.95 0.99 pred16x16_horizontal_8_neon: 1.78 0.96 1.59 pred16x16_left_dc_8_neon: 1.06 0.96 1.32 pred16x16_plane_8_neon: 5.78 6.49 7.19 pred16x16_top_dc_8_neon: 1.48 1.53 1.94 pred16x16_vertical_8_neon: 1.39 1.34 1.98 In particular, on Cortex A72, many of these functions are slower than the compiler generated code, while they're more beneficial on e.g. the Cortex A73. Signed-off-by: Martin Storsjö <martin@martin.st>
*	avcodec: add missing FF_API_OLD_ENCDEC wrappers to xmm clobber functions	James Almer	2021-02-26
\| \| \| \|	Signed-off-by: James Almer <jamrial@gmail.com>
*	lavc/aarch64: add HEVC sao_band NEON	Josh Dekker	2021-02-18
\| \| \| \| \| \|	Only works for 8x8. Signed-off-by: Josh Dekker <josh@itanimul.li>
*	lavc/aarch64: add HEVC idct_dc NEON	Josh Dekker	2021-02-18
\| \| \| \|	Signed-off-by: Josh Dekker <josh@itanimul.li>