summaryrefslogtreecommitdiff
path: root/libavcodec/aarch64
Commit message (Collapse)AuthorAge
* avcodec/me_cmp: Constify me_cmp_func buffer parametersAndreas Rheinhardt2022-07-31
| | | | Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* avcodec/videodsp: Constify buf in VideoDSPContext.prefetchAndreas Rheinhardt2022-07-31
| | | | Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* aarch64: me_cmp: Don't do uaddlv once per iterationMartin Storsjö2022-07-16
| | | | | | | | | | | | | | | | | | | | The max height is currently documented as 16; the max difference per pixel is 255, and a .8h element can easily contain 16*255, thus keep accumulating in two .8h vectors, and just do the final accumulationat the end. This should work for heights up to 256. This requires a minor register renumbering in ff_pix_abs16_xy2_neon. Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_0_neon: 97.7 47.0 37.5 22.7 pix_abs_0_1_neon: 154.0 59.0 52.0 25.0 pix_abs_0_3_neon: 179.7 96.7 87.5 41.2 After: pix_abs_0_0_neon: 96.0 39.2 31.2 22.0 pix_abs_0_1_neon: 150.7 59.7 46.2 23.7 pix_abs_0_3_neon: 175.7 83.7 81.7 38.2 Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neonMartin Storsjö2022-07-16
| | | | | | | | | | | | | | | | | | Using absolute-difference-accumulate does use twice the amount of absolute-difference instructions, but avoids the need for the uaddl and add instructions, reducing the total number of instructions by 3. These can be interleaved in the rest of the calculation, to avoid tight dependencies at the end. Unfortunately, this is marginally slower on Cortex A53, but faster on A72 and A73. Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_3_neon: 175.7 109.2 92.0 41.2 After: pix_abs_0_3_neon: 179.7 96.7 87.5 41.2 Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neonMartin Storsjö2022-07-16
| | | | | | | | | Before: Cortex A53 A72 A73 Graviton 3 pix_abs_0_3_neon: 183.7 112.7 97.5 41.2 After: pix_abs_0_3_neon: 175.7 109.2 92.0 41.2 Signed-off-by: Martin Storsjö <martin@martin.st>
* libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neonMartin Storsjö2022-07-16
| | | | | | Checkasm doesn't currently test this codepath. Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/aarch64: Add pix_abs16_x2 neon implementationHubert Mazur2022-07-13
| | | | | | | | | | | | | Provide neon implementation for pix_abs16_x2 function. Performance tests of implementation are below. - pix_abs_0_1_c: 283.5 - pix_abs_0_1_neon: 39.0 Benchmarks and tests run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/aarch64: Hook up the existing ff_pix_abs16_neon to the sad[0] function ↵Hubert Mazur2022-07-11
| | | | | | | pointer Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/aarch64: motion estimation functions in neonSwinney, Jonathan2022-06-28
| | | | | | | | | | | | | | | | | | | | | | - ff_pix_abs16_neon - ff_pix_abs16_xy2_neon In direct micro benchmarks of these ff functions verses their C implementations, these functions performed as follows on AWS Graviton 3. ff_pix_abs16_neon: pix_abs_0_0_c: 141.1 pix_abs_0_0_neon: 19.6 ff_pix_abs16_xy2_neon: pix_abs_0_3_c: 269.1 pix_abs_0_3_neon: 39.3 Tested with: ./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/aarch64: hevc_sao reschedule slightlyJ. Dekker2022-05-26
| | | | Signed-off-by: J. Dekker <jdek@itanimul.li>
* lavc/aarch64: add hevc sao edge 8x8J. Dekker2022-05-25
| | | | | | | | | bench on AWS Graviton: hevc_sao_edge_8x8_8_c: 516.0 hevc_sao_edge_8x8_8_neon: 81.0 Signed-off-by: J. Dekker <jdek@itanimul.li>
* lavc/aarch64: add hevc sao edge 16x16J. Dekker2022-05-25
| | | | | | | | | | | | | | | bench on AWS Graviton: hevc_sao_edge_16x16_8_c: 1857.0 hevc_sao_edge_16x16_8_neon: 211.0 hevc_sao_edge_32x32_8_c: 7802.2 hevc_sao_edge_32x32_8_neon: 808.2 hevc_sao_edge_48x48_8_c: 16764.2 hevc_sao_edge_48x48_8_neon: 1796.5 hevc_sao_edge_64x64_8_c: 32647.5 hevc_sao_edge_64x64_8_neon: 3118.5 Signed-off-by: J. Dekker <jdek@itanimul.li>
* lavc/aarch64: fix hevc sao band filterJ. Dekker2022-05-25
| | | | | | | The SAO band filter can be called with non-multiples of 8, we round up to the nearest multiple of 8 to account for this. Signed-off-by: J. Dekker <jdek@itanimul.li>
* arm64: Fix wrong BTI landing padAndre Kempe2022-04-26
| | | | | | | | | | | | This patch fixes a wrong type of BTI landing pad when branching to functions instantiated via the fft*_neon macro. Although the previously employed paciasp instruction serves as a landing pad, for the ways that this function is invoked it is the wrong type, resulting in an unexpected termination of the running process. Signed-off-by: André Kempe <andre.kempe@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>
* avcodec/vc1: Arm 64-bit NEON unescape fast pathBen Avison2022-04-01
| | | | | | | | | | checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_unescape_buffer_c: 655617.7 vc1dsp.vc1_unescape_buffer_neon: 118237.0 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
* avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast pathsBen Avison2022-04-01
| | | | | | | | | | | | | | checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. idctdsp.add_pixels_clamped_c: 313.3 idctdsp.add_pixels_clamped_neon: 24.3 idctdsp.put_pixels_clamped_c: 220.3 idctdsp.put_pixels_clamped_neon: 15.5 idctdsp.put_signed_pixels_clamped_c: 210.5 idctdsp.put_signed_pixels_clamped_neon: 19.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
* avcodec/vc1: Arm 64-bit NEON inverse transform fast pathsBen Avison2022-04-01
| | | | | | | | | | | | | | | | | | | | | | | | checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_inv_trans_4x4_c: 158.2 vc1dsp.vc1_inv_trans_4x4_neon: 65.7 vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5 vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5 vc1dsp.vc1_inv_trans_4x8_c: 335.2 vc1dsp.vc1_inv_trans_4x8_neon: 106.2 vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2 vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5 vc1dsp.vc1_inv_trans_8x4_c: 365.7 vc1dsp.vc1_inv_trans_8x4_neon: 97.2 vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7 vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5 vc1dsp.vc1_inv_trans_8x8_c: 547.7 vc1dsp.vc1_inv_trans_8x8_neon: 137.0 vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2 vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
* avcodec/vc1: Arm 64-bit NEON deblocking filter fast pathsBen Avison2022-04-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C version can still outperform the NEON version in specific cases. The balance between different code paths is stream-dependent, but in practice the best case happens about 5% of the time, the worst case happens about 40% of the time, and the complexity of the remaining cases fall somewhere in between. Therefore, taking the average of the best and worst case timings is probably a conservative estimate of the degree by which the NEON code improves performance. vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7 vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5 vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5 vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7 vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2 vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2 vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2 vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2 vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0 vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7 vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7 vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5 vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7 vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0 vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7 vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0 vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2 vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7 vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0 vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2 vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0 vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0 vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2 vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2 Signed-off-by: Ben Avison <bavison@riscosopen.org> Signed-off-by: Martin Storsjö <martin@martin.st>
* configure: Use a separate config_components.h header for $ALL_COMPONENTSMartin Storsjö2022-03-16
| | | | | | | | This avoids unnecessary rebuilds of most source files if only the list of enabled components has changed, but not the other properties of the build, set in config.h. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm64: Add Armv8.3-A PAC support to assembly filesAndre Kempe2022-03-09
| | | | | | | | | | | This patch adds optional support for Arm Pointer Authentication Codes. PAC support is turned on or off at compile time using additional compiler flags. Unless any of these is enabled explicitly, no additional code will be emitted at all. Signed-off-by: André Kempe <andre.kempe@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>
* avcodec/aarch64/idct: Add missing stddefAndreas Rheinhardt2022-02-21
| | | | | | Fixes checkheaders on aarch64. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* aarch64: h264dsp: Fix incorrectly indented codeMartin Storsjö2022-02-11
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: Disable ff_hevc_sao_band_filter_8x8_8_neon out of precautionMartin Storsjö2022-01-07
| | | | | | | | | | While this function on its own passes all of fate-hevc, there's indications that the function might need to handle widths that aren't a multiple of 8 (noted in commit f63f9be37c799ddc835af358034630d31fb7db02, which later was reverted). Signed-off-by: Martin Storsjö <martin@martin.st>
* Revert "lavc/aarch64: add hevc sao edge 16x16"Martin Storsjö2022-01-07
| | | | | | | This reverts commit a9214a2ca31c9d54f893c5ac4004a5ff30a08d10, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>
* Revert "lavc/aarch64: add hevc sao edge 8x8"Martin Storsjö2022-01-07
| | | | | | | This reverts commit c97ffc1a77ccaf901e642bd21ed26aaf75557745, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>
* Revert "lavc/aarch64: add hevc sao band 8x8 tiling"Martin Storsjö2022-01-07
| | | | | | | This reverts commit f63f9be37c799ddc835af358034630d31fb7db02, as it breaks fate-hevc. Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/aarch64: add hevc sao band 8x8 tilingJ. Dekker2022-01-04
| | | | | | | | | | | | | | | | | bench on AWS Graviton: hevc_sao_band_8x8_8_c: 317.5 hevc_sao_band_8x8_8_neon: 97.5 hevc_sao_band_16x16_8_c: 1115.0 hevc_sao_band_16x16_8_neon: 322.7 hevc_sao_band_32x32_8_c: 4599.2 hevc_sao_band_32x32_8_neon: 1246.2 hevc_sao_band_48x48_8_c: 10021.7 hevc_sao_band_48x48_8_neon: 2740.5 hevc_sao_band_64x64_8_c: 17635.0 hevc_sao_band_64x64_8_neon: 4875.7 Signed-off-by: J. Dekker <jdek@itanimul.li>
* lavc/aarch64: clean-up sao band 8x8 function formattingJ. Dekker2022-01-04
| | | | Signed-off-by: J. Dekker <jdek@itanimul.li>
* lavc/aarch64: add hevc sao edge 8x8J. Dekker2022-01-04
| | | | | | | | | bench on AWS Graviton: hevc_sao_edge_8x8_8_c: 516.0 hevc_sao_edge_8x8_8_neon: 81.0 Signed-off-by: J. Dekker <jdek@itanimul.li>
* lavc/aarch64: add hevc sao edge 16x16J. Dekker2022-01-04
| | | | | | | | | | | | | | | bench on AWS Graviton: hevc_sao_edge_16x16_8_c: 1857.0 hevc_sao_edge_16x16_8_neon: 211.0 hevc_sao_edge_32x32_8_c: 7802.2 hevc_sao_edge_32x32_8_neon: 808.2 hevc_sao_edge_48x48_8_c: 16764.2 hevc_sao_edge_48x48_8_neon: 1796.5 hevc_sao_edge_64x64_8_c: 32647.5 hevc_sao_edge_64x64_8_neon: 3118.5 Signed-off-by: J. Dekker <jdek@itanimul.li>
* aarch64: Add Armv8.5-A BTI supportJonathan Wright2021-11-16
| | | | | | | | | | | | | | | | | Add Branch Target Identifiers (BTIs) to all functions defined in AArch64 assembly files. Most of the BTI landing pads are added automatically by the 'function' macro. BTI support is turned on or off at compile time based on the presence of the __ARM_FEATURE_BTI_DEFAULT feature macro. A binary compiled with BTI support can be executed on an Armv8-A processor without BTI support because the instructions are defined in NOP space. Signed-off-by: Jonathan Wright <jonathan.wright@arm.com> Signed-off-by: Elijah Ahmad <elijah.ahmad@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: Use ret x<n> instead of br x<n> where possibleJonathan Wright2021-11-16
| | | | | | | | | | | | | | | | | | Change AArch64 assembly code to use: ret x<n> instead of: br x<n> "ret x<n>" is already used in a lot of places so this patch makes it consistent across the code base. This does not change behavior or performance. In addition, this change reduces the number of landing pads needed in a subsequent patch to support the Armv8.5-A Branch Target Identification (BTI) security feature. Signed-off-by: Jonathan Wright <jonathan.wright@arm.com> Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: h264qpel: Do vertical filtering without transposingMartin Storsjö2021-10-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This gives rather big speedups on these functions: Before: put_h264_qpel_8_mc01_8_neon: 241.0 131.5 138.7 put_h264_qpel_8_mc02_8_neon: 214.7 121.2 127.5 put_h264_qpel_8_mc03_8_neon: 242.5 131.2 135.7 put_h264_qpel_8_mc11_8_neon: 421.2 218.7 251.0 put_h264_qpel_8_mc12_8_neon: 878.0 509.5 537.5 put_h264_qpel_8_mc13_8_neon: 423.7 217.0 252.0 put_h264_qpel_8_mc21_8_neon: 858.2 479.5 514.0 put_h264_qpel_8_mc22_8_neon: 649.7 385.2 403.0 put_h264_qpel_8_mc23_8_neon: 860.2 476.5 517.7 put_h264_qpel_8_mc31_8_neon: 437.2 219.5 252.5 put_h264_qpel_8_mc32_8_neon: 892.5 510.5 546.0 put_h264_qpel_8_mc33_8_neon: 438.2 218.5 257.0 put_h264_qpel_16_mc01_8_neon: 944.2 509.7 546.7 put_h264_qpel_16_mc02_8_neon: 878.7 469.5 509.7 put_h264_qpel_16_mc03_8_neon: 945.7 510.7 557.0 put_h264_qpel_16_mc11_8_neon: 1663.2 858.5 979.5 put_h264_qpel_16_mc12_8_neon: 3510.2 2027.7 2112.7 put_h264_qpel_16_mc13_8_neon: 1664.7 857.5 980.5 put_h264_qpel_16_mc21_8_neon: 3366.2 1928.5 2030.5 put_h264_qpel_16_mc22_8_neon: 2584.7 1514.7 1590.2 put_h264_qpel_16_mc23_8_neon: 3367.7 1927.7 2035.0 put_h264_qpel_16_mc31_8_neon: 1716.7 849.7 997.0 put_h264_qpel_16_mc32_8_neon: 3564.0 2044.2 3835.2 put_h264_qpel_16_mc33_8_neon: 1717.7 863.0 989.5 After: put_h264_qpel_8_mc01_8_neon: 136.0 73.7 76.0 put_h264_qpel_8_mc02_8_neon: 108.7 65.0 64.0 put_h264_qpel_8_mc03_8_neon: 137.5 72.7 73.0 put_h264_qpel_8_mc11_8_neon: 316.2 159.0 188.5 put_h264_qpel_8_mc12_8_neon: 653.0 375.5 384.7 put_h264_qpel_8_mc13_8_neon: 318.7 165.5 189.5 put_h264_qpel_8_mc21_8_neon: 739.2 385.7 432.5 put_h264_qpel_8_mc22_8_neon: 530.7 295.5 309.5 put_h264_qpel_8_mc23_8_neon: 741.2 393.7 421.0 put_h264_qpel_8_mc31_8_neon: 332.2 162.5 190.0 put_h264_qpel_8_mc32_8_neon: 667.5 378.2 390.5 put_h264_qpel_8_mc33_8_neon: 332.7 166.5 195.5 put_h264_qpel_16_mc01_8_neon: 524.2 285.2 294.0 put_h264_qpel_16_mc02_8_neon: 454.7 252.2 250.2 put_h264_qpel_16_mc03_8_neon: 525.7 286.0 283.0 put_h264_qpel_16_mc11_8_neon: 1243.2 630.7 726.7 put_h264_qpel_16_mc12_8_neon: 2610.2 1479.7 1481.2 put_h264_qpel_16_mc13_8_neon: 1250.5 631.7 727.7 put_h264_qpel_16_mc21_8_neon: 2890.2 1571.2 1679.7 put_h264_qpel_16_mc22_8_neon: 2108.7 1177.5 1223.5 put_h264_qpel_16_mc23_8_neon: 2891.7 1578.7 1667.7 put_h264_qpel_16_mc31_8_neon: 1296.7 630.5 752.5 put_h264_qpel_16_mc32_8_neon: 2664.0 1483.2 1503.5 put_h264_qpel_16_mc33_8_neon: 1297.7 632.5 747.2 I.e. overall a 20%-60% reduction in runtime of these functions. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm/aarch64: Improve scheduling in the avg form of h264_qpelMartin Storsjö2021-10-18
| | | | | | | Don't use the loaded registers directly, avoiding stalls on in order cores. Use vrhadd.u8 with q registers where easily possible. Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/aarch64: fix relocation out of range errorZhao Zhili2021-09-25
| | | | | | Use a temporary label instead of global function symbol for b.gt. Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/aarch64: add pred functions for 10-bitMikhail Nitenko2021-08-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Benchmarks: A53 A72 pred8x8_dc_10_c: 64.2 49.5 pred8x8_dc_10_neon: 62.0 53.7 pred8x8_dc_128_10_c: 26.0 14.0 pred8x8_dc_128_10_neon: 30.7 17.5 pred8x8_horizontal_10_c: 60.0 27.7 pred8x8_horizontal_10_neon: 38.0 34.0 pred8x8_left_dc_10_c: 42.5 27.5 pred8x8_left_dc_10_neon: 51.0 41.2 pred8x8_mad_cow_dc_0l0_10_c: 55.7 37.2 pred8x8_mad_cow_dc_0l0_10_neon: 50.2 35.2 pred8x8_mad_cow_dc_0lt_10_c: 89.2 67.0 pred8x8_mad_cow_dc_0lt_10_neon: 52.2 46.7 pred8x8_mad_cow_dc_l0t_10_c: 74.7 51.0 pred8x8_mad_cow_dc_l0t_10_neon: 50.5 45.2 pred8x8_mad_cow_dc_l00_10_c: 58.0 38.0 pred8x8_mad_cow_dc_l00_10_neon: 42.5 37.5 pred8x8_plane_10_c: 354.0 288.7 pred8x8_plane_10_neon: 141.0 101.2 pred8x8_top_dc_10_c: 44.5 30.5 pred8x8_top_dc_10_neon: 40.0 31.0 pred8x8_vertical_10_c: 27.5 14.5 pred8x8_vertical_10_neon: 21.0 17.5 pred16x16_plane_10_c: 1242.0 1070.5 pred16x16_plane_10_neon: 324.0 196.7 Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/aarch64: h264, add chroma loop filters for 10bitMikhail Nitenko2021-08-21
| | | | | | | | | | | | | | | | | | | | | | | | | Benchmarks: A53 A72 h264_h_loop_filter_chroma422_10bpp_c: 282.7 114.2 h264_h_loop_filter_chroma422_10bpp_neon: 109.5 78.5 h264_h_loop_filter_chroma_10bpp_c: 165.0 81.5 h264_h_loop_filter_chroma_10bpp_neon: 120.0 76.7 h264_h_loop_filter_chroma_intra422_10bpp_c: 323.7 124.2 h264_h_loop_filter_chroma_intra422_10bpp_neon: 155.0 102.7 h264_h_loop_filter_chroma_intra_10bpp_c: 121.0 49.5 h264_h_loop_filter_chroma_intra_10bpp_neon: 79.7 53.7 h264_h_loop_filter_chroma_mbaff422_10bpp_c: 188.5 75.0 h264_h_loop_filter_chroma_mbaff422_10bpp_neon: 120.0 75.5 h264_h_loop_filter_chroma_mbaff_intra422_10bpp_c: 116.7 46.0 h264_h_loop_filter_chroma_mbaff_intra422_10bpp_neon: 79.7 53.7 h264_h_loop_filter_chroma_mbaff_intra_10bpp_c: 63.0 27.2 h264_h_loop_filter_chroma_mbaff_intra_10bpp_neon: 48.5 34.0 h264_v_loop_filter_chroma_10bpp_c: 258.7 135.5 h264_v_loop_filter_chroma_10bpp_neon: 71.2 51.0 h264_v_loop_filter_chroma_intra_10bpp_c: 158.0 70.7 h264_v_loop_filter_chroma_intra_10bpp_neon: 48.7 31.5 Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/aarch64: move transpose_4x8H to neon.SMikhail Nitenko2021-08-21
| | | | | | | | transpose_4x8H was declared in vp9lpf_16bpp_neon, however this macro is not unique to vp9 and could be used elsewhere. Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: h264dsp: Fix indentation of some functions to match the restMartin Storsjö2021-08-08
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: h264dsp: Remove unnecessary sign extensionsMartin Storsjö2021-08-08
| | | | | | | | | | These became unnecessary when the stride arguments were changed from int to ptrdiff_t in bc26fe89275c267d169b468356c82ee59874407d (0576ef466d8a631326d1d0a5ec2e4c4c81d25353) and d5d699ab6e6f8a8290748d107416fd5c19757a1b (aa844dc46f93182a63ec0b53267d19e7342c79b9). Signed-off-by: Martin Storsjö <martin@martin.st>
* avcodec/h264dsp, h264idct: Fix lengths of array parametersAndreas Rheinhardt2021-08-08
| | | | | | Fixes many -Warray-parameter warnings from GCC 11. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* aarch64: hevc_idct: Fix overflows in idct_dcMartin Storsjö2021-05-22
| | | | | | | | This is marginally slower, but correct for all input values. The previous implementation failed with certain input seeds, e.g. "checkasm --test=hevc_idct 98". Signed-off-by: Martin Storsjö <martin@martin.st>
* avcodec: Remove deprecated old encode/decode APIsAndreas Rheinhardt2021-04-27
| | | | | | | | Deprecated in commits 7fc329e2dd6226dfecaa4a1d7adf353bf2773726 and 31f6a4b4b83aca1d73f3cfc99ce2b39331970bf3. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com> Signed-off-by: James Almer <jamrial@gmail.com>
* Include attributes.h directlyAndreas Rheinhardt2021-04-19
| | | | | | | | Some files currently rely on libavutil/cpu.h to include it for them; yet said file won't use include it any more after the currently deprecated functions are removed, so include attributes.h directly. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* lavc/aarch64: add pred16x16 10-bit functionsMikhail Nitenko2021-04-19
| | | | | | | | | | | | | | Benchmarks: A53 A72 pred16x16_dc_10_c: 136.0 124.0 pred16x16_dc_10_neon: 121.2 106.0 pred16x16_horizontal_10_c: 155.0 73.2 pred16x16_horizontal_10_neon: 82.2 67.7 pred16x16_top_dc_10_c: 106.0 93.7 pred16x16_top_dc_10_neon: 87.7 77.2 pred16x16_vertical_10_c: 83.0 67.7 pred16x16_vertical_10_neon: 54.2 61.7 Some functions work slower than C and are left commented out.
* lavc/aarch64: change h264pred_init structureMikhail Nitenko2021-04-19
| | | | Change structure to allow the addition of other bit depths.
* aarch64: h264pred: Optimize the inner loop of existing 8 bit functionsMartin Storsjö2021-04-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the loop counter decrement further from the branch instruction, this hides the latency of the decrement. In loops that first load, then store (the horizontal prediction cases), do the decrement after the load (where the next instruction would stall a bit anyway, waiting for the result of the load). In loops that store twice using the same destination register, also do the decrement between the two stores (as the second store would need to wait for the updated destination register from the first instruction). In loops that store twice to two different destination registers, do the decrement before both stores, to do it as soon before the branch as possible. This gives minor (1-2 cycle) speedups in most cases (modulo measurement noise), but the horizontal prediction functions get a rather notable speedup on the Cortex A53. Before: Cortex A53 A72 A73 pred8x8_dc_8_neon: 60.7 46.2 39.2 pred8x8_dc_128_8_neon: 30.7 18.0 14.0 pred8x8_horizontal_8_neon: 42.2 29.2 18.5 pred8x8_left_dc_8_neon: 52.7 36.2 32.2 pred8x8_mad_cow_dc_0l0_8_neon: 48.2 27.7 25.7 pred8x8_mad_cow_dc_0lt_8_neon: 52.5 33.2 34.7 pred8x8_mad_cow_dc_l0t_8_neon: 52.5 31.7 33.2 pred8x8_mad_cow_dc_l00_8_neon: 43.2 27.0 25.5 pred8x8_plane_8_neon: 112.2 86.2 88.2 pred8x8_top_dc_8_neon: 40.7 23.0 21.2 pred8x8_vertical_8_neon: 27.2 15.5 14.0 pred16x16_dc_8_neon: 91.0 73.2 70.5 pred16x16_dc_128_8_neon: 43.0 34.7 30.7 pred16x16_horizontal_8_neon: 86.0 49.7 44.7 pred16x16_left_dc_8_neon: 87.0 67.2 67.5 pred16x16_plane_8_neon: 236.0 175.7 173.0 pred16x16_top_dc_8_neon: 53.2 39.0 41.7 pred16x16_vertical_8_neon: 41.7 29.7 31.0 After: pred8x8_dc_8_neon: 59.0 46.7 42.5 pred8x8_dc_128_8_neon: 28.2 18.0 14.0 pred8x8_horizontal_8_neon: 34.2 29.2 18.5 pred8x8_left_dc_8_neon: 51.0 38.2 32.7 pred8x8_mad_cow_dc_0l0_8_neon: 46.7 28.2 26.2 pred8x8_mad_cow_dc_0lt_8_neon: 55.2 33.7 37.5 pred8x8_mad_cow_dc_l0t_8_neon: 51.2 31.7 37.2 pred8x8_mad_cow_dc_l00_8_neon: 41.7 27.5 26.0 pred8x8_plane_8_neon: 111.5 86.5 89.5 pred8x8_top_dc_8_neon: 39.0 23.2 21.0 pred8x8_vertical_8_neon: 27.2 16.0 14.0 pred16x16_dc_8_neon: 85.0 70.2 70.5 pred16x16_dc_128_8_neon: 42.0 30.0 30.7 pred16x16_horizontal_8_neon: 66.5 49.5 42.5 pred16x16_left_dc_8_neon: 81.0 66.5 67.5 pred16x16_plane_8_neon: 235.0 175.7 173.0 pred16x16_top_dc_8_neon: 52.0 39.0 41.7 pred16x16_vertical_8_neon: 40.2 33.2 31.0 Despite this, a number of these functions still are slower than what e.g. GCC 7 generates - this shows the relative speedup of the neon codepaths over the compiler generated ones: Cortex A53 A72 A73 pred8x8_dc_8_neon: 0.86 0.65 1.04 pred8x8_dc_128_8_neon: 0.59 0.44 0.62 pred8x8_horizontal_8_neon: 1.51 0.58 1.30 pred8x8_left_dc_8_neon: 0.72 0.56 0.89 pred8x8_mad_cow_dc_0l0_8_neon: 0.93 0.93 1.37 pred8x8_mad_cow_dc_0lt_8_neon: 1.37 1.41 1.68 pred8x8_mad_cow_dc_l0t_8_neon: 1.21 1.17 1.32 pred8x8_mad_cow_dc_l00_8_neon: 1.24 1.19 1.60 pred8x8_plane_8_neon: 3.36 3.58 3.76 pred8x8_top_dc_8_neon: 0.97 0.99 1.43 pred8x8_vertical_8_neon: 0.86 0.78 1.18 pred16x16_dc_8_neon: 1.20 1.06 1.49 pred16x16_dc_128_8_neon: 0.83 0.95 0.99 pred16x16_horizontal_8_neon: 1.78 0.96 1.59 pred16x16_left_dc_8_neon: 1.06 0.96 1.32 pred16x16_plane_8_neon: 5.78 6.49 7.19 pred16x16_top_dc_8_neon: 1.48 1.53 1.94 pred16x16_vertical_8_neon: 1.39 1.34 1.98 In particular, on Cortex A72, many of these functions are slower than the compiler generated code, while they're more beneficial on e.g. the Cortex A73. Signed-off-by: Martin Storsjö <martin@martin.st>
* avcodec: add missing FF_API_OLD_ENCDEC wrappers to xmm clobber functionsJames Almer2021-02-26
| | | | Signed-off-by: James Almer <jamrial@gmail.com>
* lavc/aarch64: add HEVC sao_band NEONJosh Dekker2021-02-18
| | | | | | Only works for 8x8. Signed-off-by: Josh Dekker <josh@itanimul.li>
* lavc/aarch64: add HEVC idct_dc NEONJosh Dekker2021-02-18
| | | | Signed-off-by: Josh Dekker <josh@itanimul.li>