summaryrefslogtreecommitdiff
path: root/libavcodec/aarch64
Commit message (Collapse)AuthorAge
* lavc/aarch64/simple_idct: fix iOS build without gas-preprocessorMatthieu Bouron2017-05-11
| | | | | | | | | | Separates macro arguments with commas and passes .4H/.8H as macro arguments instead of 4H/8H (the later form being interpreted as an hexadecimal value). Fixes ticket #6324. Suggested-by: Martin Storsjö <martin@martin.st>
* Merge commit '2425d7329fdccfa9954faba748f3865151354f0c'Clément Bœsch2017-04-26
|\ | | | | | | | | | | | | * commit '2425d7329fdccfa9954faba748f3865151354f0c': arm64: replace 'bic' with immediate with 'and' with inverted immediate Merged-by: Clément Bœsch <u@pkh.me>
| * arm64: replace 'bic' with immediate with 'and' with inverted immediateJanne Grunau2016-12-14
| | | | | | | | | | | | The former is not an official pseudo instruction although gas and llvm's internal assembler support it. Fixes a build error with xcode 6.2 reported by Memphiz on github.
| * aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 ↵Martin Storsjö2016-11-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | and 32x32 This work is sponsored by, and copyright, Google. Previously all subpartitions except the eob=1 (DC) case ran with the same runtime: vp9_inv_dct_dct_16x16_sub16_add_neon: 1373.2 vp9_inv_dct_dct_32x32_sub32_add_neon: 8089.0 By skipping individual 8x16 or 8x32 pixel slices in the first pass, we reduce the runtime of these functions like this: vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_16x16_sub2_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub8_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub12_add_neon: 1372.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.1 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 5190.2 vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0 vp9_inv_dct_dct_32x32_sub8_add_neon: 5183.1 vp9_inv_dct_dct_32x32_sub12_add_neon: 6161.5 vp9_inv_dct_dct_32x32_sub16_add_neon: 6155.5 vp9_inv_dct_dct_32x32_sub20_add_neon: 7136.3 vp9_inv_dct_dct_32x32_sub24_add_neon: 7128.4 vp9_inv_dct_dct_32x32_sub28_add_neon: 8098.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 8098.8 I.e. in general a very minor overhead for the full subpartition case due to the additional cmps, but a significant speedup for the cases when we only need to process a small part of the actual input data. Signed-off-by: Martin Storsjö <martin@martin.st>
| * aarch64: vp9itxfm: Don't repeatedly set x9 when nothing overwrites itMartin Storsjö2016-11-24
| | | | | | | | Signed-off-by: Martin Storsjö <martin@martin.st>
| * arm/aarch64: vp9itxfm: Fix indentation of macro argumentsMartin Storsjö2016-11-23
| | | | | | | | Signed-off-by: Martin Storsjö <martin@martin.st>
| * aarch64: vp9itxfm: Use w3 instead of x3 for the int eob parameterMartin Storsjö2016-11-18
| | | | | | | | | | | | | | | | The clobbering tests in checkasm are only invoked when testing correctness, so this bug didn't show up when benchmarking the dc-only version. Signed-off-by: Martin Storsjö <martin@martin.st>
| * aarch64: vp9: loop filter: replace 'orr; cbn?z' with 'adds; b.{eq,ne};Janne Grunau2016-11-16
| | | | | | | | | | | | The latter is 1 cycle faster on a cortex-53 and since the operands are bytewise (or larger) bitmask (impossible to overflow to zero) both are equivalent.
| * aarch64: vp9: use alternative returns in the core loop filter functionJanne Grunau2016-11-16
| | | | | | | | | | | | | | Since aarch64 has enough free general purpose registers use them to branch to the appropiate storage code. 1-2 cycles faster for the functions using loop_filter 8/16, ... on a cortex-a53. Mixed results (up to 2 cycles faster/slower) on a cortex-a57.
| * aarch64: vp9: loop_filter: fix typo in skip flatout8 checkJanne Grunau2016-11-14
| | | | | | | | | | | | | | The 16_16 loop filter functions could miss an early exit before flatout8. Signed-off-by: Martin Storsjö <martin@martin.st>
| * aarch64: vp9: Implement NEON loop filtersMartin Storsjö2016-11-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the loop filters with 16 pixels at a time. The implementation is fully templated, with a single macro which can generate versions for both 8 and 16 pixels wide, for both 4, 8 and 16 pixels loop filters (and the 4/8 mixed versions as well). For the 8 pixel wide versions, it is pretty close in speed (the v_4_8 and v_8_8 filters are the best examples of this; the h_4_8 and h_8_8 filters seem to get some gain in the load/transpose/store part). For the 16 pixels wide ones, we get a speedup of around 1.2-1.4x compared to the 32 bit version. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_loop_filter_h_4_8_neon: 144.0 127.2 vp9_loop_filter_h_8_8_neon: 207.0 182.5 vp9_loop_filter_h_16_8_neon: 415.0 328.7 vp9_loop_filter_h_16_16_neon: 672.0 558.6 vp9_loop_filter_mix2_h_44_16_neon: 302.0 203.5 vp9_loop_filter_mix2_h_48_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_84_16_neon: 365.0 305.2 vp9_loop_filter_mix2_h_88_16_neon: 376.0 305.2 vp9_loop_filter_mix2_v_44_16_neon: 193.2 128.2 vp9_loop_filter_mix2_v_48_16_neon: 246.7 218.4 vp9_loop_filter_mix2_v_84_16_neon: 248.0 218.5 vp9_loop_filter_mix2_v_88_16_neon: 302.0 218.2 vp9_loop_filter_v_4_8_neon: 89.0 88.7 vp9_loop_filter_v_8_8_neon: 141.0 137.7 vp9_loop_filter_v_16_8_neon: 295.0 272.7 vp9_loop_filter_v_16_16_neon: 546.0 453.7 The speedup vs C code in checkasm tests is around 2-7x, which is pretty much the same as for the 32 bit version. Even if these functions are faster than their 32 bit equivalent, the C version that we compare to also became around 1.3-1.7x faster than the C version in 32 bit. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 4-5x. Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon loop_filter_h_4_8_neon: 256.6 93.4 loop_filter_h_8_8_neon: 307.3 139.1 loop_filter_h_16_8_neon: 340.1 254.1 loop_filter_h_16_16_neon: 827.0 407.9 loop_filter_mix2_h_44_16_neon: 524.5 155.4 loop_filter_mix2_h_48_16_neon: 644.5 173.3 loop_filter_mix2_h_84_16_neon: 630.5 222.0 loop_filter_mix2_h_88_16_neon: 697.3 222.0 loop_filter_mix2_v_44_16_neon: 598.5 100.6 loop_filter_mix2_v_48_16_neon: 651.5 127.0 loop_filter_mix2_v_84_16_neon: 591.5 167.1 loop_filter_mix2_v_88_16_neon: 855.1 166.7 loop_filter_v_4_8_neon: 271.7 65.3 loop_filter_v_8_8_neon: 312.5 106.9 loop_filter_v_16_8_neon: 473.3 206.5 loop_filter_v_16_16_neon: 976.1 327.8 The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57 is again 30-50% faster than the cortex-a53. Signed-off-by: Martin Storsjö <martin@martin.st>
| * aarch64: vp9: Add NEON itxfm routinesMartin Storsjö2016-11-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. These are ported from the ARM version; thanks to the larger amount of registers available, we can do the 16x16 and 32x32 transforms in slices 8 pixels wide instead of 4. This gives a speedup of around 1.4x compared to the 32 bit version. The fact that aarch64 doesn't have the same d/q register aliasing makes some of the macros quite a bit simpler as well. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_inv_adst_adst_4x4_add_neon: 90.0 87.7 vp9_inv_adst_adst_8x8_add_neon: 400.0 354.7 vp9_inv_adst_adst_16x16_add_neon: 2526.5 1827.2 vp9_inv_dct_dct_4x4_add_neon: 74.0 72.7 vp9_inv_dct_dct_8x8_add_neon: 271.0 256.7 vp9_inv_dct_dct_16x16_add_neon: 1960.7 1372.7 vp9_inv_dct_dct_32x32_add_neon: 11988.9 8088.3 vp9_inv_wht_wht_4x4_add_neon: 63.0 57.7 The speedup vs C code (2-4x) is smaller than in the 32 bit case, mostly because the C code ends up significantly faster (around 1.6x faster, with GCC 5.4) when built for aarch64. Examples of runtimes vs C on a Cortex A57 (for a slightly older version of the patch): A57 gcc-5.3 neon vp9_inv_adst_adst_4x4_add_neon: 152.2 60.0 vp9_inv_adst_adst_8x8_add_neon: 948.2 288.0 vp9_inv_adst_adst_16x16_add_neon: 4830.4 1380.5 vp9_inv_dct_dct_4x4_add_neon: 153.0 58.6 vp9_inv_dct_dct_8x8_add_neon: 789.2 180.2 vp9_inv_dct_dct_16x16_add_neon: 3639.6 917.1 vp9_inv_dct_dct_32x32_add_neon: 20462.1 4985.0 vp9_inv_wht_wht_4x4_add_neon: 91.0 49.8 The asm is around factor 3-4 faster than C on the cortex-a57 and the asm is around 30-50% faster on the a57 compared to the a53. Signed-off-by: Martin Storsjö <martin@martin.st>
| * aarch64: h264idct: Use the offset parameter to movrelMartin Storsjö2016-11-10
| | | | | | | | Signed-off-by: Martin Storsjö <martin@martin.st>
| * aarch64: vp9: Add NEON optimizations of VP9 MC functionsMartin Storsjö2016-11-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. These are ported from the ARM version; it is essentially a 1:1 port with no extra added features, but with some hand tuning (especially for the plain copy/avg functions). The ARM version isn't very register starved to begin with, so there's not much to be gained from having more spare registers here - we only avoid having to clobber callee-saved registers. Examples of runtimes vs the 32 bit version, on a Cortex A53: ARM AArch64 vp9_avg4_neon: 27.2 23.7 vp9_avg8_neon: 56.5 54.7 vp9_avg16_neon: 169.9 167.4 vp9_avg32_neon: 585.8 585.2 vp9_avg64_neon: 2460.3 2294.7 vp9_avg_8tap_smooth_4h_neon: 132.7 125.2 vp9_avg_8tap_smooth_4hv_neon: 478.8 442.0 vp9_avg_8tap_smooth_4v_neon: 126.0 93.7 vp9_avg_8tap_smooth_8h_neon: 241.7 234.2 vp9_avg_8tap_smooth_8hv_neon: 690.9 646.5 vp9_avg_8tap_smooth_8v_neon: 245.0 205.5 vp9_avg_8tap_smooth_64h_neon: 11273.2 11280.1 vp9_avg_8tap_smooth_64hv_neon: 22980.6 22184.1 vp9_avg_8tap_smooth_64v_neon: 11549.7 10781.1 vp9_put4_neon: 18.0 17.2 vp9_put8_neon: 40.2 37.7 vp9_put16_neon: 97.4 99.5 vp9_put32_neon/armv8: 346.0 307.4 vp9_put64_neon/armv8: 1319.0 1107.5 vp9_put_8tap_smooth_4h_neon: 126.7 118.2 vp9_put_8tap_smooth_4hv_neon: 465.7 434.0 vp9_put_8tap_smooth_4v_neon: 113.0 86.5 vp9_put_8tap_smooth_8h_neon: 229.7 221.6 vp9_put_8tap_smooth_8hv_neon: 658.9 621.3 vp9_put_8tap_smooth_8v_neon: 215.0 187.5 vp9_put_8tap_smooth_64h_neon: 10636.7 10627.8 vp9_put_8tap_smooth_64hv_neon: 21076.8 21026.9 vp9_put_8tap_smooth_64v_neon: 9635.0 9632.4 These are generally about as fast as the corresponding ARM routines on the same CPU (at least on the A53), in most cases marginally faster. The speedup vs C code is pretty much the same as for the 32 bit case; on the A53 it's around 6-13x for ther larger 8tap filters. The exact speedup varies a little, since the C versions generally don't end up exactly as slow/fast as on 32 bit. Signed-off-by: Martin Storsjö <martin@martin.st>
* | Merge commit '72a19f4013ec2c7f8581416f8ad4bf81df163fb6'James Almer2017-03-31
|\| | | | | | | | | | | | | * commit '72a19f4013ec2c7f8581416f8ad4bf81df163fb6': mpegaudiodsp: aarch64: Adjust function prototype after 2caa93b813adc5dbb7771dfe615da826a2947d18 Merged-by: James Almer <jamrial@gmail.com>
| * mpegaudiodsp: aarch64: Adjust function prototype after ↵Diego Biurrun2016-11-10
| | | | | | | | 2caa93b813adc5dbb7771dfe615da826a2947d18
* | aarch64/vp9dsp: add missing header includesJames Almer2017-03-28
| |
* | vp9: re-split the decoder/format/dsp interface header files.Ronald S. Bultje2017-03-28
| | | | | | | | | | The advantage here is that the internal software decoder interface is not exposed to the DSP functions or the hardware accelerations.
* | lavc/vp9: split into vp9{block,data,mvs}Clément Bœsch2017-03-27
| | | | | | | | This is following Libav layout to ease merges.
* | Merge commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b'Clément Bœsch2017-03-23
|\| | | | | | | | | | | | | * commit '9b2ccafb480c94fd09cfb24306d5296dc013cf5b': aarch64: Add missing sign extension in ff_h264_idct8_add_neon Merged-by: Clément Bœsch <u@pkh.me>
| * aarch64: Add missing sign extension in ff_h264_idct8_add_neonMartin Storsjö2016-10-10
| | | | | | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* | Merge commit '2caa93b813adc5dbb7771dfe615da826a2947d18'James Almer2017-03-21
|\| | | | | | | | | | | | | * commit '2caa93b813adc5dbb7771dfe615da826a2947d18': mpegaudiodsp: Change type of array stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>
| * mpegaudiodsp: Change type of array stride parameters to ptrdiff_tDiego Biurrun2016-09-29
| | | | | | | | | | This avoids SIMD-optimized functions having to sign-extend their stride argument manually to be able to do pointer arithmetic.
* | Merge commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c'James Almer2017-03-21
|\| | | | | | | | | | | | | * commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c': h264chroma: Change type of stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>
| * h264chroma: Change type of stride parameters to ptrdiff_tDiego Biurrun2016-09-29
| | | | | | | | | | This avoids SIMD-optimized functions having to sign-extend their stride argument manually to be able to do pointer arithmetic.
* | Merge commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428'James Almer2017-03-21
|\| | | | | | | | | | | | | * commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428': idct: Change type of array stride parameters to ptrdiff_t Merged-by: James Almer <jamrial@gmail.com>
* | Merge commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0'Clément Bœsch2017-03-21
|\| | | | | | | | | | | | | | | | | * commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0': lavc: add clobber tests for the new encoding/decoding API The merge only re-order what we already have. Merged-by: Clément Bœsch <u@pkh.me>
| * lavc: add clobber tests for the new encoding/decoding APIAnton Khirnov2016-09-28
| |
| * libavcodec: fix constness in clobber test avcodec_open2() wrappersClément Bœsch2016-06-26
| | | | | | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possibleMartin Storsjö2017-03-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 21512 bytes to 31400 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: vp9_inv_dct_dct_16x16_sub1_add_10_neon: 284.6 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 1902.7 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1903.0 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 2201.1 vp9_inv_dct_dct_16x16_sub12_add_10_neon: 2510.0 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2821.3 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1011.6 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 9716.5 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9704.9 vp9_inv_dct_dct_32x32_sub8_add_10_neon: 10641.7 vp9_inv_dct_dct_32x32_sub12_add_10_neon: 11555.7 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 12499.8 vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13403.7 vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14335.8 vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15253.6 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16179.5 After: vp9_inv_dct_dct_16x16_sub1_add_10_neon: 282.8 vp9_inv_dct_dct_16x16_sub2_add_10_neon: 1142.4 vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1139.0 vp9_inv_dct_dct_16x16_sub8_add_10_neon: 1772.9 vp9_inv_dct_dct_16x16_sub12_add_10_neon: 2515.2 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2823.5 vp9_inv_dct_dct_32x32_sub1_add_10_neon: 1012.7 vp9_inv_dct_dct_32x32_sub2_add_10_neon: 6944.4 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 6944.2 vp9_inv_dct_dct_32x32_sub8_add_10_neon: 7609.8 vp9_inv_dct_dct_32x32_sub12_add_10_neon: 9953.4 vp9_inv_dct_dct_32x32_sub16_add_10_neon: 10770.1 vp9_inv_dct_dct_32x32_sub20_add_10_neon: 13418.8 vp9_inv_dct_dct_32x32_sub24_add_10_neon: 14330.7 vp9_inv_dct_dct_32x32_sub28_add_10_neon: 15257.1 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16190.6 Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9itxfm16: Move the load_add_store macro out from the itxfm16 ↵Martin Storsjö2017-03-19
| | | | | | | | | | | | | | | | | | pass2 function This allows reusing the macro for a separate implementation of the pass2 function. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9itxfm16: Make the larger core transforms standalone functionsMartin Storsjö2017-03-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/aarch64/vp9itxfm_16bpp_neon.o from 26288 to 21512 bytes. This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms. Before: vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1887.4 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2801.5 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9691.4 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16154.9 After: vp9_inv_dct_dct_16x16_sub4_add_10_neon: 1899.5 vp9_inv_dct_dct_16x16_sub16_add_10_neon: 2827.2 vp9_inv_dct_dct_32x32_sub4_add_10_neon: 9714.7 vp9_inv_dct_dct_32x32_sub32_add_10_neon: 16175.9 Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9itxfm16: Restructure the idct32 store macrosMartin Storsjö2017-03-19
| | | | | | | | | | | | | | This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9itxfm16: Avoid .irp when it doesn't save any linesMartin Storsjö2017-03-19
| | | | | | | | | | | | This makes the code a bit more readable. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9itxfm16: Fix a typo in a commentMartin Storsjö2017-03-19
| | | | | | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* | arm/aarch64: vp9: Fix vertical alignmentMartin Storsjö2017-03-19
| | | | | | | | | | | | | | | | | | | | | | | | | | Align the second/third operands as they usually are. Due to the wildly varying sizes of the written out operands in aarch64 assembly, the column alignment is usually not as clear as in arm assembly. This is cherrypicked from libav commit 7995ebfad12002033c73feed422a1cfc62081e8f. Signed-off-by: Martin Storsjö <martin@martin.st>
* | arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be usedMartin Storsjö2017-03-19
| | | | | | | | | | | | | | | | | | | | In the half/quarter cases where we don't use the min_eob array, defer loading the pointer until we know it will be needed. This is cherrypicked from libav commit 3a0d5e206d24d41d87a25ba16a79b2ea04c39d4c. Signed-off-by: Martin Storsjö <martin@martin.st>
* | lavc/aarch64: add ff_simple_idct{,_add,_put}_neon functionsMatthieu Bouron2017-03-16
| |
* | aarch64: vp9itxfm: Reorder iadst16 coeffsMartin Storsjö2017-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This matches the order they are in the 16 bpp version. There they are in this order, to make sure we access them in the same order they are declared, easing loading only half of the coefficients at a time. This makes the 8 bpp version match the 16 bpp version better. This is cherrypicked from libav commit b8f66c0838b4c645227f23a35b4d54373da4c60a. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9itxfm: Reorder the idct coefficients for better pairingMartin Storsjö2017-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All elements are used pairwise, except for the first one. Previously, the 16th element was unused. Move the unused element to the second slot, to make the later element pairs not split across registers. This simplifies loading only parts of the coefficients, reducing the difference to the 16 bpp version. This is cherrypicked from libav commit 09eb88a12e008d10a3f7a6be75d18ad98b368e68. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9itxfm: Avoid reloading the idct32 coefficientsMartin Storsjö2017-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The idct32x32 function actually pushed d8-d15 onto the stack even though it didn't clobber them; there are plenty of registers that can be used to allow keeping all the idct coefficients in registers without having to reload different subsets of them at different stages in the transform. After this, we still can skip pushing d12-d15. Before: vp9_inv_dct_dct_32x32_sub32_add_neon: 8128.3 After: vp9_inv_dct_dct_32x32_sub32_add_neon: 8053.3 This is cherrypicked from libav commit 65aa002d54433154a6924dc13e498bec98451ad0. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9lpf: Use dup+rev16+uzp1 instead of dup+lsr+dup+trn1Martin Storsjö2017-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This is one cycle faster in total, and three instructions fewer. Before: vp9_loop_filter_mix2_v_44_16_neon: 123.2 After: vp9_loop_filter_mix2_v_44_16_neon: 122.2 This is cherrypicked from libav commit 3bf9c48320f25f3d5557485b0202f22ae60748b0. Signed-off-by: Martin Storsjö <martin@martin.st>
* | arm/aarch64: vp9lpf: Keep the comparison to E within 8 bitMartin Storsjö2017-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The theoretical maximum value of E is 193, so we can just saturate the addition to 255. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 After: vp9_loop_filter_v_4_8_neon: 136.0 125.7 112.6 84.0 83.0 vp9_loop_filter_v_8_8_neon: 234.0 195.5 171.5 136.0 133.7 vp9_loop_filter_v_16_8_neon: 490.0 417.5 377.7 289.0 271.0 vp9_loop_filter_v_16_16_neon: 951.2 814.7 732.3 571.0 446.7 This is cherrypicked from libav commit c582cb8537367721bb399a5d01b652c20142b756. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9lpf: Fix broken indentation/vertical alignmentMartin Storsjö2017-03-11
| | | | | | | | | | | | | | This is cherrypicked from libav commit 07b5136c481d394992c7e951967df0cfbb346c0b. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9lpf: Interleave the start of flat8in into the calculation aboveMartin Storsjö2017-03-11
| | | | | | | | | | | | | | | | | | | | This adds lots of extra .ifs, but speeds it up by a couple cycles, by avoiding stalls. This is cherrypicked from libav commit b0806088d3b27044145b20421da8d39089ae0c6a. Signed-off-by: Martin Storsjö <martin@martin.st>
* | arm/aarch64: vp9lpf: Calculate !hev directlyMartin Storsjö2017-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously we first calculated hev, and then negated it. Since we were able to schedule the negation in the middle of another calculation, we don't see any gain in all cases. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.8 89.0 88.7 vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 500.0 419.5 382.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0 453.0 After: vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 This is cherrypicked from libav commit e1f9de86f454861b69b199ad801adc2ec6c3b220. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrollingMartin Storsjö2017-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. Before: Cortex A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2 vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3 This is cherrypicked from libav commit 3fcf788fbbccc4130868e7abe58a88990290f7c1. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filterMartin Storsjö2017-03-11
| | | | | | | | | | | | | | | | | | No measured speedup on a Cortex A53, but other cores might benefit. This is cherrypicked from libav commit 388e0d2515bc6bbc9d0c9af1d230bd16cf945fe7. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9mc: Simplify the extmla macro parametersMartin Storsjö2017-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fold the field lengths into the macro. This makes the macro invocations much more readable, when the lines are shorter. This also makes it easier to use only half the registers within the macro. This is cherrypicked from libav commit 5e0c2158fbc774f87d3ce4b7b950ba4d42c4a7b8. Signed-off-by: Martin Storsjö <martin@martin.st>
* | aarch64: vp9itxfm: Fix incorrect vertical alignmentMartin Storsjö2017-03-11
| | | | | | | | | | | | | | This is cherrypicked from libav commit 0c0b87f12d48d4e7f0d3d13f9345e828a3a5ea32. Signed-off-by: Martin Storsjö <martin@martin.st>