summaryrefslogtreecommitdiff
path: root/libavcodec/arm
Commit message (Collapse)AuthorAge
* arm: vp9mc: Fix vertical alignment of operandsMartin Storsjö2017-01-03
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32Martin Storsjö2016-11-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. Previously all subpartitions except the eob=1 (DC) case ran with the same runtime: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub16_add_neon: 3188.1 2435.4 2499.0 1969.0 vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.7 16582.3 14207.6 12000.3 By skipping individual 4x16 or 4x32 pixel slices in the first pass, we reduce the runtime of these functions like this: vp9_inv_dct_dct_16x16_sub1_add_neon: 274.6 189.5 211.7 235.8 vp9_inv_dct_dct_16x16_sub2_add_neon: 2064.0 1534.8 1719.4 1248.7 vp9_inv_dct_dct_16x16_sub4_add_neon: 2135.0 1477.2 1736.3 1249.5 vp9_inv_dct_dct_16x16_sub8_add_neon: 2446.7 1828.7 1993.6 1494.7 vp9_inv_dct_dct_16x16_sub12_add_neon: 2832.4 2118.3 2266.5 1735.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.7 2475.3 2523.5 1983.1 vp9_inv_dct_dct_32x32_sub1_add_neon: 756.2 456.7 862.0 553.9 vp9_inv_dct_dct_32x32_sub2_add_neon: 10682.2 8190.4 8539.2 6762.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 10813.5 8014.9 8518.3 6762.8 vp9_inv_dct_dct_32x32_sub8_add_neon: 11859.6 9313.0 9347.4 7514.5 vp9_inv_dct_dct_32x32_sub12_add_neon: 12946.6 10752.4 10192.2 8280.2 vp9_inv_dct_dct_32x32_sub16_add_neon: 14074.6 11946.5 11001.4 9008.6 vp9_inv_dct_dct_32x32_sub20_add_neon: 15269.9 13662.7 11816.1 9762.6 vp9_inv_dct_dct_32x32_sub24_add_neon: 16327.9 14940.1 12626.7 10516.0 vp9_inv_dct_dct_32x32_sub28_add_neon: 17462.7 15776.1 13446.2 11264.7 vp9_inv_dct_dct_32x32_sub32_add_neon: 18575.5 17157.0 14249.3 12015.1 I.e. in general a very minor overhead for the full subpartition case due to the additional loads and cmps, but a significant speedup for the cases when we only need to process a small part of the actual input data. In common VP9 content in a few inspected clips, 70-90% of the non-dc-only 16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left 8x8 or 16x16 subpartitions respectively. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Only reload the idct coeffs for the iadst_idct combinationMartin Storsjö2016-11-30
| | | | | | | | | This avoids reloading them if they haven't been clobbered, if the first pass also was idct. This is similar to what was done in the aarch64 version. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Rename a macro parameter to fit betterMartin Storsjö2016-11-23
| | | | | | | | | Since the same parameter is used for both input and output, the name inout is more fitting. This matches the naming used below in the dmbutterfly macro. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm/aarch64: vp9itxfm: Fix indentation of macro argumentsMartin Storsjö2016-11-23
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Simplify the stack alignment codeJanne Grunau2016-11-18
| | | | | | | This is one instruction less for thumb, and only have got 1/2 arm/thumb specific instructions. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Simplify txfm string comparisonsMartin Storsjö2016-11-14
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9: Add NEON loop filtersMartin Storsjö2016-11-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. The implementation tries to have smart handling of cases where no pixels need the full filtering for the 8/16 width filters, skipping both calculation and writeback of the unmodified pixels in those cases. The actual effect of this is hard to test with checkasm though, since it tests the full filtering, and the benefit depends on how many filtered blocks use the shortcut. Examples of relative speedup compared to the C version, from checkasm: Cortex A7 A8 A9 A53 vp9_loop_filter_h_4_8_neon: 2.72 2.68 1.78 3.15 vp9_loop_filter_h_8_8_neon: 2.36 2.38 1.70 2.91 vp9_loop_filter_h_16_8_neon: 1.80 1.89 1.45 2.01 vp9_loop_filter_h_16_16_neon: 2.81 2.78 2.18 3.16 vp9_loop_filter_mix2_h_44_16_neon: 2.65 2.67 1.93 3.05 vp9_loop_filter_mix2_h_48_16_neon: 2.46 2.38 1.81 2.85 vp9_loop_filter_mix2_h_84_16_neon: 2.50 2.41 1.73 2.85 vp9_loop_filter_mix2_h_88_16_neon: 2.77 2.66 1.96 3.23 vp9_loop_filter_mix2_v_44_16_neon: 4.28 4.46 3.22 5.70 vp9_loop_filter_mix2_v_48_16_neon: 3.92 4.00 3.03 5.19 vp9_loop_filter_mix2_v_84_16_neon: 3.97 4.31 2.98 5.33 vp9_loop_filter_mix2_v_88_16_neon: 3.91 4.19 3.06 5.18 vp9_loop_filter_v_4_8_neon: 4.53 4.47 3.31 6.05 vp9_loop_filter_v_8_8_neon: 3.58 3.99 2.92 5.17 vp9_loop_filter_v_16_8_neon: 3.40 3.50 2.81 4.68 vp9_loop_filter_v_16_16_neon: 4.66 4.41 3.74 6.02 The speedup vs C code is around 2-6x. The numbers are quite inconclusive though, since the checkasm test runs multiple filterings on top of each other, so later rounds might end up with different codepaths (different decisions on which filter to apply, based on input pixel differences). Disabling the early-exit in the asm doesn't give a fair comparison either though, since the C code only does the necessary calcuations for each row. Based on START_TIMER/STOP_TIMER wrapping around a few individual functions, the speedup vs C code is around 4-9x. This is pretty similar in runtime to the corresponding routines in libvpx. (This is comparing vpx_lpf_vertical_16_neon, vpx_lpf_horizontal_edge_8_neon and vpx_lpf_horizontal_edge_16_neon to vp9_loop_filter_h_16_8_neon, vp9_loop_filter_v_16_8_neon and vp9_loop_filter_v_16_16_neon - note that the naming of horizonal and vertical is flipped between the libraries.) In order to have stable, comparable numbers, the early exits in both asm versions were disabled, forcing the full filtering codepath. Cortex A7 A8 A9 A53 vp9_loop_filter_h_16_8_neon: 597.2 472.0 482.4 415.0 libvpx vpx_lpf_vertical_16_neon: 626.0 464.5 470.7 445.0 vp9_loop_filter_v_16_8_neon: 500.2 422.5 429.7 295.0 libvpx vpx_lpf_horizontal_edge_8_neon: 586.5 414.5 415.6 383.2 vp9_loop_filter_v_16_16_neon: 905.0 784.7 791.5 546.0 libvpx vpx_lpf_horizontal_edge_16_neon: 1060.2 751.7 743.5 685.2 Our version is consistently faster on on A7 and A53, marginally slower on A8, and sometimes faster, sometimes slower on A9 (marginally slower in all three tests in this particular test run). Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9: Add NEON itxfm routinesMartin Storsjö2016-11-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. For the transforms up to 8x8, we can fit all the data (including temporaries) in registers and just do a straightforward transform of all the data. For 16x16, we do a transform of 4x16 pixels in 4 slices, using a temporary buffer. For 32x32, we transform 4x32 pixels at a time, in two steps of 4x16 pixels each. Examples of relative speedup compared to the C version, from checkasm: Cortex A7 A8 A9 A53 vp9_inv_adst_adst_4x4_add_neon: 3.39 5.83 4.17 4.01 vp9_inv_adst_adst_8x8_add_neon: 3.79 4.86 4.23 3.98 vp9_inv_adst_adst_16x16_add_neon: 3.33 4.36 4.11 4.16 vp9_inv_dct_dct_4x4_add_neon: 4.06 6.16 4.59 4.46 vp9_inv_dct_dct_8x8_add_neon: 4.61 6.01 4.98 4.86 vp9_inv_dct_dct_16x16_add_neon: 3.35 3.44 3.36 3.79 vp9_inv_dct_dct_32x32_add_neon: 3.89 3.50 3.79 4.42 vp9_inv_wht_wht_4x4_add_neon: 3.22 5.13 3.53 3.77 Thus, the speedup vs C code is around 3-6x. This is mostly marginally faster than the corresponding routines in libvpx on most cores, tested with their 32x32 idct (compared to vpx_idct32x32_1024_add_neon). These numbers are slightly in libvpx's favour since their version doesn't clear the input buffer like ours do (although the effect of that on the total runtime probably is negligible.) Cortex A7 A8 A9 A53 vp9_inv_dct_dct_32x32_add_neon: 18436.8 16874.1 14235.1 11988.9 libvpx vpx_idct32x32_1024_add_neon 20789.0 13344.3 15049.9 13030.5 Only on the Cortex A8, the libvpx function is faster. On the other cores, ours is slightly faster even though ours has got source block clearing integrated. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9mc: Use a different helper register for PIC loadsMartin Storsjö2016-11-10
| | | | | | | | | This fixes crashes since 557c1675cf in linux PIC builds. Previously, movrelx silently used r12 as helper register, which doesn't work when r12 is the destination register. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9mc: Minor adjustments from review of the aarch64 versionMartin Storsjö2016-11-10
| | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. The speedup for the large horizontal filters is surprisingly big on A7 and A53, while there's a minor slowdown (almost within measurement noise) on A8 and A9. Cortex A7 A8 A9 A53 orig: vp9_put_8tap_smooth_64h_neon: 20270.0 14447.3 19723.9 10910.9 new: vp9_put_8tap_smooth_64h_neon: 20165.8 14466.5 19730.2 10668.8 Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9mc: Insert a literal pool at the middle of the fileMartin Storsjö2016-11-04
| | | | | | | | | This fixes errors like this when building non-pic binaries with armv6 as baseline: Error: invalid literal constant: pool needs to be closer Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9: Add NEON optimizations of VP9 MC functionsMartin Storsjö2016-11-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. The filter coefficients are signed values, where the product of the multiplication with one individual filter coefficient doesn't overflow a 16 bit signed value (the largest filter coefficient is 127). But when the products are accumulated, the resulting sum can overflow the 16 bit signed range. Instead of accumulating in 32 bit, we accumulate the largest product (either index 3 or 4) last with a saturated addition. (The VP8 MC asm does something similar, but slightly simpler, by accumulating each half of the filter separately. In the VP9 MC filters, each half of the filter can also overflow though, so the largest component has to be handled individually.) Examples of relative speedup compared to the C version, from checkasm: Cortex A7 A8 A9 A53 vp9_avg4_neon: 1.71 1.15 1.42 1.49 vp9_avg8_neon: 2.51 3.63 3.14 2.58 vp9_avg16_neon: 2.95 6.76 3.01 2.84 vp9_avg32_neon: 3.29 6.64 2.85 3.00 vp9_avg64_neon: 3.47 6.67 3.14 2.80 vp9_avg_8tap_smooth_4h_neon: 3.22 4.73 2.76 4.67 vp9_avg_8tap_smooth_4hv_neon: 3.67 4.76 3.28 4.71 vp9_avg_8tap_smooth_4v_neon: 5.52 7.60 4.60 6.31 vp9_avg_8tap_smooth_8h_neon: 6.22 9.04 5.12 9.32 vp9_avg_8tap_smooth_8hv_neon: 6.38 8.21 5.72 8.17 vp9_avg_8tap_smooth_8v_neon: 9.22 12.66 8.15 11.10 vp9_avg_8tap_smooth_64h_neon: 7.02 10.23 5.54 11.58 vp9_avg_8tap_smooth_64hv_neon: 6.76 9.46 5.93 9.40 vp9_avg_8tap_smooth_64v_neon: 10.76 14.13 9.46 13.37 vp9_put4_neon: 1.11 1.47 1.00 1.21 vp9_put8_neon: 1.23 2.17 1.94 1.48 vp9_put16_neon: 1.63 4.02 1.73 1.97 vp9_put32_neon: 1.56 4.92 2.00 1.96 vp9_put64_neon: 2.10 5.28 2.03 2.35 vp9_put_8tap_smooth_4h_neon: 3.11 4.35 2.63 4.35 vp9_put_8tap_smooth_4hv_neon: 3.67 4.69 3.25 4.71 vp9_put_8tap_smooth_4v_neon: 5.45 7.27 4.49 6.52 vp9_put_8tap_smooth_8h_neon: 5.97 8.18 4.81 8.56 vp9_put_8tap_smooth_8hv_neon: 6.39 7.90 5.64 8.15 vp9_put_8tap_smooth_8v_neon: 9.03 11.84 8.07 11.51 vp9_put_8tap_smooth_64h_neon: 6.78 9.48 4.88 10.89 vp9_put_8tap_smooth_64hv_neon: 6.99 8.87 5.94 9.56 vp9_put_8tap_smooth_64v_neon: 10.69 13.30 9.43 14.34 For the larger 8tap filters, the speedup vs C code is around 5-14x. This is significantly faster than libvpx's implementation of the same functions, at least when comparing the put_8tap_smooth_64 functions (compared to vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon from libvpx). Absolute runtimes from checkasm: Cortex A7 A8 A9 A53 vp9_put_8tap_smooth_64h_neon: 20150.3 14489.4 19733.6 10863.7 libvpx vpx_convolve8_horiz_neon: 52623.3 19736.4 21907.7 25027.7 vp9_put_8tap_smooth_64v_neon: 14455.0 12303.9 13746.4 9628.9 libvpx vpx_convolve8_vert_neon: 42090.0 17706.2 17659.9 16941.2 Thus, on the A9, the horizontal filter is only marginally faster than libvpx, while our version is significantly faster on the other cores, and the vertical filter is significantly faster on all cores. The difference is especially large on the A7. The libvpx implementation does the accumulation in 32 bit, which probably explains most of the differences. Signed-off-by: Martin Storsjö <martin@martin.st>
* h264chroma: Change type of stride parameters to ptrdiff_tDiego Biurrun2016-09-29
| | | | | This avoids SIMD-optimized functions having to sign-extend their stride argument manually to be able to do pointer arithmetic.
* idct: Change type of array stride parameters to ptrdiff_tDiego Biurrun2016-09-29
| | | | ptrdiff_t is the correct type for array strides and similar.
* hpeldsp: arm: Update comments left behind in ↵Diego Biurrun2016-09-29
| | | | 25841dfe806a13de526ae09c11149ab1f83555a8
* lavc: add clobber tests for the new encoding/decoding APIAnton Khirnov2016-09-28
|
* audiodsp: reorder arguments for vector_clipfAnton Khirnov2016-09-22
| | | | | | | This will make the x86 asm simpler. ARM conversion by Martin Storsjö <martin@martin.st> and Janne Grunau <janne-libav@jannau.net>
* blockdsp: drop the high_bit_depth parameterAnton Khirnov2016-09-22
| | | | | It has no effect, since the code is supposed to operate the same way for any bit depth.
* pixblockdsp: Change type of stride parameters to ptrdiff_tDiego Biurrun2016-09-14
| | | | | | | This avoids SIMD-optimized functions having to sign-extend their line size argument manually to be able to do pointer arithmetic. Also adjust parameter names to be "stride" everywhere.
* vp56: Separate VP5 and VP6 dsp initializationDiego Biurrun2016-08-26
| | | | | VP5 has no arch-specific optimizations (nor will it get some in the future), so it makes no sense to try to share dsp init code with VP6.
* vp8: Update some assembly comments left unchanged in bd66f073fe7286bd3cDiego Biurrun2016-08-26
|
* vp56: Change type of stride parameters to ptrdiff_tDiego Biurrun2016-08-26
| | | | | This avoids SIMD-optimized functions having to sign-extend their line size argument manually to be able to do pointer arithmetic.
* vp3: Change type of stride parameters to ptrdiff_tDiego Biurrun2016-08-26
| | | | | | | This avoids SIMD-optimized functions having to sign-extend their stride argument manually to be able to do pointer arithmetic. Also adjust parameter names to be "stride" everywhere.
* simple_idct: arm: Drop disabled code variantDiego Biurrun2016-08-17
|
* vp8/armv6: mc: avoid boolean expression in calculationJanne Grunau2016-07-10
| | | | | | | | | | | | GNU as evaluates true as '-1' while Apple's variant and llvm's internal assembler evaluate it as '1'. The best way to avoid this madness is to eliminate boolean expressions instead of trying to fix it with preprocessor directives. Use a direct formula to calculate the required temporary space on the stack in ff_put_vp8_{epel,bilin}{4,8,16}_h[246]v[246]_armv6(). Fixes a checkasm segfault in vp8dsp.mc when using llvm's internal assembler for a non-Apple target.
* arm: Fix a typo in a commentMartin Storsjö2016-07-06
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* libavcodec: fix constness in clobber test avcodec_open2() wrappersClément Bœsch2016-06-26
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* tests: Move all test programs to a subdirectoryDiego Biurrun2016-05-13
|
* cosmetics: Fix spelling mistakesVittorio Giovara2016-05-04
| | | | Signed-off-by: Diego Biurrun <diego@biurrun.de>
* build: miscellaneous cosmeticsDiego Biurrun2016-04-07
| | | | | | Restore alphabetical order in lists, break overly long lines, do some prettyprinting, add some explanatory section comments, group parts together that belong together logically.
* fft: Split MDCT bits off from FFTDiego Biurrun2016-03-01
|
* rdft: arm: Split RDFT initialization into a separate fileDiego Biurrun2016-02-26
|
* fft: arm: Drop unnecessary #include, add missing onesDiego Biurrun2016-02-26
|
* build: Add vc1dsp component for more fine-grained dependenciesDiego Biurrun2016-02-19
|
* dca: remove unused decode_hf function and quant_d tablesAlexandra Hájková2015-12-24
| | | | | They were superseded with their integer equivalents. Rename integer decode_hf to decode_hf.
* arm: add ff_int32_to_float_fmul_array8_neonJanne Grunau2015-12-14
| | | | | | | | | | | | | | | | | | | | | | | Quite a bit faster than int32_to_float_fmul_array8_c calling ff_int32_to_float_fmul_scalar_neon through FmtConvertContext. Number of cycles per int32_to_float_fmul_array8 call while decoding padded.dts on exynos5422: before after change cortex-a7: 1270 951 -25% cortex-a15: 434 285 -34% checkasm --bench cycle counts: cortex-a15 cortex-a7 int32_to_float_fmul_array8_c: 1730.4 4384.5 int32_to_float_fmul_array8_neon_c: 571.5 1694.3 int32_to_float_fmul_array8_neon: 374.0 1448.8 Interesting are the differences between int32_to_float_fmul_array8_neon_c and int32_to_float_fmul_array8_neon. The former is current behaviour of calling ff_int32_to_float_fmul_scalar_neon repeatedly from the c function, The raw numbers differ since checkasm uses different lengths than the dca decoder.
* arm: add a cpu flag for the VFPv2 vector modeJanne Grunau2015-12-14
| | | | | | | | | | | | | | The vector mode was deprecated in ARMv7-A/VFPv3 and various cpu implementations do not support it in hardware. Vector mode code will depending the OS either be emulated in software or result in an illegal instruction on cpus which does not support it. This was not really problem in practice since NEON implementations of the same functions are preferred. It will however become a problem for checkasm which tests every cpu flag separately. Since this is a cpu feature newer cpu do not support anymore the behaviour of this flag differs from the other flags. It can be only activated by runtime cpu feature selection.
* arm: use a local label instead of the function symbol in ff_prefetch_armJanne Grunau2015-07-20
| | | | | | | | Avoids a relocation which might end out of range for thumb2. Reported-By: Ludovic Fauvet <etix@videolan.org> Bug-Id: https://bugs.webkit.org/show_bug.cgi?id=137022 CC: libav-stable@libav.org
* h264: arm: use intra pred8x8 functions only for chroma_format_idc <= 1Janne Grunau2015-07-18
|
* configure: Factor out g722dsp moduleVittorio Giovara2015-07-17
|
* configure: Factor out vp8dsp moduleVittorio Giovara2015-07-17
|
* configure: Factor out rv34dsp moduleVittorio Giovara2015-07-17
|
* configure: Factor out flacdsp moduleVittorio Giovara2015-07-17
|
* lavc: do not compile fmtconvert unconditionallyAnton Khirnov2015-02-28
| | | | Only ac3dec and dcadec use it.
* fmtconvert: drop unused functionsAnton Khirnov2015-02-28
|
* g722: Add ARM NEON implementation for g722_apply_qmf()Peter Meerwald2015-02-15
| | | | | Signed-off-by: Peter Meerwald <pmeerw@pmeerw.net> Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: mlpdsp: handle pic offset calculation in a macroJanne Grunau2014-12-09
| | | | | Makes the code easier to read since it hides different offset calculations for arm and thumb mode.
* arm: make ff_mlp_filter_channel_arm and ff_mlp_rematrix_channel_arm position ↵Janne Grunau2014-12-09
| | | | | | independent No significant difference in used cpu cycles on a cortex-a9.
* arm: Use .data.rel.ro for const data with relocationsMartin Storsjö2014-12-09
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>