path: root/libavcodec/x86
Commit message (author, date)
* aacenc: add SIMD optimizations for abs_pow34 and quantization (Rostislav Pehlivanov, 2016-10-18)
|
|   Performance improvements:
|
|   quant_bands:
|   with:    681 decicycles in quant_bands, 8388453 runs, 155 skips
|   without: 1190 decicycles in quant_bands, 8388386 runs, 222 skips
|   Around 42% for the function
|
|   Twoloop coder:
|
|   abs_pow34:
|   with/without: 7.82s/8.17s
|   Around 4% for the entire encoder
|
|   Both:
|   with/without: 7.15s/8.17s
|   Around 12% for the entire encoder
|
|   Fast coder:
|
|   abs_pow34:
|   with/without: 3.40s/3.77s
|   Around 10% for the entire encoder
|
|   Both:
|   with/without: 3.02s/3.77s
|   Around 20% faster for the entire encoder
|
|   Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
|   Tested-by: Michael Niedermayer <michael@niedermayer.cc>
|   Reviewed-by: James Almer <jamrial@gmail.com>
* avcodec: fix arguments on xmm/neon clobber test wrappers (James Almer, 2016-10-02)
|
|   Signed-off-by: James Almer <jamrial@gmail.com>
* avcodec: add missing xmm/neon clobber test wrappers for the new encode API (James Almer, 2016-10-01)
|
|   Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
|   Signed-off-by: James Almer <jamrial@gmail.com>
* x86/h264_weight: use appropriate register size for weight parameters (Hendrik Leppkes, 2016-09-23)
|
|   Fixes trac 5579
|
|   Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
|   Acked-by: Michael Niedermayer <michael@niedermayer.cc>
* avcodec/h264: Use ptrdiff_t for (bi)weight functions (Michael Niedermayer, 2016-09-23)
|
|   Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* avcodec/ttadsp: cosmetics (James Almer, 2016-08-06)
|
|   Clean some header includes and use the same naming scheme as in
|   ttaencdsp
|
|   Signed-off-by: James Almer <jamrial@gmail.com>
* x86/ttaenc: add ff_ttaenc_filter_process_{ssse3,sse4} (James Almer, 2016-08-02)
|
|   Signed-off-by: James Almer <jamrial@gmail.com>
* Merge commit '9df889a5f116c1ee78c2f239e0ba599c492431aa' (Clément Bœsch, 2016-07-29)
|\
| |   * commit '9df889a5f116c1ee78c2f239e0ba599c492431aa':
| |     h264: rename h264.[ch] to h264dec.[ch]
| |
| |   Merged-by: Clément Bœsch <u@pkh.me>
| * h264: rename h264.[ch] to h264dec.[ch] (Anton Khirnov, 2016-06-21)
| |
| |   This is more consistent with the naming of other decoders.
* | vp9: add mmxext versions of the single-block (w=8,npx=8) h/v loopfilters. (Ronald S. Bultje, 2016-07-26)
| |
| |   Each takes about 0.1% of runtime in my profiles, and they didn't have
| |   any SIMD so far (we only had SIMD for the npx=16 double-block versions).
* | vp9: add mmxext versions of the single-block (w=4,npx=8) h/v loopfilters. (Ronald S. Bultje, 2016-07-26)
| |
| |   Each takes about 0.5% of runtime in my profiles, and they didn't have
| |   any SIMD so far (we only had SIMD for the npx=16 double-block versions).
* | vp9: add 32x32 idct AVX2 implementation. (Ronald S. Bultje, 2016-07-26)
| |
| |   About 1.8x speedup compared to AVX version for full IDCT. Other
| |   sub-IDCT scenarios also see speedups. Full --bench output for
| |   idct_32x32_add_{bpp}_${subidct}_${opt} (50k cycles):
| |
| |   nop: 16.5
| |   vp9_inv_dct_dct_32x32_add_8_1_c: 2284.4
| |   vp9_inv_dct_dct_32x32_add_8_1_sse2: 145.0
| |   vp9_inv_dct_dct_32x32_add_8_1_ssse3: 137.4
| |   vp9_inv_dct_dct_32x32_add_8_1_avx: 137.1
| |   vp9_inv_dct_dct_32x32_add_8_1_avx2: 73.2
| |   vp9_inv_dct_dct_32x32_add_8_2_c: 14680.8
| |   vp9_inv_dct_dct_32x32_add_8_2_sse2: 2617.2
| |   vp9_inv_dct_dct_32x32_add_8_2_ssse3: 982.9
| |   vp9_inv_dct_dct_32x32_add_8_2_avx: 958.5
| |   vp9_inv_dct_dct_32x32_add_8_2_avx2: 704.2
| |   vp9_inv_dct_dct_32x32_add_8_4_c: 14443.1
| |   vp9_inv_dct_dct_32x32_add_8_4_sse2: 2717.1
| |   vp9_inv_dct_dct_32x32_add_8_4_ssse3: 965.7
| |   vp9_inv_dct_dct_32x32_add_8_4_avx: 1000.7
| |   vp9_inv_dct_dct_32x32_add_8_4_avx2: 717.1
| |   vp9_inv_dct_dct_32x32_add_8_8_c: 14436.4
| |   vp9_inv_dct_dct_32x32_add_8_8_sse2: 2671.8
| |   vp9_inv_dct_dct_32x32_add_8_8_ssse3: 1038.5
| |   vp9_inv_dct_dct_32x32_add_8_8_avx: 983.0
| |   vp9_inv_dct_dct_32x32_add_8_8_avx2: 729.4
| |   vp9_inv_dct_dct_32x32_add_8_16_c: 14614.7
| |   vp9_inv_dct_dct_32x32_add_8_16_sse2: 2701.7
| |   vp9_inv_dct_dct_32x32_add_8_16_ssse3: 1334.4
| |   vp9_inv_dct_dct_32x32_add_8_16_avx: 1276.7
| |   vp9_inv_dct_dct_32x32_add_8_16_avx2: 719.5
| |   vp9_inv_dct_dct_32x32_add_8_32_c: 14363.6
| |   vp9_inv_dct_dct_32x32_add_8_32_sse2: 2575.6
| |   vp9_inv_dct_dct_32x32_add_8_32_ssse3: 2633.9
| |   vp9_inv_dct_dct_32x32_add_8_32_avx: 2539.6
| |   vp9_inv_dct_dct_32x32_add_8_32_avx2: 1395.0
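The "about 1.8x" figure in the commit above can be checked against the listed numbers for the full IDCT (the `_32` sub-IDCT rows: 2539.6 for AVX, 1395.0 for AVX2). A small sketch of that arithmetic, where the helper name `speedup` is made up for the example and the benchmark overhead (the `nop` row) is deliberately not subtracted:

```c
/* Speedup of a new implementation over a baseline, given the
 * benchmark readings of both (lower reading means faster code). */
static double speedup(double baseline, double optimized)
{
    return baseline / optimized;
}
```

Calling `speedup(2539.6, 1395.0)` gives a ratio a little above 1.8, consistent with the claim in the commit message.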
* | x86/diracdsp: make ff_put_signed_rect_clamped_10_sse4 work on x86_32 (James Almer, 2016-07-20)
| |
| |   Reviewed-by: Rostislav Pehlivanov <atomnuker@gmail.com>
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | diracdsp_init: add missing ARCH_X86_64 check (Rostislav Pehlivanov, 2016-07-12)
| |
| |   That SIMD is still x86_64 only for now.
| |
| |   Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
* | diracdsp: add SIMD for the 10 bit version of put_signed_rect_clamped (Rostislav Pehlivanov, 2016-07-11)
| |
| |   Signed-off-by: Rostislav Pehlivanov <rpehlivanov@obe.tv>
* | diracdsp: add dequantization SIMD (Rostislav Pehlivanov, 2016-07-11)
| |
| |   Currently unused, to be used in the following commits.
| |
| |   Signed-off-by: Rostislav Pehlivanov <rpehlivanov@obe.tv>
* | vp9: add 16x16 idct avx2 (8-bit). (Ronald S. Bultje, 2016-07-11)
| |
| |   checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt},
| |   shows that it's about 1.65x as fast as the AVX version for the full
| |   IDCT, and similar speedups for the sub-IDCTs:
| |
| |   nop: 24.6
| |   vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
| |   vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
| |   vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
| |   vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
| |   vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
| |   vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
| |   vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
| |   vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
| |   vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
| |   vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
| |   vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
| |   vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
| |   vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
| |   vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
| |   vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
| |   vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
| |   vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
| |   vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
| |   vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
| |   vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
| |   vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
| |   vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
| |   vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
| |   vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
| |   vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
* | Merge commit 'f1a9eee41c4b5ea35db9ff0088ce4e6f1e187f2c' (Clément Bœsch, 2016-07-09)
|\|
| |   * commit 'f1a9eee41c4b5ea35db9ff0088ce4e6f1e187f2c':
| |     x86: Add missing movsxd for the int stride parameter
| |
| |   Merged-by: Clément Bœsch <u@pkh.me>
| * x86: Add missing movsxd for the int stride parameter (Martin Storsjö, 2016-06-17)
| |
| |   Signed-off-by: Martin Storsjö <martin@martin.st>
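The fix above concerns a 32-bit `int` stride that is used in 64-bit address arithmetic. On x86-64 the upper half of a register holding a 32-bit argument is not guaranteed to be sign-extended by the caller, so assembly has to widen it explicitly, which is what `movsxd` does. The C sketch below only illustrates the difference between the two widenings; the helper names are made up for the example and do not appear in the commit.

```c
#include <stdint.h>

/* Widen a 32-bit stride the way movsxd does: replicate the sign bit
 * into the upper 32 bits. */
static int64_t widen_signed(int32_t stride)
{
    return (int64_t)stride;
}

/* Widen by zero-filling the upper 32 bits, which is what the value
 * looks like if the upper half of the register happens to be zero. */
static int64_t widen_zero(int32_t stride)
{
    return (int64_t)(uint32_t)stride;
}
```

For a positive stride both helpers agree. For a negative stride such as -32, the zero-filled value becomes a large positive offset, so a pointer advanced by it would land far from the intended row.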
| * asm: FF_-prefix internal macros used in inline assembly (Diego Biurrun, 2016-05-28)
| |
| |   These macros conflict with system macros on Solaris, producing
| |   truckloads of warnings about macro redefinition.
* | x86/dcadsp: optimize lfe_fir0_float_fma3 on x86_32 (James Almer, 2016-07-05)
| |
| |   About 10% faster.
| |
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | avcodec: add missing xmm/neon clobber test wrappers for the new decode API (James Almer, 2016-07-03)
| |
| |   Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | asm: FF_-prefix internal macros used in inline assembly (Matthieu Bouron, 2016-06-27)
| |
| |   See merge commit '39d6d3618d48625decaff7d9bdbb45b44ef2a805'.
* | Merge commit 'dc40a70c5755bccfb1a1349639943e1f408bea50' (Hendrik Leppkes, 2016-06-26)
|\|
| |   * commit 'dc40a70c5755bccfb1a1349639943e1f408bea50':
| |     Drop unnecessary libavutil/x86/asm.h #includes
| |
| |   Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
| * Drop unnecessary libavutil/x86/asm.h #includes (Diego Biurrun, 2016-05-28)
| |
* | Merge commit 'a6a750c7ef240b72ce01e9653343a0ddf247d196' (Clément Bœsch, 2016-06-22)
|\|
| |   * commit 'a6a750c7ef240b72ce01e9653343a0ddf247d196':
| |     tests: Move all test programs to a subdirectory
| |
| |   Merged-by: Clément Bœsch <clement@stupeflix.com>
| * tests: Move all test programs to a subdirectory (Diego Biurrun, 2016-05-13)
| |
* | Merge commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb' (Clément Bœsch, 2016-06-21)
|\|
| |   * commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb':
| |     cosmetics: Fix spelling mistakes
| |
| |   Merged-by: Clément Bœsch <u@pkh.me>
| * cosmetics: Fix spelling mistakes (Vittorio Giovara, 2016-05-04)
| |
| |   Signed-off-by: Diego Biurrun <diego@biurrun.de>
| * build: miscellaneous cosmetics (Diego Biurrun, 2016-04-07)
| |
| |   Restore alphabetical order in lists, break overly long lines, do some
| |   prettyprinting, add some explanatory section comments, group parts
| |   together that belong together logically.
| * fft: Split MDCT bits off from FFT (Diego Biurrun, 2016-03-01)
| |
* | x86/aacpsdsp: optimize add_squares loop (James Almer, 2016-06-14)
| |
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | x86/aacdec: use HADDPS macro (James Almer, 2016-06-08)
| |
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | x86: lossless audio: SSE4 madd 32bits (Christophe Gisquet, 2016-05-07)
| |
| |   The only user so far is wmalossless at 24 bits. The few samples
| |   tested show an order of 8, so more unrolling or an avx2 version does
| |   not make sense.
| |
| |   Timings: 68 -> 49 cycles
| |
| |   Reviewed-by: Paul B Mahol <onemda@gmail.com>
| |   Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* | Merge commit '73ff983e8dd22ccee166403d0bbbc9c1cd543622' (Derek Buitenhuis, 2016-04-12)
|\|
| |   * commit '73ff983e8dd22ccee166403d0bbbc9c1cd543622':
| |     fft: x86: cosmetics: Drop silly comments, add comment, whitespace
| |
| |   Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
| * fft: x86: cosmetics: Drop silly comments, add comment, whitespace (Diego Biurrun, 2016-02-26)
| |
| * x86: hevc: Fix linking with both yasm and optimizations disabled (Diego Biurrun, 2016-02-23)
| |
| |   Some optimized functions reference optimized symbols, so the functions
| |   must be explicitly disabled when those symbols are unavailable.
* | avcodec/fft: Add revtab32 for FFTs with more than 65536 samples (Michael Niedermayer, 2016-03-04)
| |
| |   x86 optimizations are used only for the cases they support (<=65536 samples)
| |
| |   Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
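The need for a 32-bit table in the commit above follows from a little arithmetic: a 16-bit entry can hold indices 0..65535, so a 2^17-point transform has indices that no longer fit. The sketch below uses plain bit reversal as a stand-in permutation. That is an assumption made for illustration, not a claim about the permutation FFmpeg's FFT actually stores, and the function name is hypothetical.

```c
#include <stdint.h>

/* Reverse the low `nbits` bits of `x`. With nbits = 17 the result can
 * reach 65536 and above, which a uint16_t table entry cannot hold. */
static uint32_t bit_reverse(uint32_t x, int nbits)
{
    uint32_t r = 0;
    for (int i = 0; i < nbits; i++) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}
```

For example, index 1 in a 17-bit transform maps to 65536, which truncates to 0 when stored in 16 bits, whereas every index of a 16-bit transform stays within 0..65535.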
* | avcodec: Extend fft to size 2^17 (Michael Niedermayer, 2016-03-04)
| |
| |   Asked-for-by: durandal_1707
| |   Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* | x86/vc1dsp: Split the file into MC and loopfilter (Timothy Gu, 2016-02-29)
| |
* | Merge commit '15a24614aef5836af3cd2c7cc3b2b737eee6bf3c' (Derek Buitenhuis, 2016-02-24)
|\|
| |   * commit '15a24614aef5836af3cd2c7cc3b2b737eee6bf3c':
| |     build: Add vc1dsp component for more fine-grained dependencies
| |
| |   Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
| * build: Add vc1dsp component for more fine-grained dependencies (Diego Biurrun, 2016-02-19)
| |
* | x86/dcadec: add ff_lfe_fir1_float_{sse3,avx} (James Almer, 2016-02-22)
| |
| |   Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | Merge commit 'e280fe13291e9c712a5f4aa13b5263f3e8afed45' (Derek Buitenhuis, 2016-02-16)
|\|
| |   * commit 'e280fe13291e9c712a5f4aa13b5263f3e8afed45':
| |     v210: Use separate sample_factors
| |
| |   Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
| * v210: Use separate sample_factors (Luca Barbato, 2016-02-01)
| |
| |   The 10-bit and the 8-bit functions can now be implemented to process a
| |   different number of samples.
| |
| |   While at it, simplify the code a little.
| * v210: Add avx2 version of the 10-bit line encoder (James Darnley, 2016-02-01)
| |
| |   Around 25% faster than the ssse3 version.
| |
| |   Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
| * v210: Add avx2 version of the 8-bit line encoder (James Darnley, 2016-02-01)
| |
| |   Around 35% faster than the avx version.
| |
| |   Signed-off-by: Henrik Gramner <henrik@gramner.com>
| |   Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
* | Merge commit 'eafb05fcf37cd19a910ca3b17824384f9006bc0a' (Derek Buitenhuis, 2016-02-16)
|\|
| |   * commit 'eafb05fcf37cd19a910ca3b17824384f9006bc0a':
| |     v210: x86: Add the correct guards around the asm code
| |
| |   Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
| * v210: x86: Add the correct guards around the asm code (Luca Barbato, 2016-01-26)
| |
| |   Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
| * x86inc: Add debug symbols indicating sizes of compiled functions (Geza Lore, 2016-01-23)
| |
| |   Some debuggers/profilers use this metadata to determine which function
| |   a given instruction is in; without it they can get confused by local
| |   labels (if you haven't stripped those). On the other hand, some tools
| |   are still confused even with this metadata. e.g. this fixes `gdb`, but
| |   not `perf`.
| |
| |   Currently only implemented for ELF.
| |
| |   Signed-off-by: Anton Khirnov <anton@khirnov.net>