path: root/libavutil/x86/x86util.asm
Commit message (Author, Age)
* avutil/x86util : add macro for loading a 128 bits constants in an xmm or in each part of an ymm in order to simplify avx2 asm func (Martin Vignali, 2017-12-02)
* Merge commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2' (James Almer, 2017-10-21)
|\
| |   * commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2':
| |     x86util: Port all macros to cpuflags
| |   See d5f8a642f6eb1c6e305c41dabddd0fd36ffb3f77
| |   Merged-by: James Almer <jamrial@gmail.com>
| * x86util: Port all macros to cpuflags (Diego Biurrun, 2017-03-14)
| |   Also do some small cosmetic changes: Drop pointless _MMX suffix from
| |   ABSD2 macro name, drop pointless check for MMX support, we always assume
| |   MMX is available in our SIMD code, fix spelling.
* | Add macros to x86util.asm . (Ivan Kalvachev, 2017-08-18)
| |   Improved version of VBROADCASTSS that works like the avx2 instruction.
| |   Emulation of vpbroadcastd.
| |   Horizontal sum HSUMPS that places the result in all elements.
| |   Emulation of blendvps and pblendvb.
| |   Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
* | x86/aacpsdsp: add ff_ps_hybrid_synthesis_deint_{sse,sse4} (James Almer, 2017-06-18)
| |   About 2x faster than the c version.
* | avutil/x86util: don't use movss in VBROADCASTSS macro when src and dst args are the same (James Almer, 2017-03-21)
| |   Reviewed-by: Henrik Gramner <henrik@gramner.com>
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | Merge commit '07e1f99a1bb41d1a615676140eefc85cf69fa793' (Clément Bœsch, 2017-03-20)
|\|   * commit '07e1f99a1bb41d1a615676140eefc85cf69fa793':
| |     x86util: Document SBUTTERFLY macro
| |   Merged-by: Clément Bœsch <u@pkh.me>
| * x86util: Document SBUTTERFLY macro (Alexandra Hájková, 2016-09-19)
| |   Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
| * x86util: Extend SPLATW for avx2 (James Almer, 2016-07-18)
| |   Integration to Libav by Josh de Kock <josh@itanimul.li>.
| |   Signed-off-by: Alexandra Hájková <alexandra@khirnov.net>
| * v210enc: Add SIMD optimised 8-bit and 10-bit encoders (Kieran Kunhya, 2014-12-05)
| |   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
| |   Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
* | avcodec/h264: sse2, avx h luma mbaff deblock/loop filter (James Darnley, 2017-02-18)
| |   x86-64 only
| |   Yorkfield:
| |    - sse2: ~2.17x (434 vs. 200 cycles)
| |   Nehalem:
| |    - sse2: ~2.94x (409 vs. 139 cycles)
| |   Skylake:
| |    - sse2: ~3.10x (370 vs. 119 cycles)
| |    - avx: ~3.29x (370 vs. 112 cycles)
* | x86util: import MOVHL macro (James Darnley, 2017-02-18)
| |   Originally committed to x264 in 1637239a by Henrik Gramner who has
| |   agreed to re-license it as LGPL. Original commit message follows.
| |   x86: Avoid some bypass delays and false dependencies
| |   A bypass delay of 1-3 clock cycles may occur on some CPUs when transitioning
| |   between int and float domains, so try to avoid that if possible.
* | avcodec/x86: deduplicate PASS8ROWS macro (James Darnley, 2017-02-18)
| |
* | vp9: add 16x16 idct avx2 (8-bit). (Ronald S. Bultje, 2016-07-11)
| |   checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
| |   that it's about 1.65x as fast as the AVX version for the full IDCT, and
| |   similar speedups for the sub-IDCTs:
| |   nop: 24.6
| |   vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
| |   vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
| |   vp9_inv_dct_dct_16x16_add_8_1_ssse3: 484.4
| |   vp9_inv_dct_dct_16x16_add_8_1_avx: 661.2
| |   vp9_inv_dct_dct_16x16_add_8_1_avx2: 311.5
| |   vp9_inv_dct_dct_16x16_add_8_2_c: 6665.7
| |   vp9_inv_dct_dct_16x16_add_8_2_sse2: 646.9
| |   vp9_inv_dct_dct_16x16_add_8_2_ssse3: 455.2
| |   vp9_inv_dct_dct_16x16_add_8_2_avx: 521.9
| |   vp9_inv_dct_dct_16x16_add_8_2_avx2: 304.3
| |   vp9_inv_dct_dct_16x16_add_8_4_c: 7022.7
| |   vp9_inv_dct_dct_16x16_add_8_4_sse2: 647.4
| |   vp9_inv_dct_dct_16x16_add_8_4_ssse3: 467.1
| |   vp9_inv_dct_dct_16x16_add_8_4_avx: 446.1
| |   vp9_inv_dct_dct_16x16_add_8_4_avx2: 297.0
| |   vp9_inv_dct_dct_16x16_add_8_8_c: 6800.4
| |   vp9_inv_dct_dct_16x16_add_8_8_sse2: 598.6
| |   vp9_inv_dct_dct_16x16_add_8_8_ssse3: 465.7
| |   vp9_inv_dct_dct_16x16_add_8_8_avx: 440.9
| |   vp9_inv_dct_dct_16x16_add_8_8_avx2: 290.2
| |   vp9_inv_dct_dct_16x16_add_8_16_c: 6626.6
| |   vp9_inv_dct_dct_16x16_add_8_16_sse2: 599.5
| |   vp9_inv_dct_dct_16x16_add_8_16_ssse3: 475.0
| |   vp9_inv_dct_dct_16x16_add_8_16_avx: 469.9
| |   vp9_inv_dct_dct_16x16_add_8_16_avx2: 286.4
* | x86/showcqt: use three operand format for some instructions (James Almer, 2016-06-08)
| |   Fixes failures with yasm 1.1.0 and older
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | avutil/x86util: move haddps sse emulation from showcqt (James Almer, 2016-06-08)
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | x86: port PSIGNW to cpuflags (James Almer, 2015-09-11)
| |   Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | x86: move XOP emulation code back to x86inc (James Almer, 2015-08-03)
| |   Only two functions that use xop multiply-accumulate instructions where the
| |   first operand is the same as the fourth actually took advantage of the macros.
| |   This further reduces differences with x264's x86inc.
| |   Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | x86/swr: add SSE2/AVX pack_8ch functions (James Almer, 2014-12-30)
| |   Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
| |   Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
| |   Signed-off-by: James Almer <jamrial@gmail.com>
* | v210enc: Add SIMD optimised 8-bit and 10-bit encoders (Kieran Kunhya, 2014-11-26)
| |   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* | x86/hevc_deblock: improve 8bit transpose store macros (James Almer, 2014-08-03)
| |   Up to four instructions less depending on function and instruction set.
| |   Signed-off-by: James Almer <jamrial@gmail.com>
| |   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* | x86/hevc_idct: replace old and unused idct functions (James Almer, 2014-07-26)
| |   Only 8-bit and 10-bit idct_dc() functions are included (adding others
| |   should be trivial).
| |   Benchmarks on an Intel Core i5-4200U:
| |   idct8x8_dc
| |            SSE2  MMXEXT  C
| |   cycles   22    26      57
| |   idct16x16_dc
| |            AVX2  SSE2  C
| |   cycles   27    32    249
| |   idct32x32_dc
| |            AVX2  SSE2  C
| |   cycles   62    126   1375
| |   Signed-off-by: James Almer <jamrial@gmail.com>
| |   Reviewed-by: Mickaël Raulet <mraulet@gmail.com>
| |   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* | x86util: add and use RSHIFT/LSHIFT macros (Christophe Gisquet, 2014-06-15)
| |   Those macros take a byte number as shift argument, as this argument
| |   differs between MMX and SSE2 instructions.
| |   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* | x86: hpeldsp: better factorization (Christophe Gisquet, 2014-05-29)
| |   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* | x86/dsputilenc: implement SSE2 versions of pix_{sum16, norm1} (James Almer, 2014-05-28)
| |   Signed-off-by: James Almer <jamrial@gmail.com>
| |   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* | x86: move horizontal add macros to x86util (James Almer, 2014-04-17)
| |   Also port relevant AVX2/XOP optimizations from x264 with permission to
| |   relicense to LGPL from the corresponding authors
| |   Signed-off-by: James Almer <jamrial@gmail.com>
| |   Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
| |   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* | x86: Move XOP emulation to x86util (James Almer, 2014-02-24)
| |   We need the emulation to support the cases where the first
| |   argument is the same as the fourth. To achieve this a fifth
| |   argument working as a temporary may be needed.
| |   Emulation that doesn't obey the original instruction semantics
| |   can't be in x86inc.
| |   Signed-off-by: James Almer <jamrial@gmail.com>
| |   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* | Merge commit 'c6908d6b4b377a04a5d055ba874bdbcf06c80497' (Michael Niedermayer, 2013-10-14)
|\|   * commit 'c6908d6b4b377a04a5d055ba874bdbcf06c80497':
| |     x86inc: FMA3/4 Support
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86inc: FMA3/4 Support (Jason Garrett-Glaser, 2013-10-14)
| |   Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
* | Merge commit '206895708ea2b464755d340e44501daf9a07c310' (Michael Niedermayer, 2013-10-14)
|\|   * commit '206895708ea2b464755d340e44501daf9a07c310':
| |     x86inc: Remove our FMA4 support
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86inc: Remove our FMA4 support (Derek Buitenhuis, 2013-10-14)
| |   This is so we can sync to x264's version of FMA4 support.
| |   This partialy reverts commit 79687079a97a039c325ab79d7a95920d800b791f.
| |   Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
* | Merge commit 'd633d12b2cc999cee3ac25bf9a810fe7ff03726d' (Michael Niedermayer, 2013-01-19)
|\|   * commit 'd633d12b2cc999cee3ac25bf9a810fe7ff03726d':
| |     x86inc: Add cvisible macro for C functions with public prefix
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86inc: Add cvisible macro for C functions with public prefix (Diego Biurrun, 2013-01-18)
| |   This allows defining externally visible library symbols.
| |   Signed-off-by: Diego Biurrun <diego@biurrun.de>
* | Merge commit 'ef5d41a5534b65f03d02f2e11a503ab8416bfc3b' (Michael Niedermayer, 2013-01-19)
|\|   * commit 'ef5d41a5534b65f03d02f2e11a503ab8416bfc3b':
| |     x86inc: Rename "program_name" to "private_prefix"
| |     configure: Run SHFLAGS through ldflags_filter()
| |   Conflicts:
| |     configure
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86inc: Rename "program_name" to "private_prefix" (Diego Biurrun, 2013-01-18)
| |   The new name is more descriptive and will allow defining a separate
| |   public prefix for externally visible library symbols.
| |   Signed-off-by: Diego Biurrun <diego@biurrun.de>
* | Merge commit 'dae1d507af94261bafd3b11549884e5d1eca590e' (Michael Niedermayer, 2013-01-16)
|\|   * commit 'dae1d507af94261bafd3b11549884e5d1eca590e':
| |     x86: Add PAVGB macro to abstract pavgb/pavgusb instruction via cpuflags
| |     vf_fps: add final flushed frames to the dropped frame count
| |     rv34_parser: Adjust #if for disabling individual parsers
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86: Add PAVGB macro to abstract pavgb/pavgusb instruction via cpuflags (Diego Biurrun, 2013-01-15)
| |
* | Merge remote-tracking branch 'qatar/master' (Michael Niedermayer, 2013-01-15)
|\|   * qatar/master:
| |     x86: ABSB2: port to cpuflags
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86: ABSB2: port to cpuflags (Diego Biurrun, 2013-01-15)
| |
* | Merge commit '094a7405e5d8463d7d167d893e04934ec1a84ecd' (Michael Niedermayer, 2013-01-15)
|\|   * commit '094a7405e5d8463d7d167d893e04934ec1a84ecd':
| |     x86: ABSB: port to cpuflags
| |     sdp: Include SRTP crypto params if using the srtp protocol
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86: ABSB: port to cpuflags (Diego Biurrun, 2013-01-15)
| |
* | Merge commit 'd8c772de53d29afb1bada88afa859fce8489c668' (Michael Niedermayer, 2013-01-15)
|\|   * commit 'd8c772de53d29afb1bada88afa859fce8489c668':
| |     nutdec: Always return a value from nut_read_timestamp()
| |     configure: Make warnings from -Wreturn-type fatal errors
| |     x86: ABS2: port to cpuflags
| |     vdpau: Remove av_unused attribute from function declaration
| |     h264: fix ff_generate_sliding_window_mmcos() prototype.
| |   Conflicts:
| |     configure
| |     libavformat/nutdec.c
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86: ABS2: port to cpuflags (Diego Biurrun, 2013-01-14)
| |
* | Merge commit '5b4dfbffc258f90a7d2540d21209ac23afcf7cd0' (Michael Niedermayer, 2013-01-07)
|\|   * commit '5b4dfbffc258f90a7d2540d21209ac23afcf7cd0':
| |     x86: ABS1: port to cpuflags
| |     v210x: cosmetics, reformat
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86: ABS1: port to cpuflags (Diego Biurrun, 2013-01-06)
| |
* | Merge commit '9d5c62ba5b586c80af508b5914934b1c439f6652' (Michael Niedermayer, 2012-12-06)
|\|   * commit '9d5c62ba5b586c80af508b5914934b1c439f6652':
| |     lavu/opt: do not filter out the initial sign character except for flags
| |     eval: treat dB as decibels instead of decibytes
| |     float_dsp: add vector_dmul_scalar() to multiply a vector of doubles
| |   Conflicts:
| |     libavutil/eval.c
| |     tests/ref/fate/eval
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * float_dsp: add vector_dmul_scalar() to multiply a vector of doubles (Justin Ruggles, 2012-12-05)
| |   Include x86-optimized versions for SSE2 and AVX.
* | Merge remote-tracking branch 'qatar/master' (Michael Niedermayer, 2012-11-19)
|\|   * qatar/master:
| |     x86: h264_intrapred: Fix C function names in comments
| |     x86: SPLATD: port to cpuflags
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86: SPLATD: port to cpuflags (Diego Biurrun, 2012-11-18)
| |
* | Merge remote-tracking branch 'qatar/master' (Michael Niedermayer, 2012-11-14)
|\|   * qatar/master:
| |     x86: mmx2 ---> mmxext in asm constructs
| |   Conflicts:
| |     libavcodec/x86/h264_chromamc_10bit.asm
| |     libavcodec/x86/h264_deblock.asm
| |     libavcodec/x86/h264dsp_init.c
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>