summaryrefslogtreecommitdiff
path: root/libavutil/x86
Commit message (Collapse)AuthorAge
* libavutil: include assembly with full path from source rootAlexander Kanavin2022-02-08
| | | | | | | | Otherwise nasm writes the full host-specific paths into .o output, which breaks binary reproducibility. Signed-off-by: Alexander Kanavin <alex.kanavin@gmail.com> Signed-off-by: Anton Khirnov <anton@khirnov.net>
* lavu/tx: refactor assembly codelet definitionLynne2022-02-07
| | | | | | | | | | | This commit does some refactoring to make defining assembly codelets smaller, and fixes compiler redefinition warnings. It also allows for other assembly versions to reuse the same boilerplate code as x86. Finally, it also adds the out_of_place flag to all assembly codelets. This changes nothing, as out-of-place operation was assumed to be available anyway, but this makes it more explicit.
* x86/tx_float: avoid redefining macrosLynne2022-02-02
| | | | FFT16_FN was used for fft8 and for fft16 afterwards.
* x86/tx_float: mark AVX2 functions as AVXSLOWLynne2022-01-29
| | | | | | | | | | | | | Makes Bulldozer prefer AVX functions rather than AVX2, which are 64% slower: AVX: 117653 decicycles in av_tx (fft), 1048535 runs, 41 skips AVX2: 193385 decicycles in av_tx (fft), 1048561 runs, 15 skips The only difference between both is that vgatherdpd is used in the former. We don't want to mark them with the new SLOW_GATHER flag however, since gathers are still faster on Haswell/Zen 2/3 than plain loads.
* x86/tx_float: add missing FF_TX_OUT_OF_PLACE flag to functionsLynne2022-01-27
| | | | This caused smaller length dedicated transforms to not be picked up.
* x86/tx_float: do not build tx_float_init.c if x86 assembly is disabledLynne2022-01-27
| | | | | | | | This broke builds with --disable-mmx, which also disabled assembly entirely, but ARCH_X86 was still true, so the init file tried to find assembly that didn't exist. Instead of checking for architecture, check if external x86 assembly is enabled.
* x86/tx_float: add permute-free FFT versionsLynne2022-01-26
| | | | These are used in the PFA transforms and MDCTs.
* lavu/tx: rewrite internal code as a tree-based codelet constructorLynne2022-01-26
| | | | | | | | | | This commit rewrites the internal transform code into a constructor that stitches transforms (codelets). This allows for transforms to reuse arbitrary parts of other transforms, and allows transforms to be stacked onto one another (such as a full iMDCT using a half-iMDCT which in turn uses an FFT). It also permits for each step to be individually replaced by assembly or a custom implementation (such as an ASIC).
* avutil/cpu: move slow gather checks below in the functionJames Almer2021-12-21
| | | | | | Put them together with other similar slow flag checks. Signed-off-by: James Almer <jamrial@gmail.com>
* libavutil/cpu: Add AV_CPU_FLAG_SLOW_GATHER.Alan Kelly2021-12-21
| | | | This flag is set on Haswell and earlier and all AMD cpus.
* x86/intmath: add VEX encoded versions of av_clipf() and av_clipd()James Almer2021-11-19
| | | | | | | Prevents mixing inlined SSE instructions and AVX instructions when the compiler generates the latter. Signed-off-by: James Almer <jamrial@gmail.com>
* libavutil/common: clip nan value to aminMark Reid2021-11-15
| | | | | | | | | | | | Changes av_clipf to return amin if a is nan. Before if a is nan av_clipf_c returned nan and av_clipf_sse would return amax. Now the both should behave the same. This works because nan > amin is false. The max(nan, amin) will be amin. Signed-off-by: James Almer <jamrial@gmail.com>
* x86/tx_float: correctly load the transform lengthLynne2021-07-18
| | | | | | | | The field is a standard field, yet we were loading it as if it was a quadword. This worked for forward transforms by chance, but broke when the transform was inverse. checkasm couldn't catch that because we only test forward transforms, which are identical to inverse transforms but with a different revtab.
* x86/tx_float: remove ff_ prefix from external constant tablesJames Almer2021-04-25
| | | | | | | Fixes compilation with some assemblers. Reviewed-by: Lynne Signed-off-by: James Almer <jamrial@gmail.com>
* x86/tx_float: fix forgotten 2-argument mulpsLynne2021-04-24
| | | | Yasm *really* cannot deal with any omitted arguments at all.
* x86/tx_float: use all arguments on vperm2f and vpermilps and reindent commentsLynne2021-04-24
| | | | Apparently even old nasm isn't required to accept incomplete instructions.
* x86/tx_float: Fixes compilation with old yasmJames Almer2021-04-24
| | | | | | | Use three operand format on some instructions, and lea to load effective addresses of tables. Signed-off-by: James Almer <jamrial@gmail.com>
* lavu/x86/tx_float: fix FMA3 implying AVX2 is availableLynne2021-04-24
| | | | It's the other way around - AVX2 implies FMA3 is available.
* lavu/x86: add FFT assemblyLynne2021-04-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx. The design of this pure assembly FFT is pretty unconventional. On the lowest level, instead of splitting the complex numbers into real and imaginary parts, we keep complex numbers together but split them in terms of parity. This saves a number of shuffles in each transform, but more importantly, it splits each transform into two independent paths, which we process using separate registers in parallel. This allows us to keep all units saturated and lets us use all available registers to avoid dependencies. Moreover, it allows us to double the granularity of our per-load permutation, skipping many expensive lookups and allowing us to use just 4 loads per register, rather than 8, or in case FMA3 (and by extension, AVX2), use the vgatherdpd instruction, which is at least as fast as 4 separate loads on old hardware, and quite a bit faster on modern CPUs). Higher up, we go for a bottom-up construction of large transforms, foregoing the traditional per-transform call-return recursion chains. Instead, we always start at the bottom-most basis transform (in this case, a 32-point transform), and continue constructing larger and larger transforms until we return to the top-most transform. This way, we only touch the stack 3 times per a complete target transform: once for the 1/2 length transform and two times for the 1/4 length transform. The combination algorithm we use is a standard Split-Radix algorithm, as used in our C code. Although a version with less operations exists (Steven G. Johnson and Matteo Frigo's "A modified split-radix FFT with fewer arithmetic operations", IEEE Trans. Signal Process. 55 (1), 111–119 (2007), which is the one FFTW uses), it only has 2% less operations and requires at least 4x the binary code (due to it needing 4 different paths to do a single transform). That version also has other issues which prevent it from being implemented with SIMD code as efficiently, which makes it lose the marginal gains it offered, and cannot be performed bottom-up, requiring many recursive call-return chains, whose overhead adds up. We go through a lot of effort to minimize load/stores by keeping as much in registers in between construcring transforms. This saves us around 32 cycles, on paper, but in reality a lot more due to load/store aliasing (a load from a memory location cannot be issued while there's a store pending, and there are only so many (2 for Zen 3) load/store units in a CPU). Also, we interleave coefficients during the last stage to save on a store+load per register. Each of the smallest, basis transforms (4, 8 and 16-point in our case) has been extremely optimized. Our 8-point transform is barely 20 instructions in total, beating our old implementation 8-point transform by 1 instruction. Our 2x8-point transform is 23 instructions, beating our old implementation by 6 instruction and needing 50% less cycles. Our 16-point transform's combination code takes slightly more instructions than our old implementation, but makes up for it by requiring a lot less arithmetic operations. Overall, the transform was optimized for the timings of Zen 3, which at the time of writing has the most IPC from all documented CPUs. Shuffles were preferred over arithmetic operations due to their 1/0.5 latency/throughput. On average, this code is 30% faster than our old libavcodec implementation. It's able to trade blows with the previously-untouchable FFTW on small transforms, and due to its tiny size and better prediction, outdoes FFTW on larger transforms by 11% on the largest currently supported size.
* Include attributes.h directlyAndreas Rheinhardt2021-04-19
| | | | | | | | Some files currently rely on libavutil/cpu.h to include it for them; yet said file won't use include it any more after the currently deprecated functions are removed, so include attributes.h directly. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* avutil/x86inc: fix warnings when assembling with Nasm 2.15Henrik Gramner2020-07-12
| | | | | | | | | Some new warnings regarding use of empty macro parameters has been added, so adjust some x86inc code to silence those. Fixes part of ticket #8771 Signed-off-by: James Almer <jamrial@gmail.com>
* libavutil: x86: Include stdlib.h before using _byteswap_ulongMartin Storsjö2020-01-23
| | | | | | | When clang works in MSVC mode, it does have the _byteswap_ulong builtin, but one has to include stdlib.h before using it. Signed-off-by: Martin Storsjö <martin@martin.st>
* x86/float_dsp: add ff_vector_dmul_{sse2,avx}James Almer2018-09-14
| | | | | | ~3x to 5x faster. Signed-off-by: James Almer <jamrial@gmail.com>
* x86/pixelutils: don't use the AVX2 functions on CPUs known to be slow with themJames Almer2018-07-31
| | | | Signed-off-by: James Almer <jamrial@gmail.com>
* x86/pixelutils: add missing preprocessor wrapper to the AVX2 functionsJames Almer2018-07-31
| | | | | | Should fix compilation with old yasm/nasm Signed-off-by: James Almer <jamrial@gmail.com>
* avutil/pixelutils: sad_32x32 sse2/avx2 optimizations.Jun Zhao2018-07-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | add ff_pixelutils_sad_32x32_sse2, ff_pixelutils_sad_{a,u}_32x32_sse2, ff_pixelutils_sad_32x32_avx22, ff_pixelutils_sad_{a,u}_32x32_avx2 use perf record/report profiling, get instructions:u for avx2 sad_32x32: 72.05% pixelutils pixelutils [.] block_sad_32x32_c 18.50% pixelutils pixelutils [.] block_sad_16x16_c 4.78% pixelutils pixelutils [.] block_sad_8x8_c 2.69% pixelutils pixelutils [.] block_sad_4x4_c 0.89% pixelutils pixelutils [.] block_sad_2x2_c 0.16% pixelutils pixelutils [.] ff_pixelutils_sad_32x32_avx2 0.16% pixelutils pixelutils [.] ff_pixelutils_sad_u_32x32_avx2 0.12% pixelutils pixelutils [.] ff_pixelutils_sad_a_32x32_avx2 sse2 sad_32x32 instructions:u like: 71.86% pixelutils pixelutils [.] block_sad_32x32_c 18.42% pixelutils pixelutils [.] block_sad_16x16_c 4.81% pixelutils pixelutils [.] block_sad_8x8_c 2.68% pixelutils pixelutils [.] block_sad_4x4_c 0.88% pixelutils pixelutils [.] block_sad_2x2_c 0.29% pixelutils pixelutils [.] ff_pixelutils_sad_32x32_sse2 0.26% pixelutils pixelutils [.] ff_pixelutils_sad_u_32x32_sse2 0.23% pixelutils pixelutils [.] ff_pixelutils_sad_a_32x32_sse2 Signed-off-by: Jun Zhao <mypopydev@gmail.com>
* lavu/x86/cpu: Fix aesni detectionalexander schmid2018-07-19
|
* avutil/pixelutils: correct the function name in commentsJun Zhao2018-07-11
| | | | Signed-off-by: Jun Zhao <mypopydev@gmail.com>
* Merge commit '4cf84e254ae75b524e1cacae499a97d7cc9e5906'James Almer2018-02-11
|\ | | | | | | | | | | | | * commit '4cf84e254ae75b524e1cacae499a97d7cc9e5906': Drop some unnecessary config.h #includes Merged-by: James Almer <jamrial@gmail.com>
| * Drop some unnecessary config.h #includesDiego Biurrun2018-02-06
| |
| * cpu: split flag checks per arch in av_cpu_max_align()James Almer2017-10-09
| | | | | | | | | | Signed-off-by: James Almer <jamrial@gmail.com> Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
* | x86inc: Drop cpuflags_slowctzHenrik Gramner2018-01-20
| |
* | x86inc: Correctly set mmreg variablesHenrik Gramner2018-01-20
| |
* | x86inc: Support creating global symbols from local labelsHenrik Gramner2018-01-20
| | | | | | | | | | On ELF platforms such symbols needs to be flagged as functions with the correct visibility to please certain linkers in some scenarios.
* | x86inc: Use .rdata instead of .rodata on WindowsHenrik Gramner2018-01-20
| | | | | | | | | | The standard section for read-only data on Windows is .rdata. Nasm will flag non-standard sections as executable by default which isn't ideal.
* | x86inc: Enable AVX emulation for floating-point pseudo-instructionsHenrik Gramner2018-01-20
| | | | | | | | | | | | | | There are 32 pseudo-instructions for each floating-point comparison instruction, but only 8 of them are actually valid in legacy-encoded mode. The remaining 24 requires the use of VEX-encoded (v-prefixed) instructions and can therefore be disregarded for this purpose.
* | x86inc: set the correct amount of simd regs in x86_64 when avx512 is enabled ↵James Almer2017-12-24
| | | | | | | | | | | | | | | | | | but not used Fixes compilation of libavresample/x86/audio_mix.asm Reviewed-by: Gramner Signed-off-by: James Almer <jamrial@gmail.com>
* | x86inc: AVX-512 supportHenrik Gramner2017-12-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AVX-512 consists of a plethora of different extensions, but in order to keep things a bit more manageable we group together the following extensions under a single baseline cpu flag which should cover SKL-X and future CPUs: * AVX-512 Foundation (F) * AVX-512 Conflict Detection Instructions (CD) * AVX-512 Byte and Word Instructions (BW) * AVX-512 Doubleword and Quadword Instructions (DQ) * AVX-512 Vector Length Extensions (VL) On x86-64 AVX-512 provides 16 additional vector registers, prefer using those over existing ones since it allows us to avoid using `vzeroupper` unless more than 16 vector registers are required. They also happen to be volatile on Windows which means that we don't need to save and restore existing xmm register contents unless more than 22 vector registers are required. Big thanks to Intel for their support.
* | avutil: add alignment needed for AVX-512James Darnley2017-12-24
| |
* | avutil: detect when AVX-512 is availableJames Darnley2017-12-24
| |
* | avutil: add AVX-512 flagsJames Darnley2017-12-24
| |
* | avutil/x86util : add macro for loading a 128 bits constants in an xmm or in ↵Martin Vignali2017-12-02
| | | | | | | | each part of an ymm in order to simplify avx2 asm func
* | Don't use _tzcnt instrinics with clang for windows w/o BMI.Dale Curtis2017-10-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Technically _tzcnt* intrinsics are only available when the BMI instruction set is present. However the instruction encoding degrades to "rep bsf" on older processors. Clang for Windows debatably restricts the _tzcnt* instrinics behind the __BMI__ architecture define, so check for its presence or exclude the usage of these intrinics when clang is present. See also: https://ffmpeg.org/pipermail/ffmpeg-devel/2015-November/183404.html https://bugs.llvm.org/show_bug.cgi?id=30506 http://lists.llvm.org/pipermail/cfe-dev/2016-October/051034.html Signed-off-by: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Matt Oliver <protogonoi@gmail.com> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
* | Merge commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2'James Almer2017-10-21
|\| | | | | | | | | | | | | | | | | * commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2': x86util: Port all macros to cpuflags See d5f8a642f6eb1c6e305c41dabddd0fd36ffb3f77 Merged-by: James Almer <jamrial@gmail.com>
| * x86util: Port all macros to cpuflagsDiego Biurrun2017-03-14
| | | | | | | | | | | | Also do some small cosmetic changes: Drop pointless _MMX suffix from ABSD2 macro name, drop pointless check for MMX support, we always assume MMX is available in our SIMD code, fix spelling.
| * build: Generalize yasm/nasm-related variable namesDiego Biurrun2017-03-01
| | | | | | | | None of them are specific to the YASM assembler.
* | avutil/cpu: split flag checks per arch in av_cpu_max_align()James Almer2017-09-27
| | | | | | | | Signed-off-by: James Almer <jamrial@gmail.com>
* | Merge commit '7abdd026df6a9a52d07d8174505b33cc89db7bf6'James Almer2017-09-26
|\| | | | | | | | | | | | | * commit '7abdd026df6a9a52d07d8174505b33cc89db7bf6': asm: Consistently uppercase SECTION markers Merged-by: James Almer <jamrial@gmail.com>
| * asm: Consistently uppercase SECTION markersDiego Biurrun2017-02-03
| |
| * x86inc: Avoid using eax/rax for storing the stack pointerHenrik Gramner2017-01-09
| | | | | | | | | | | | | | | | | | | | | | When allocating stack space with an alignment requirement that is larger than the current stack alignment we need to store a copy of the original stack pointer in order to be able to restore it later. If we chose to use another register for this purpose we should not pick eax/rax since it can be overwritten as a return value. Signed-off-by: Anton Khirnov <anton@khirnov.net>