path: root/libavcodec/x86/sbrdsp_init.c
Commit message (Author, Age)
* x86: sbrdsp: Implement SSE qmf_post_shuffle (Christophe Gisquet, 2013-01-06)
    255 to 174 cycles on Arrandale / Win64. Unrolling yields no gain.
    Signed-off-by: Diego Biurrun <diego@biurrun.de>
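For context, the operation being optimized can be written as a short scalar loop. This is a sketch of what qmf_post_shuffle is assumed to compute (negate and reverse the upper half of a 64-float buffer, interleave it with the lower half); the name `qmf_post_shuffle_ref` and its signature are illustrative and not taken from this log.

```c
/* Scalar sketch (assumed behaviour, hypothetical name):
 * W[k] = (-z[63 - k], z[k]) for k = 0..31. */
static void qmf_post_shuffle_ref(float W[32][2], const float *z)
{
    for (int k = 0; k < 32; k++) {
        W[k][0] = -z[63 - k]; /* upper half, reversed and negated */
        W[k][1] =  z[k];      /* lower half, in order */
    }
}
```

Each output pair mixes one element read backwards with one read forwards, which is why a SIMD version is mostly shuffles and a sign flip.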
* x86: sbrdsp: Implement SSE sum64x5 (Christophe Gisquet, 2013-01-06)
    698 to 174 cycles on Arrandale. Unrolling gains 6 cycles.
    Signed-off-by: Diego Biurrun <diego@biurrun.de>
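As a point of reference, sum64x5 is assumed here to fold five consecutive 64-float blocks into the first one. The sketch below uses the hypothetical name `sum64x5_ref`; the buffer layout (320 contiguous floats) is an assumption, not something this log states.

```c
/* Scalar sketch (assumed behaviour, hypothetical name):
 * z[k] becomes the sum of z[k], z[k+64], z[k+128], z[k+192], z[k+256]. */
static void sum64x5_ref(float *z)
{
    for (int k = 0; k < 64; k++)
        z[k] = z[k] + z[k + 64] + z[k + 128] + z[k + 192] + z[k + 256];
}
```

The loop has no cross-iteration dependency, so four adjacent values of k map directly onto one SSE register.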
* SBR DSP x86: implement SSE sbr_hf_gen (Christophe Gisquet, 2012-12-07)
    Start and end indices are multiples of 2, which guarantees aligned access.
    This also allows generating 4 floats per loop iteration while keeping the
    alignment throughout.
    Timing:
    - 32 bits: 326c -> 172c
    - 64 bits: 323c -> 156c
    Signed-off-by: Diego Biurrun <diego@biurrun.de>
* x86: Replace checks for CPU extensions and flags by convenience macros (Diego Biurrun, 2012-09-08)
    This separates code relying on inline from that relying on external
    assembly and fixes instances where the coalesced check was incorrect.
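The idea of such a convenience macro can be sketched as follows: a build-time "is external assembly available" switch is combined with a run-time CPU flag test, so the init code needs only one check. All names below (`HAVE_SSE_EXTERNAL_DEMO`, `CPU_FLAG_SSE_DEMO`, `EXTERNAL_SSE_DEMO`, `use_external_sse`) are hypothetical stand-ins for illustration and are not the project's actual definitions.

```c
/* Hypothetical build-time switch: 1 if external (standalone) assembly
 * for SSE was built, 0 otherwise. */
#define HAVE_SSE_EXTERNAL_DEMO 1

/* Hypothetical run-time CPU feature bit. */
#define CPU_FLAG_SSE_DEMO 0x0008

/* Combined check: true only if the code was built AND the CPU supports it. */
#define EXTERNAL_SSE_DEMO(flags) \
    (HAVE_SSE_EXTERNAL_DEMO && ((flags) & CPU_FLAG_SSE_DEMO))

/* Returns 1 when the external SSE routines may be installed. */
static int use_external_sse(int cpu_flags)
{
    return EXTERNAL_SSE_DEMO(cpu_flags) ? 1 : 0;
}
```

A parallel macro for inline assembly would test a different build-time switch, which is how the two kinds of code are kept apart.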
* SBR DSP x86: implement SSE sbr_hf_g_filt (Christophe GISQUET, 2012-02-23)
    The main loop processes 4 elements per iteration. Unrolling it to process
    a different number of elements:
    - 8: minor gain of 2 cycles (not worth the extra object size)
    - 2: loss of 8 cycles
    Assigning STEP to a register is a loss.
    Output address (Y) is almost always unaligned.
    Timings:
    - C (32/64 bits): 117/109 cycles
    - SSE: 57 cycles
    Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
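A scalar sketch may help in reading the notes above. It assumes that hf_g_filt scales one complex sample per band by a real gain, reading the input with a fixed stride between bands (presumably what the message calls STEP). The name `hf_g_filt_ref`, the row length `ROW_LEN_DEMO`, and the signature are assumptions made for illustration.

```c
/* Hypothetical row length: complex samples stored per band. */
#define ROW_LEN_DEMO 40

/* Scalar sketch (assumed behaviour, hypothetical name):
 * Y[m] = X_high[m][ixh] * g_filt[m] for m = 0..m_max-1.
 * Consecutive m are ROW_LEN_DEMO complex samples apart in X_high. */
static void hf_g_filt_ref(float (*Y)[2], float (*X_high)[ROW_LEN_DEMO][2],
                          const float *g_filt, int m_max, int ixh)
{
    for (int m = 0; m < m_max; m++) {
        Y[m][0] = X_high[m][ixh][0] * g_filt[m]; /* real part */
        Y[m][1] = X_high[m][ixh][1] * g_filt[m]; /* imaginary part */
    }
}
```

Under these assumptions the inputs are gathered with a stride while the output is written contiguously, which fits the remark that only the output address raises an alignment concern.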
* SBR DSP x86: implement SSE sbr_sum_square_sse (Christophe GISQUET, 2012-02-23)
    The 32bits targets have been compiled with -mfpmath=sse for proper
    reference.
    sbr_sum_square
    C  /32bits: 82c (unrolled)/102c
    C  /64bits: 69c (unrolled)/82c
    SSE/32bits: 42c
    SSE/64bits: 31c
    Use of SSE4.1 dpps to perform the final sum is slower.
    Not unrolling to perform 8 operations in a loop yields 10 more cycles.
    Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
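The quantity being timed can be sketched in scalar form, assuming sbr_sum_square returns the sum of squared real and imaginary parts over n complex samples. The name `sum_square_ref` and the use of two accumulators are illustrative choices, not a copy of the project's reference code.

```c
/* Scalar sketch (assumed behaviour, hypothetical name):
 * returns sum over i of x[i][0]^2 + x[i][1]^2. */
static float sum_square_ref(float (*x)[2], int n)
{
    float sum_re = 0.0f, sum_im = 0.0f;
    for (int i = 0; i < n; i++) {
        sum_re += x[i][0] * x[i][0]; /* real part squared */
        sum_im += x[i][1] * x[i][1]; /* imaginary part squared */
    }
    return sum_re + sum_im;
}
```

A SIMD version keeps several partial sums in one register and needs a horizontal add only once at the end, which is the "final sum" step the message compares against dpps.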