path: root/libavcodec/x86/sbrdsp.asm
Commit message | Author | Age
* dsputil x86: use SSE float instruction instead of SSE2 integer equivalent | Christophe GISQUET | 2012-04-04

    All the more required since the users are pure SSE functions.

    Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
* aacsbr: handle m_max values smaller than 4. | Ronald S. Bultje | 2012-03-23

    Prevents a signflip in the counter, and a subsequent crash because of
    overreads/overwrites.

    Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
    CC: libav-stable@libav.org
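
The failure mode named in this commit can be sketched in C. This is only an illustration of how a 4-wide loop can overrun when the element count is below 4; it is not the assembly from the commit, and the function names `touched_unguarded` and `touched_guarded` are hypothetical. Each function returns how many elements the loop would read or write for a given `m_max`.

```c
/* Hypothetical unguarded form: a 4-wide do/while loop that tests its
 * counter only after the first iteration. For m_max < 4 the counter
 * goes negative and the loop has already touched 4 elements. */
static int touched_unguarded(int m_max)
{
    int counter = m_max;
    int touched = 0;
    do {
        touched += 4;   /* stands in for one 4-float vector load/store */
        counter -= 4;   /* becomes negative when m_max < 4 */
    } while (counter > 0);
    return touched;
}

/* Hypothetical guarded form: the vector loop runs only for whole groups
 * of 4, and a scalar tail handles the remaining 0..3 elements. */
static int touched_guarded(int m_max)
{
    int counter = m_max;
    int touched = 0;
    while (counter >= 4) {
        touched += 4;
        counter -= 4;
    }
    while (counter > 0) {
        touched += 1;
        counter -= 1;
    }
    return touched;
}
```

With `m_max = 3`, the unguarded form touches 4 elements (one past the end), while the guarded form touches exactly 3. If the termination test compared the counter against zero for equality rather than sign, the negative counter would keep the loop running far past the buffer.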
* sbrdsp.asm: convert all instructions to float/SSE ones. | Reimar Döffinger | 2012-03-07

    Since the values are floats, using the float operations makes sense,
    improves performance on some CPUs and makes the code SSE compatible
    instead of needing SSE2.

    Based on suggestion by Jason.

    Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
    Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
* SBR DSP: fix SSE code to not use SSE2 instructions. | Reimar Döffinger | 2012-03-06

    movq from SSE register _to_ memory is an SSE2 instruction.
    Use the SSE movlps instruction instead, which does the same thing.

    Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
    Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
* SBR DSP x86: implement SSE sbr_hf_g_filt | Christophe GISQUET | 2012-02-23

    Unrolling the main loop to process, instead of 4 elements:
    - 8: minor gain of 2 cycles (not worth the extra object size)
    - 2: loss of 8 cycles.

    Assigning STEP to a register is a loss.

    Output address (Y) is almost always unaligned.

    Timings:
    - C (32/64 bits): 117/109 cycles
    - SSE: 57 cycles

    Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
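
For orientation, here is a minimal scalar sketch of the kind of loop this commit vectorizes. It assumes that sbr_hf_g_filt scales one complex sample per row of a two-dimensional buffer by a real gain, and that STEP in the message refers to the byte distance between consecutive rows. Neither detail is stated in the commit message. The name `hf_g_filt_ref`, the parameter names, and the row length `ROW_LEN` are hypothetical.

```c
#include <stddef.h>

/* Assumed layout: each row holds ROW_LEN complex (re, im) float pairs. */
#define ROW_LEN 40

/* For each of the m_max rows, take the complex sample at column ixh and
 * multiply both its real and imaginary parts by that row's real gain. */
static void hf_g_filt_ref(float (*Y)[2],
                          const float (*X_high)[ROW_LEN][2],
                          const float *g_filt,
                          int m_max, ptrdiff_t ixh)
{
    for (int m = 0; m < m_max; m++) {
        Y[m][0] = X_high[m][ixh][0] * g_filt[m];   /* real part */
        Y[m][1] = X_high[m][ixh][1] * g_filt[m];   /* imaginary part */
    }
}
```

Under these assumptions, consecutive inputs sit one full row apart in memory while the outputs are contiguous, so a 4-wide version has to gather four strided inputs per iteration. That is consistent with the note about STEP, but it is an inference rather than something the message states.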
* SBR DSP x86: implement SSE sbr_sum_square_sse | Christophe GISQUET | 2012-02-23

    The 32bits targets have been compiled with -mfpmath=sse for proper
    reference.

    sbr_sum_square
    C  /32bits: 82c (unrolled)/102c
    C  /64bits: 69c (unrolled)/82c
    SSE/32bits: 42c
    SSE/64bits: 31c

    Use of SSE4.1 dpps to perform the final sum is slower.

    Not unrolling to perform 8 operations in a loop yields 10 more cycles.

    Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
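
As a rough reference for what is being timed, a scalar sum of squares over complex samples might look like the sketch below. It assumes the input is an array of (re, im) float pairs and that the result is the sum of all squared components. The name `sum_square_ref` and its signature are hypothetical and are not taken from the source tree.

```c
/* Sum of squares over n complex samples stored as (re, im) float pairs.
 * Two independent accumulators are used, in the way a SIMD version keeps
 * separate per-lane partial sums before one final horizontal add. */
static float sum_square_ref(const float (*x)[2], int n)
{
    float sum_re = 0.0f;
    float sum_im = 0.0f;
    for (int i = 0; i < n; i++) {
        sum_re += x[i][0] * x[i][0];
        sum_im += x[i][1] * x[i][1];
    }
    return sum_re + sum_im;   /* the "final sum" step */
}
```

The last line corresponds to the final reduction, which is the step the message reports as slower when done with the SSE4.1 dpps instruction.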