| Commit message (Collapse) | Author | Age |
| |
|
|
|
|
|
|
|
| |
From 253 to 51 cycles on Arrandale and Win64.
44 cycles on SandyBridge.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
|
|
|
|
|
|
|
|
|
|
| |
Sandybridge: 47 cycles
Having a loop counter is a 7 cycle gain.
Unrolling is another 7 cycle gain.
Working in reverse scan is another 6 cycles.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
|
|
|
|
|
|
|
|
|
| |
Timing on Arrandale:
C SSE
Win32: 57 44
Win64: 47 38
Unrolling and not storing mask both save some cycles.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
| |
|
|
|
|
|
|
| |
255 to 174 cycles on Arrandale / Win64. Unrolling yields no gain.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
|
|
|
|
|
| |
698 to 174 cycles on Arrandale. Unrolling is a 6 cycles gain.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Start and end index are multiple of 2, therefore guaranteeing aligned access.
Also, this allows to generate 4 floats per loop, keeping the alignment all
along.
Timing:
- 32 bits: 326c -> 172c
- 64 bits: 323c -> 156c
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
|
|
|
|
| |
This separates code relying on inline from that relying on external
assembly and fixes instances where the coalesced check was incorrect.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unrolling the main loop to process, instead of 4 elements:
- 8: minor gain of 2 cycles (not worth the extra object size)
- 2: loss of 8 cycles.
Assigning STEP to a register is a loss. Output address (Y) is almost always
unaligned.
Timings:
- C (32/64 bits): 117/109 cycles
- SSE: 57 cycles
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
The 32bits targets have been compiled with -mfpmath=sse for proper reference.
sbr_sum_square C /32bits: 82c (unrolled)/102c
C /64bits: 69c (unrolled)/82c
SSE/32bits: 42c
SSE/64bits: 31c
Use of SSE4.1 dpps to perform the final sum is slower.
Not unrolling to perform 8 operations in a loop yields 10 more cycles.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|