| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The vector dequantization has a test in a loop preventing effective SIMD
implementation. By moving it out of the loop, this loop can be DSPized.
Therefore, modify the current DSP implementation. In particular, the
DSP implementation no longer has to handle null loop sizes.
The decode_hf implementations have following timings:
For x86 Arrandale:
C SSE SSE2 SSE4
win32: 260 162 119 104
win64: 242 N/A 89 72
The arm NEON optimizations follow in a later patch as external asm. The
now unused check for the y modifier in arm inline asm is removed from
configure.
|
|
|
|
|
|
|
|
|
| |
Based on a patch from Christophe Gisquet.
Unrolling of the m == 0 case avoids a possible use of the uninitilized
value sum when s->predictor_history is not set. I failed to find a
sample for it. It also reduced the cycle count from 220 to 150 on
sandy bridge, x86_64 linux, gcc 4.8.2 compared to his patch.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Timings for Arrandale:
C SSE
win32: 2108 334
win64: 1152 322
Factorizing the inner loop with a call/jmp is a >15 cycles cost, even with
the jmp destination being aligned.
Unrolling for ARCH_X86_64 is a 20 cycles gain.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
|
|
|
|
|
|
| |
This change is inspired by x86 asm where it frees a register.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
|
|
|
|
|
|
|
|
| |
Results for Arrandale/Windows:
32: 1670 -> 316
64: 728 -> 298
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
|
|
|
|
|
|
|
| |
The scaling factor is constant so it is faster to scale the
FIR coefficients in the tables during compilation.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
|
| |
|
|
|
|
| |
None of the encoder bits are arch-optimized.
|
|
|
|
| |
No permutation is necessary for the FDCT.
|
| |
|
|
|
|
| |
This also avoids a macro name clash and related warning on ARM.
|
|
|
|
| |
These are already covered through dependencies specified in configure.
|
| |
|
|
|
|
| |
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
| |
|
|
|
|
| |
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
|
|
|
|
| |
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
|
|
|
|
| |
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
|
|
|
|
|
|
|
| |
Framerate is now a sane rational instead of an integer, and
inputDepth is changed to what it actually is.
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
beta_offset is pre-multiplied by 2.
|
|
|
|
|
|
| |
f777504f640260337974848c7d5d7a3f064bbb45 changed a - in +
CC: libav-stable@libav.org
|
| |
|
|
|
|
|
|
|
|
|
|
| |
And use the value from the specification.
Sample-Id: 00000451-google
Found-by: Mateusz j00ru Jurczyk and Gynvael Coldwind
CC: libav-stable@libav.org
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
|
|
|
|
| |
Based on e3fec3f095ab5ea08ee662942d98526aaf5e3635 for arm.
|
|
|
|
| |
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
|
|
|
|
|
|
| |
Makes fate-h264 pass under valgrind --undef-value-errors=yes with
-cpuflags none. {avg,put}_h264_chroma_mc8_8 approximately 5% faster,
{avg,put}_h264_chroma_mc4_8 2% faster both on x86 and arm.
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
Sample-Id: 00001667-google
Reported-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC: libav-stable@libav.org
|
| |
|
| |
|
|
|
|
| |
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
|
|
|
|
| |
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
|
| |
|
| |
|
|
|
|
| |
Also drop support for building examples in library directories.
|
|
|
|
| |
Signed-off-by: Kostya Shishkov <kostya.shishkov@gmail.com>
|
| |
|
|
|
|
|
|
| |
'11b' is reserved in the A/52 specification,
but newer encoders use it to indicate a Dolby
Pro Logic II compatible Lt/Rt downmix.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
Altivec can only load naturally aligned vectors. To handle possibly
unaligned data a second vector is loaded from an offset of the original
location and the data is recovered through a vector permutation.
Overreads are minimal if the offset for second load points to the last
element of data. This is 7 for loading eight 8-bit pixels and overreads
are reduced from 16 bytes to 8 bytes if the pixels are 64-bit aligned.
For unaligned pixels the overread is reduced from 23 bytes to 15 bytes
in the worst case.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The official Ut Video decoder only threads with slices, thus until
now any files encoded by the libavcodec encoder have only been
decodable with a single thread. The default slice count is now
set to subsampled_height / 120.
Also sets slices to 1 for the Ut Video encoder tests to keep them
green.
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
|
|
|
|
| |
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
|