| Commit message (Collapse) | Author | Age |
... | |
|
|
|
| |
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
| |
|
|
|
|
|
|
| |
This was missed in the the previous commit in 70a1c800.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
| |
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
| |
|
| |
|
|
|
|
|
|
| |
This gets rid of a variable-length array and a for loop in C code.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The problem is that the ssse3 psign instruction does the wrong
thing here. Commit ea60dfe incorrectly removed a macro emulating
this instruction for pre-ssse3 code. However, the emulation is
incorrect, and the code relies on the behaviour of the macro.
Specifically, the psign sets destination elements to zero where
the corresponding source element is zero, whereas the emulation
only negates destination elements where the source is negative.
Furthermore, the PSIGNW_MMX macro in x86util.asm is totally bogus,
which is why the original VC-1 code had an additional right shift
when using it. Since the psign instruction cannot be used here,
skip all the macro hell and use the working instruction sequence
directly.
None of this was noticed due a stray return statement in
ff_vc1dsp_init_mmx() which meant that only the mmx version of the
loop filter was ever used (before being removed in ea60dfe).
Signed-off-by: Mans Rullgard <mans@mansr.com>
|
|
|
|
|
|
|
|
|
| |
The function call was a mess to handle, and memcpy cannot make
the assumptions we do in the new code.
Tested on an IMC sample: 430c -> 370c.
Signed-off-by: Mans Rullgard <mans@mansr.com>
|
|
|
|
|
|
|
| |
In a 64-bit PIC build, external functions must be called
through the PLT.
Signed-off-by: Mans Rullgard <mans@mansr.com>
|
| |
|
| |
|
|
|
|
| |
Signed-off-by: Mans Rullgard <mans@mansr.com>
|
|
|
|
|
|
|
|
| |
This removes a dependency on implementation details from generic
code and allows easy addition of the equivalent optimisation for
other architectures than x86.
Signed-off-by: Mans Rullgard <mans@mansr.com>
|
| |
|
|
|
|
| |
Signed-off-by: Mans Rullgard <mans@mansr.com>
|
|
|
|
| |
Signed-off-by: Mans Rullgard <mans@mansr.com>
|
| |
|
|
|
|
| |
Signed-off-by: Mans Rullgard <mans@mansr.com>
|
|
|
|
| |
Move vector_fmul() from DSPContext to AVFloatDSPContext.
|
|
|
|
| |
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
|
|
|
|
|
|
| |
This is needed for older versions of yasm/nasm that do not support AVX.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
|
|
|
| |
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
|
| |
|
|
|
|
|
|
| |
Simplifies the code by using cpuflags and a new macro.
Also fixes the invalid use of the MMX2 pshufw operation in the MMX-only
function.
|
|
|
|
| |
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Code mostly inspired by vp8's MC, however:
- its MMX2 horizontal filter is worse because it can't take advantage of
the coefficient redundancy
- that same coefficient redundancy allows better code for non-SSSE3 versions
Benchmark (rounded to tens of unit):
V8x8 H8x8 2D8x8 V16x16 H16x16 2D16x16
C 445 358 985 1785 1559 3280
MMX* 219 271 478 714 929 1443
SSE2 131 158 294 425 515 892
SSSE3 120 122 248 387 390 763
End result is overall around a 15% speedup for SSSE3 version (on 6 sequences);
all loop filter functions now take around 55% of decoding time, while luma MC
dsp functions are around 6%, chroma ones are 1.3% and biweight around 2.3%.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
|
|
|
|
| |
Fixes a compile error with clang at -O0.
|
|
|
|
|
|
| |
Commit 356ee8d caused the initial inversion.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
|
|
| |
141 cycles down to 51.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
|
|
|
|
|
|
| |
This adds a hand-optimized assembly version for get_cabac much like the
existing one, but it works if the table offsets are RIP-relative.
Compared to the non-RIP-relative version this adds 2 lea instructions
and it needs one extra register. get_cabac() gets about 40% faster, for
an overall speedup of about 5%.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The reason is this is easier for PIC code (in particular on darwin...).
Keep the old names as pointers (static in cabac_functions.h so gcc
knows these are just immediate offsets) so the c code can nicely stay the same
(alternatively could use offsets directly in the functions needing the
tables). This should produce the same code as before with non-pic and better
code (confirmed) with pic.
The assembly uses the new table but still won't work for PIC case.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
| |
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
|
|
|
| |
This feature is complex, of questionable utility, and slows down
normal decoding.
Signed-off-by: Mans Rullgard <mans@mansr.com>
|
|
|
|
|
|
|
|
| |
This removes all references to AVCodecContext.dsp_mask and marks
it for eviction at the next version bump. It has been superseded
by av_set_cpu_flag_mask() which, unlike this field, works everywhere.
Signed-off-by: Mans Rullgard <mans@mansr.com>
|
|
|
|
| |
Fixes crashes when using biweight on win64.
|
|
|
|
|
|
|
|
| |
Recent register allocation changes (x86inc.asm update) changed the
register order and thus opcodes for the inner loops. One of them became
>128bytes, which confuses other parts of this function where it jumps
to fixed-offset positions to extend the edge by fixed amounts. A simple
register change fixes this.
|
|
|
|
|
|
| |
Fixes ac3-encode and eac3-encode FATE test failures with SSE2 disabled.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
|
| |
This should have been updated in the x86inc.asm update, but was
accidently forgotten.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add support for all x86-64 registers
Prefer caller-saved register over callee-saved on WIN64
Support up to 15 function arguments
Also (by Ronald S. Bultje)
Fix up our asm to work with new x86inc.asm.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
|
|
|
|
|
|
| |
Around 10 cycles faster for luma.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
|
|
|
|
|
|
| |
Quite often, the original weights are multiple of 512. By prescaling them
by 1/512 when they are computed (once per frame), no intermediate shifting
is needed, and no prescaling on each call either.
The x86 code already used that trick.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
|
|
| |
All the more required since the users are pure SSE functions.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
| |
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
| |
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
|
|
|
|
| |
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
|
| |
|
| |
|