libav.git - [no description]

	Commit message	Author	Age
*	lavc/audiodsp: fix RISC-V V scalar product (again)	Rémi Denis-Courmont	2022-10-17
\| \| \| \| \|	The loop uses a 32-bit accumulator. The current code would only zero the lower 16 bits thereof.
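To illustrate why a partially cleared accumulator matters, here is a minimal scalar sketch in C. It is not the RISC-V V code from the commit. The function name `scalarproduct_from` and the helper `clear_low16` are hypothetical; the latter only models "zeroing the lower 16 bits" of a 32-bit value that still holds stale data.

```c
#include <stdint.h>

/* Hypothetical helper: models clearing only the lower 16 bits of a
 * 32-bit accumulator, leaving any stale upper bits in place. */
static uint32_t clear_low16(uint32_t stale)
{
    return stale & 0xFFFF0000u;
}

/* Scalar sketch: sum of 16-bit products accumulated in 32 bits,
 * starting from a caller-supplied initial accumulator value.
 * Unsigned arithmetic keeps wraparound well defined. */
static int32_t scalarproduct_from(uint32_t acc, const int16_t *v1,
                                  const int16_t *v2, int len)
{
    for (int i = 0; i < len; i++)
        acc += (uint32_t)(v1[i] * v2[i]);
    return (int32_t)acc;
}
```

Starting from a fully zeroed accumulator gives the expected sum, whereas starting from `clear_low16(stale)` carries the stale upper half into the result.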
*	riscv: fix scalar product initialisation	Rémi Denis-Courmont	2022-10-13
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	'VSETVLI xd, x0, ...' has rather nonobvious semantics: - If xd is x0, then it preserves the current vector length. - If xd is not x0, it sets the vector length to the supported maximum. Also somewhat confusingly, while VMV.X.S always does its thing regardless of the selected vector length, VMV.S.X does _nothing_ if the selected vector length is zero. So the current code fails to initialise the accumulator if we are unlucky to have a selected vector length of zero on entry. Fix it by forcing the vector length to one.
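The behaviour described above can be mimicked with a toy model in C. This is only a sketch of the stated semantics, not an emulator: the struct `vmodel` and every `model_*` function are hypothetical names, and only element 0 of a single vector register is tracked.

```c
#include <stdint.h>

/* Toy state: selected vector length, hardware maximum, and element 0
 * of one vector register (all names are illustrative). */
struct vmodel {
    unsigned vl;
    unsigned vlmax;
    int32_t  elem0;
};

/* Models "VSETVLI xd, x0, ...": keeps vl when xd is x0,
 * otherwise selects the supported maximum. */
static unsigned model_vsetvli_x0(struct vmodel *m, int xd_is_x0)
{
    if (!xd_is_x0)
        m->vl = m->vlmax;
    return m->vl;
}

/* Models requesting an explicit vector length (capped at the maximum). */
static unsigned model_set_vl(struct vmodel *m, unsigned avl)
{
    m->vl = avl < m->vlmax ? avl : m->vlmax;
    return m->vl;
}

/* Models VMV.S.X: writes element 0 only if the vector length is non-zero. */
static void model_vmv_s_x(struct vmodel *m, int32_t x)
{
    if (m->vl > 0)
        m->elem0 = x;
}

/* Models VMV.X.S: always reads element 0, whatever the vector length. */
static int32_t model_vmv_x_s(const struct vmodel *m)
{
    return m->elem0;
}
```

In this model, entering with `vl == 0` and keeping it leaves a stale `elem0` after `model_vmv_s_x`, while calling `model_set_vl(&m, 1)` first makes the write take effect.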
*	lavc/aacpsdsp: fix clobber on RISC-V LP64D/ILP32D	Rémi Denis-Courmont	2022-10-10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Although the DSP function only uses single precision from RISC-V F, the caller may leave double precision values in the spilled registers if the calling convention supports double precision hardware floats. Then, we need to save and restore FS registers as double precision. Conversely, we do not need to save anything at all if an integer calling convention is in use. However, we can assume that single precision floats are supported, since the Zve32f extension implies the F extension. So for the sake of simplicity, we always save at least single precision values. In theory, we should even save quadruple precision values if the LP64Q ABI is in use. I have yet to see a compiler that supports it though.
*	lavc/opusdsp: RISC-V V (512-bit) postfilter	Rémi Denis-Courmont	2022-10-10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds a variant of the postfilter for use with 512-bit vectors. Half a vector is enough to perform the scalar product. Normally a whole vector would be used anyhow. Indeed fractional multipliers are no faster than the unit multiplier. But in this particular function, a full vector makes up 16 samples, which would be loaded at each iteration of the outer loop. The minimum guaranteed CELT postfilter period is only 15. Accounting for the edges, we can only safely preload up to 13 samples. The fractional multiplier is thus used to cap the selected vector length to a safe value of 8 elements or 256 bits. Likewise, we have the 1024-bit variant with the quarter multiplier. In theory, a 2048-bit one would be possible with the eighth multiplier, but that length is not even defined in the specifications as of yet, nor is it supported by any emulator - forget actual hardware.
*	lavc/opusdsp: RISC-V V (256-bit) postfilter	Rémi Denis-Courmont	2022-10-10
\| \| \| \| \| \| \| \| \| \|	This adds a variant of the postfilter for use with 256-bit vectors. As a single vector is then large enough to perform the scalar product, the group multiplier is reduced to just one at run-time. The different vector type is passed via register. Unfortunately, there is no VSETIVL instruction, so the constant vector size (5) also needs to be passed via a register.
*	lavc/opusdsp: RISC-V V (128-bit) postfilter	Rémi Denis-Courmont	2022-10-10
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is implemented for a vector size of 128-bit. Since the scalar product in the inner loop covers 5 samples or 160 bits, we need a group multiplier of 2. To avoid reconfiguring the vector type, the outer loop, which loads multiple input samples, sticks to the same multiplier. Consequently, the outer loop loads 8 samples per iteration. This is safe since the minimum period of the CELT codec is 15 samples. The same code would also work, albeit needlessly inefficiently with a vector length of 256 bits. A proper implementation will follow instead.
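The multiplier choices in the postfilter commits above can be checked with a little arithmetic. The sketch below assumes 32-bit samples (consistent with "5 samples or 160 bits") and computes the maximum element count as vector bits divided by element bits, scaled by the group multiplier. The function name `vlmax_elems` and its parameter layout are hypothetical.

```c
/* Maximum number of elements for a given vector register width,
 * element width, and group multiplier expressed as num/den
 * (e.g. 2/1 for a multiplier of 2, 1/2 for the half multiplier). */
static unsigned vlmax_elems(unsigned vlen_bits, unsigned sew_bits,
                            unsigned lmul_num, unsigned lmul_den)
{
    return vlen_bits / sew_bits * lmul_num / lmul_den;
}
```

Under these assumptions, `vlmax_elems(128, 32, 2, 1)`, `vlmax_elems(256, 32, 1, 1)`, `vlmax_elems(512, 32, 1, 2)` and `vlmax_elems(1024, 32, 1, 4)` all evaluate to 8 elements, which stays below the 13 samples that can be safely preloaded, whereas a full 512-bit vector (`vlmax_elems(512, 32, 1, 1)`) would hold 16.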
*	lavc/bswapdsp: RISC-V V bswap16_buf	Rémi Denis-Courmont	2022-10-05
\|
*	lavc/bswapdsp: RISC-V V bswap_buf	Rémi Denis-Courmont	2022-10-05
\|
*	lavc/bswapdsp: RISC-V B bswap_buf	Rémi Denis-Courmont	2022-10-05
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Simply taking the Zbb REV8 instruction into use in a simple loop gives some significant savings: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 771.0 But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with just one additional shift, and one fewer load, effectively doubling the bandwidth. Consequently, this patch is useful even if the compile-time target has Zbb enabled for C code: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 341.0 (this patch) On the other hand, this approach fails miserably for bswap16_buf as the ratio of shifts and stores becomes unfavorable compared to naïve C: bswap16_buf_c: 1542.0 bswap16_buf_rvb_b: 1803.7 Unrolling to process 128 bits (4 samples) at a time actually worsens performance ever so slightly: bswap_buf_c: 1081.0 bswap_buf_rvb_b: 408.5
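The pseudo-SIMD idea described above can be sketched in portable C. This is not the assembly from the patch: `bswap64_sketch` stands in for the 64-bit REV8 instruction, the 64-bit value is assembled arithmetically rather than with a single load, and all function names are hypothetical.

```c
#include <stdint.h>

/* Byte-swap one 32-bit word. */
static uint32_t bswap32_sketch(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Byte-swap a 64-bit value; stands in for a 64-bit REV8. */
static uint64_t bswap64_sketch(uint64_t x)
{
    return ((uint64_t)bswap32_sketch((uint32_t)x) << 32) |
           bswap32_sketch((uint32_t)(x >> 32));
}

/* Swap two 32-bit words per 64-bit byte reversal. Reversing all eight
 * bytes also exchanges the two words, so the first output is taken
 * from the upper half (one extra shift) and the second from the
 * lower half. */
static void bswap_buf_pairs(uint32_t *dst, const uint32_t *src, int len)
{
    int i = 0;

    for (; i + 1 < len; i += 2) {
        uint64_t x = (uint64_t)src[i] | ((uint64_t)src[i + 1] << 32);
        uint64_t y = bswap64_sketch(x);

        dst[i]     = (uint32_t)(y >> 32);
        dst[i + 1] = (uint32_t)y;
    }
    if (i < len) /* odd tail element */
        dst[i] = bswap32_sketch(src[i]);
}
```

For 16-bit samples the same trick would need several shifts and stores per 64-bit word to separate the four results, which matches the unfavourable ratio reported for bswap16_buf.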
*	riscv/alacdsp: drop config.h include	Lynne	2022-10-05
\|
*	riscv: remove unnecessary #include's	Rémi Denis-Courmont	2022-10-05
\| \| \| \|	Pointed out by Andreas Rheinhardt.
*	lavc/alacdsp: RISC-V V append_extra_bits[1]	Rémi Denis-Courmont	2022-10-05
\|
*	lavc/alacdsp: RISC-V V append_extra_bits[0]	Rémi Denis-Courmont	2022-10-05
\|
*	lavc/alacdsp: RISC-V V decorrelate_stereo	Rémi Denis-Courmont	2022-10-05
\| \| \| \| \| \| \| \| \| \| \|	To avoid data dependencies, this does the following unroll, which requires one extra but probably free addition: coeff = (b * left_weight) >> decorr_shift; b += a; a -= coeff; b -= coeff; swap(a, b);
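The reordering above can be compared against a straightforward scalar loop. The sketch below assumes the usual dependent form (subtract the weighted term from a, then add the new a to b, then store swapped) as the reference. Both function names and the two-pointer buffer layout are hypothetical.

```c
#include <stdint.h>

/* Dependent form: the update of b has to wait for the new value of a. */
static void decorrelate_ref(int32_t *buf0, int32_t *buf1, int n,
                            int shift, int weight)
{
    for (int i = 0; i < n; i++) {
        int32_t a = buf0[i], b = buf1[i];

        a -= (b * weight) >> shift;
        b += a;
        buf0[i] = b;
        buf1[i] = a;
    }
}

/* Reordered form from the commit message: once coeff is known, the two
 * subtractions are independent, at the cost of one extra addition. */
static void decorrelate_unrolled(int32_t *buf0, int32_t *buf1, int n,
                                 int shift, int weight)
{
    for (int i = 0; i < n; i++) {
        int32_t a = buf0[i], b = buf1[i];
        int32_t coeff = (b * weight) >> shift;

        b += a;
        a -= coeff;
        b -= coeff;
        buf0[i] = b; /* swap(a, b) on store */
        buf1[i] = a;
    }
}
```

Both forms compute b + a - coeff and a - coeff, so they produce the same outputs as long as the intermediate sums do not overflow.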
*	riscv: Fix linking without RVV; change #ifdef into #if	Martin Storsjö	2022-09-29
\| \| \| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/pixblockdsp: RISC-V diff_pixels & diff_pixels_unaligned	Rémi Denis-Courmont	2022-09-28
\|
*	lavc/pixblockdsp: RISC-V V 16-bit get_pixels & get_pixels_unaligned	Rémi Denis-Courmont	2022-09-28
\|
*	lavc/pixblockdsp: RISC-V V 8-bit get_pixels & get_pixels_unaligned	Rémi Denis-Courmont	2022-09-28
\|
*	lavc/idctdsp: RISC-V V put_signed_pixels_clamped function	Rémi Denis-Courmont	2022-09-28
\|
*	lavc/idctdsp: RISC-V V add_pixels_clamped function	Rémi Denis-Courmont	2022-09-28
\|
*	lavc/idctdsp: RISC-V V put_pixels_clamped function	Rémi Denis-Courmont	2022-09-28
\|
*	riscv: Use the correct path for including asm.S	Martin Storsjö	2022-09-28
\| \| \| \|	Signed-off-by: Martin Storsjö <martin@martin.st>
*	lavc/aacpsdsp: RISC-V V stereo_interpolate[0]	Rémi Denis-Courmont	2022-09-27
\|
*	lavc/aacpsdsp: RISC-V V hybrid_synthesis_deint	Rémi Denis-Courmont	2022-09-27
\|
*	lavc/aacpsdsp: RISC-V V hybrid_analysis_ileave	Rémi Denis-Courmont	2022-09-27
\|
*	lavc/aacpsdsp: RISC-V V hybrid_analysis	Rémi Denis-Courmont	2022-09-27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This starts with one-time initialisation of the 26 constant factors like 08edacc248bce3f8946d75e97188d189c74a6de6. That is done with the scalar instruction set. While the formula can readily be vectored, the gains would (probably) be more than lost in transferring the results back to FP registers (or suitably reshuffling them into vector registers). Note that the main loop could likely be scheduled slightly better by expanding the filter macro and interleaving loads with arithmetic. It is not clear yet if that would be relevant for vector processing (as opposed to traditional SIMD). We could also use fewer vectors, but there is not much point in sparing them (they are all callee-clobbered).
*	lavc/aacpsdsp: RISC-V V mul_pair_single	Rémi Denis-Courmont	2022-09-27
\|
*	lavc/aacpsdsp: RISC-V V add_squares	Rémi Denis-Courmont	2022-09-27
\|
*	lavc/vorbisdsp: RISC-V V inverse_coupling	Rémi Denis-Courmont	2022-09-27
\| \| \| \| \| \| \| \| \|	This uses the following vectorisation: for (i = 0; i < blocksize; i++) { ang[i] = mag[i] - copysignf(fmaxf(ang[i], 0.f), mag[i]); mag[i] = mag[i] - copysignf(fminf(ang[i], 0.f), mag[i]); }
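The branch-free formulation above can be checked against a branchy scalar loop. The sketch below makes a few assumptions: the branchy version is taken to be the conventional scalar form of Vorbis inverse coupling. The helpers `max0`, `min0` and `with_sign` are hypothetical stand-ins for `fmaxf(x, 0.f)`, `fminf(x, 0.f)` and `copysignf`, written out so the block does not depend on libm. Inputs are assumed finite with non-zero `mag[i]`; the sign of zero is not examined here. Note that both expressions use the original `ang[i]`, so the sketch copies it before overwriting.

```c
/* Stand-ins for fmaxf(x, 0.f), fminf(x, 0.f) and copysignf(x, s),
 * valid for finite x and non-zero s. */
static float max0(float x) { return x > 0.f ? x : 0.f; }
static float min0(float x) { return x < 0.f ? x : 0.f; }
static float with_sign(float x, float s)
{
    float a = x < 0.f ? -x : x;
    return s < 0.f ? -a : a;
}

/* Branchy reference (assumed conventional scalar form). */
static void inverse_coupling_branchy(float *mag, float *ang, int n)
{
    for (int i = 0; i < n; i++) {
        if (mag[i] > 0.f) {
            if (ang[i] > 0.f) {
                ang[i] = mag[i] - ang[i];
            } else {
                float t = ang[i];
                ang[i]  = mag[i];
                mag[i] += t;
            }
        } else {
            if (ang[i] > 0.f) {
                ang[i] += mag[i];
            } else {
                float t = ang[i];
                ang[i]  = mag[i];
                mag[i] -= t;
            }
        }
    }
}

/* Branch-free form following the commit message. */
static void inverse_coupling_branchless(float *mag, float *ang, int n)
{
    for (int i = 0; i < n; i++) {
        float m = mag[i], a = ang[i]; /* original values */

        ang[i] = m - with_sign(max0(a), m);
        mag[i] = m - with_sign(min0(a), m);
    }
}
```

Running both on one sample per sign combination of `mag` and `ang` gives identical results, which is what makes the formulation suitable for element-wise vector instructions.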
*	lavc/fmtconvert: RISC-V V int32_to_float_fmul_array8	Rémi Denis-Courmont	2022-09-27
\|
*	lavc/fmtconvert: RISC-V V int32_to_float_fmul_scalar	Rémi Denis-Courmont	2022-09-27
\|
*	lavc/audiodsp: RISC-V V scalarproduct_int16	Rémi Denis-Courmont	2022-09-27
\|
*	lavc/audiodsp: RISC-V V vector_clipf	Rémi Denis-Courmont	2022-09-27
\|
*	lavc/audiodsp: RISC-V V vector_clip_int32	Rémi Denis-Courmont	2022-09-27
\|
*	lavc/pixblockdsp: RISC-V I get_pixels	Rémi Denis-Courmont	2022-09-27
\| \| \| \| \| \|	Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech): get_pixels_c: 180.0 get_pixels_rvi: 136.7
*	lavc/audiodsp: RISC-V F vector_clipf	Rémi Denis-Courmont	2022-09-27
	RV64G supports MIN & MAX instructions natively only on floating point registers, not general purpose ones. The latter would require the Zbb extension. Due to that, it is actually faster to perform the clipping "properly" in FPU. Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech): audiodsp.vector_clipf_c: 29551.5 audiodsp.vector_clipf_rvf: 17871.0 Also tried unrolling with 2 or 8 elements but it gets worse either way.