path: root/libavcodec/riscv
Commit message | Author | Age
* lavc/audiodsp: fix RISC-V V scalar product (again) | Rémi Denis-Courmont | 2022-10-17
| The loop uses a 32-bit accumulator. The current code would only zero the lower 16 bits thereof.
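To illustrate why a partly zeroed accumulator matters, here is a minimal scalar sketch in C. The helper name `scalarproduct_from` and its explicit starting accumulator are hypothetical, introduced only to model stale upper bits; this is not the vector code from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scalar product of two int16 arrays, starting from an
 * explicit 32-bit accumulator value so that a stale accumulator can be
 * modelled. */
static int32_t scalarproduct_from(uint32_t acc, const int16_t *v1,
                                  const int16_t *v2, size_t len)
{
    assert(len == 0 || (v1 != NULL && v2 != NULL));
    for (size_t i = 0; i < len; i++)
        acc += (uint32_t)((int32_t)v1[i] * v2[i]); /* wraps modulo 2^32 */
    return (int32_t)acc;
}
```

Starting from 0 gives the expected sum. Starting from a value whose lower 16 bits are zero but whose upper 16 bits are stale gives a different result, which is the failure mode described above.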
* riscv: fix scalar product initialisation | Rémi Denis-Courmont | 2022-10-13
| 'VSETVLI xd, x0, ...' has rather nonobvious semantics:
| - If xd is x0, then it preserves the current vector length.
| - If xd is not x0, it sets the vector length to the supported maximum.
|
| Also somewhat confusingly, while VMV.X.S always does its thing regardless of the selected vector length, VMV.S.X does _nothing_ if the selected vector length is zero.
|
| So the current code fails to initialise the accumulator if we are unlucky enough to have a selected vector length of zero on entry. Fix it by forcing the vector length to one.
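The interaction described above can be sketched with a toy model in C. Everything here is hypothetical and simplified: `struct vstate` tracks only the selected vector length and element 0 of one register, and the function names merely mirror the instructions they stand for.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: only the vector length and element 0 of one vector register. */
struct vstate {
    unsigned vl;    /* currently selected vector length */
    unsigned vlmax; /* supported maximum for the current vector type */
    int32_t  elem0; /* element 0 of the accumulator register */
};

/* Models 'VSETVLI xd, x0, ...'. */
static void vsetvli_rs1_x0(struct vstate *s, int xd_is_x0)
{
    if (!xd_is_x0)
        s->vl = s->vlmax; /* xd is not x0: select the supported maximum */
    /* xd is x0: the current vector length is preserved */
}

/* Models selecting an explicit length (simplified: capped to vlmax). */
static void set_vl(struct vstate *s, unsigned avl)
{
    assert(s->vlmax > 0);
    s->vl = avl < s->vlmax ? avl : s->vlmax;
}

/* Models VMV.S.X: does nothing when the selected vector length is zero. */
static void vmv_s_x(struct vstate *s, int32_t x)
{
    if (s->vl > 0)
        s->elem0 = x;
}
```

With `vl` equal to zero on entry, `vsetvli_rs1_x0(&s, 1)` followed by `vmv_s_x(&s, 0)` leaves `elem0` untouched, whereas `set_vl(&s, 1)` first makes the initialisation take effect.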
* lavc/aacpsdsp: fix clobber on RISC-V LP64D/ILP32D | Rémi Denis-Courmont | 2022-10-10
| Although the DSP function only uses single precision from RISC-V F, the caller may leave double precision values in the spilled registers if the calling convention supports double precision hardware floats. Then, we need to save and restore FS registers as double precision.
|
| Conversely, we do not need to save anything at all if an integer calling convention is in use. However, we can assume that single precision floats are supported, since the Zve32f extension implies the F extension. So for the sake of simplicity, we always save at least single precision values.
|
| In theory, we should even save quadruple precision values if the LP64Q ABI is in use. I have yet to see a compiler that supports it though.
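The save-width rule above could be expressed as a compile-time choice. The following is a sketch only, assuming the `__riscv_float_abi_double` predefined macro from the RISC-V C API; `FS_SAVE_BYTES` and `fs_save_area` are hypothetical names, and the actual change lives in assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Width used to spill one callee-saved FS register:
 * double precision under LP64D/ILP32D, single precision otherwise
 * (including integer calling conventions, for simplicity). */
#if defined(__riscv_float_abi_double)
# define FS_SAVE_BYTES 8
#else
# define FS_SAVE_BYTES 4
#endif

/* Hypothetical helper: bytes needed to spill nregs FS registers
 * (there are 12 of them, fs0 to fs11). */
static size_t fs_save_area(unsigned nregs)
{
    assert(nregs <= 12);
    return (size_t)nregs * FS_SAVE_BYTES;
}
```

The quadruple precision case is deliberately left out, as in the commit message.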
* lavc/opusdsp: RISC-V V (512-bit) postfilter | Rémi Denis-Courmont | 2022-10-10
| This adds a variant of the postfilter for use with 512-bit vectors. Half a vector is enough to perform the scalar product. Normally a whole vector would be used anyhow. Indeed fractional multipliers are no faster than the unit multiplier.
|
| But in this particular function, a full vector makes up 16 samples, which would be loaded at each iteration of the outer loop. The minimum guaranteed CELT postfilter period is only 15. Accounting for the edges, we can only safely preload up to 13 samples. The fractional multiplier is thus used to cap the selected vector length to a safe value of 8 elements or 256 bits.
|
| Likewise, we have the 1024-bit variant with the quarter multiplier. In theory, a 2048-bit one would be possible with the eighth multiplier, but that length is not even defined in the specifications as of yet, nor is it supported by any emulator, let alone actual hardware.
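The sizing arithmetic across the postfilter variants can be checked with a short C sketch. The names are hypothetical, and `EDGE_SAMPLES = 2` is an assumption chosen to match the 15 - 2 = 13 figure quoted above.

```c
#include <assert.h>

enum {
    SAMPLE_BITS     = 32, /* single precision samples */
    FILTER_TAPS     = 5,  /* samples covered by the scalar product */
    CELT_MIN_PERIOD = 15, /* minimum guaranteed postfilter period */
    EDGE_SAMPLES    = 2   /* assumption: samples lost to the filter edges */
};

/* Elements held by a register group of lmul_num/lmul_den vectors,
 * each vlen_bits wide. */
static int group_elems(int vlen_bits, int lmul_num, int lmul_den)
{
    assert(lmul_num > 0 && lmul_den > 0);
    return vlen_bits / SAMPLE_BITS * lmul_num / lmul_den;
}

/* A group is usable if it covers the scalar product and does not
 * preload more samples than the minimum period allows. */
static int is_safe(int elems)
{
    return elems >= FILTER_TAPS && elems <= CELT_MIN_PERIOD - EDGE_SAMPLES;
}
```

Under these assumptions, 128 bits with a multiplier of 2, 256 bits with 1, 512 bits with 1/2 and 1024 bits with 1/4 all select 8 elements, which is safe, while a full 512-bit vector (16 elements) is not.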
* lavc/opusdsp: RISC-V V (256-bit) postfilter | Rémi Denis-Courmont | 2022-10-10
| This adds a variant of the postfilter for use with 256-bit vectors. As a single vector is then large enough to perform the scalar product, the group multiplier is reduced to just one at run-time. The different vector type is passed via a register. Unfortunately, there is no VSETIVL instruction, so the constant vector size (5) also needs to be passed via a register.
* lavc/opusdsp: RISC-V V (128-bit) postfilter | Rémi Denis-Courmont | 2022-10-10
| This is implemented for a vector size of 128 bits. Since the scalar product in the inner loop covers 5 samples or 160 bits, we need a group multiplier of 2. To avoid reconfiguring the vector type, the outer loop, which loads multiple input samples, sticks to the same multiplier. Consequently, the outer loop loads 8 samples per iteration. This is safe since the minimum period of the CELT codec is 15 samples.
|
| The same code would also work, albeit needlessly inefficiently, with a vector length of 256 bits. A proper implementation will follow instead.
* lavc/bswapdsp: RISC-V V bswap16_buf | Rémi Denis-Courmont | 2022-10-05
|
* lavc/bswapdsp: RISC-V V bswap_buf | Rémi Denis-Courmont | 2022-10-05
|
* lavc/bswapdsp: RISC-V B bswap_buf | Rémi Denis-Courmont | 2022-10-05
| Simply taking the Zbb REV8 instruction into use in a simple loop gives some significant savings:
|
| bswap_buf_c: 1081.0
| bswap_buf_rvb_b: 771.0
|
| But we can also use the 64-bit REV8 as a pseudo-SIMD instruction with just one additional shift, and one fewer load, effectively doubling the bandwidth. Consequently, this patch is useful even if the compile-time target has Zbb enabled for C code:
|
| bswap_buf_c: 1081.0
| bswap_buf_rvb_b: 341.0 (this patch)
|
| On the other hand, this approach fails miserably for bswap16_buf as the ratio of shifts and stores becomes unfavorable compared to naïve C:
|
| bswap16_buf_c: 1542.0
| bswap16_buf_rvb_b: 1803.7
|
| Unrolling to process 128 bits (4 samples) at a time actually worsens performance ever so slightly:
|
| bswap_buf_c: 1081.0
| bswap_buf_rvb_b: 408.5
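The pseudo-SIMD idea can be sketched in portable C: reverse the bytes of a 64-bit word holding two 32-bit samples, then use one shift to put the halves back in order. This is an illustration under assumptions, not the assembly from the patch; the function names are hypothetical and `bswap64_ref` stands in for a 64-bit REV8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain 32-bit byte swap. */
static uint32_t bswap32_ref(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Stand-in for a 64-bit REV8: reverses all eight bytes. */
static uint64_t bswap64_ref(uint64_t x)
{
    return ((uint64_t)bswap32_ref((uint32_t)x) << 32) |
           bswap32_ref((uint32_t)(x >> 32));
}

/* Byte-swaps 32-bit words two at a time. */
static void bswap_buf_pairs(uint32_t *dst, const uint32_t *src, size_t len)
{
    size_t i = 0;

    assert(len == 0 || (dst != NULL && src != NULL));
    for (; i + 2 <= len; i += 2) {
        /* Models one 64-bit little-endian load of two samples. */
        uint64_t pair = (uint64_t)src[i] | ((uint64_t)src[i + 1] << 32);
        uint64_t rev  = bswap64_ref(pair);

        /* The full reversal also exchanged the two samples, so the
         * first result is in the upper half: one extra shift. */
        dst[i]     = (uint32_t)(rev >> 32);
        dst[i + 1] = (uint32_t)rev;
    }
    if (i < len) /* odd trailing sample */
        dst[i] = bswap32_ref(src[i]);
}
```

The same trick applied to 16-bit samples would need several shifts per 64-bit word to reorder four samples, which is consistent with the unfavorable bswap16_buf numbers above.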
* riscv/alacdsp: drop config.h include | Lynne | 2022-10-05
|
* riscv: remove unnecessary #include's | Rémi Denis-Courmont | 2022-10-05
| Pointed out by Andreas Rheinhardt.
* lavc/alacdsp: RISC-V V append_extra_bits[1] | Rémi Denis-Courmont | 2022-10-05
|
* lavc/alacdsp: RISC-V V append_extra_bits[0] | Rémi Denis-Courmont | 2022-10-05
|
* lavc/alacdsp: RISC-V V decorrelate_stereo | Rémi Denis-Courmont | 2022-10-05
| To avoid data dependencies, this does the following unroll, which requires one extra but probably free addition:
|
|     coeff = (b * left_weight) >> decorr_shift;
|     b += a;
|     a -= coeff;
|     b -= coeff;
|     swap(a, b);
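The unroll can be compared against a dependent form in a small C sketch. The "reference" below is an assumption about what the straightforward scalar code looks like (update `a`, then add it to `b`, then store swapped); function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed straightforward form: b depends on the updated a. */
static void decorrelate_ref(int32_t *out0, int32_t *out1,
                            int32_t a, int32_t b, int weight, int shift)
{
    assert(shift >= 0 && shift < 32);
    a -= (b * weight) >> shift;
    b += a; /* must wait for the new value of a */
    *out0 = b;
    *out1 = a;
}

/* Unrolled form from the commit message: the first addition no longer
 * depends on coeff, at the cost of one extra subtraction. */
static void decorrelate_unrolled(int32_t *out0, int32_t *out1,
                                 int32_t a, int32_t b, int weight, int shift)
{
    assert(shift >= 0 && shift < 32);
    int32_t coeff = (b * weight) >> shift;

    b += a;     /* independent of coeff */
    a -= coeff;
    b -= coeff; /* the extra operation */
    *out0 = b;  /* swap(a, b) expressed through the stores */
    *out1 = a;
}
```

Both forms compute a' = a - coeff and b' = b + a', so they agree for any inputs that do not overflow.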
* riscv: Fix linking without RVV; change #ifdef into #if | Martin Storsjö | 2022-09-29
| Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/pixblockdsp: RISC-V diff_pixels & diff_pixels_unaligned | Rémi Denis-Courmont | 2022-09-28
|
* lavc/pixblockdsp: RISC-V V 16-bit get_pixels & get_pixels_unaligned | Rémi Denis-Courmont | 2022-09-28
|
* lavc/pixblockdsp: RISC-V V 8-bit get_pixels & get_pixels_unaligned | Rémi Denis-Courmont | 2022-09-28
|
* lavc/idctdsp: RISC-V V put_signed_pixels_clamped function | Rémi Denis-Courmont | 2022-09-28
|
* lavc/idctdsp: RISC-V V add_pixels_clamped function | Rémi Denis-Courmont | 2022-09-28
|
* lavc/idctdsp: RISC-V V put_pixels_clamped function | Rémi Denis-Courmont | 2022-09-28
|
* riscv: Use the correct path for including asm.S | Martin Storsjö | 2022-09-28
| Signed-off-by: Martin Storsjö <martin@martin.st>
* lavc/aacpsdsp: RISC-V V stereo_interpolate[0] | Rémi Denis-Courmont | 2022-09-27
|
* lavc/aacpsdsp: RISC-V V hybrid_synthesis_deint | Rémi Denis-Courmont | 2022-09-27
|
* lavc/aacpsdsp: RISC-V V hybrid_analysis_ileave | Rémi Denis-Courmont | 2022-09-27
|
* lavc/aacpsdsp: RISC-V V hybrid_analysis | Rémi Denis-Courmont | 2022-09-27
| This starts with one-time initialisation of the 26 constant factors like 08edacc248bce3f8946d75e97188d189c74a6de6. That is done with the scalar instruction set. While the formula can readily be vectorised, the gains would (probably) be more than lost in transferring the results back to FP registers (or suitably reshuffling them into vector registers).
|
| Note that the main loop could likely be scheduled slightly better by expanding the filter macro and interleaving loads with arithmetic. It is not clear yet if that would be relevant for vector processing (as opposed to traditional SIMD).
|
| We could also use fewer vectors, but there is not much point in sparing them (they are *all* callee-clobbered).
* lavc/aacpsdsp: RISC-V V mul_pair_single | Rémi Denis-Courmont | 2022-09-27
|
* lavc/aacpsdsp: RISC-V V add_squares | Rémi Denis-Courmont | 2022-09-27
|
* lavc/vorbisdsp: RISC-V V inverse_coupling | Rémi Denis-Courmont | 2022-09-27
| This uses the following vectorisation:
|
|     for (i = 0; i < blocksize; i++) {
|         ang[i] = mag[i] - copysignf(fmaxf(ang[i], 0.f), mag[i]);
|         mag[i] = mag[i] - copysignf(fminf(ang[i], 0.f), mag[i]);
|     }
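The branch-free formula can be checked against a branching form with the following C sketch. The branching version is an assumption about the usual scalar inverse coupling; all function names are hypothetical. Note that the branch-free version reads the original `ang` value in both lines, as a vector implementation would after loading it once. Only nonzero magnitudes are compared here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sign injection: magnitude of value, sign bit of sign_src. */
static float copy_sign(float value, float sign_src)
{
    uint32_t v, s;

    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&v, &value, sizeof(v));
    memcpy(&s, &sign_src, sizeof(s));
    v = (v & 0x7fffffffu) | (s & 0x80000000u);
    memcpy(&value, &v, sizeof(value));
    return value;
}

static float max0(float x) { return x > 0.f ? x : 0.f; }
static float min0(float x) { return x < 0.f ? x : 0.f; }

/* Assumed branching scalar form of inverse coupling for one sample. */
static void coupling_branchy(float *mag, float *ang)
{
    if (*mag > 0.f) {
        if (*ang > 0.f) {
            *ang = *mag - *ang;
        } else {
            float temp = *ang;
            *ang = *mag;
            *mag += temp;
        }
    } else {
        if (*ang > 0.f) {
            *ang += *mag;
        } else {
            float temp = *ang;
            *ang = *mag;
            *mag -= temp;
        }
    }
}

/* Branch-free form from the commit message, on the original values. */
static void coupling_branchless(float *mag, float *ang)
{
    const float m = *mag, a = *ang;

    *ang = m - copy_sign(max0(a), m);
    *mag = m - copy_sign(min0(a), m);
}
```

For the four sign combinations of a nonzero `mag` and `ang`, both versions produce the same pair of outputs.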
* lavc/fmtconvert: RISC-V V int32_to_float_fmul_array8 | Rémi Denis-Courmont | 2022-09-27
|
* lavc/fmtconvert: RISC-V V int32_to_float_fmul_scalar | Rémi Denis-Courmont | 2022-09-27
|
* lavc/audiodsp: RISC-V V scalarproduct_int16 | Rémi Denis-Courmont | 2022-09-27
|
* lavc/audiodsp: RISC-V V vector_clipf | Rémi Denis-Courmont | 2022-09-27
|
* lavc/audiodsp: RISC-V V vector_clip_int32 | Rémi Denis-Courmont | 2022-09-27
|
* lavc/pixblockdsp: RISC-V I get_pixels | Rémi Denis-Courmont | 2022-09-27
| Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech):
|
| get_pixels_c: 180.0
| get_pixels_rvi: 136.7
* lavc/audiodsp: RISC-V F vector_clipf | Rémi Denis-Courmont | 2022-09-27
| RV64G supports MIN & MAX instructions natively only on floating point registers, not general purpose ones. The latter would require the Zbb extension. Due to that, it is actually faster to perform the clipping "properly" in FPU.
|
| Benchmarks on SiFive U74-MC (courtesy of Shanghai StarFive Tech):
|
| audiodsp.vector_clipf_c: 29551.5
| audiodsp.vector_clipf_rvf: 17871.0
|
| Also tried unrolling with 2 or 8 elements but it gets worse either way.