path: root/libswscale/aarch64
Commit message | Author | Age
* sws: rename SwsContext.swscale to convert_unscaled | Anton Khirnov | 2021-07-03
    That function pointer is now used only for unscaled conversion.
* Include attributes.h directly | Andreas Rheinhardt | 2021-04-19
    Some files currently rely on libavutil/cpu.h to include it for them; yet said file won't include it any more after the currently deprecated functions are removed, so include attributes.h directly.

    Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
* aarch64/yuv2rgb_neon: fix return value | Lynne | 2020-07-09
    We return 0 for this particular architecture but should instead be returning the number of lines. Fixes callers that check that the return value matches what they expect.
* swscale: aarch64: Add a NEON implementation of interleaveBytes | Martin Storsjö | 2020-05-15
    This allows speeding up format conversions from yuv420 to nv12.

    Cortex                              A53      A72      A73
    interleave_bytes_c:             86077.5  51433.0  66972.0
    interleave_bytes_neon:          19701.7  23019.2  15859.2
    interleave_bytes_aligned_c:     86603.0  52017.2  67484.2
    interleave_bytes_aligned_neon:   9061.0   7623.0   6309.0

    Signed-off-by: Martin Storsjö <martin@martin.st>
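
For readers unfamiliar with the operation being accelerated, here is a minimal scalar sketch of byte interleaving, as used when merging separate U and V planes into the single UV plane of nv12. The function name `interleave_bytes_ref` and its signature are illustrative assumptions, not the libswscale API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: for every row, dst[2*x] = src1[x] and
 * dst[2*x + 1] = src2[x]. Strides are in bytes. */
static void interleave_bytes_ref(const uint8_t *src1, const uint8_t *src2,
                                 uint8_t *dst, int width, int height,
                                 ptrdiff_t src1_stride, ptrdiff_t src2_stride,
                                 ptrdiff_t dst_stride)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            dst[2 * x]     = src1[x];  /* e.g. a U sample */
            dst[2 * x + 1] = src2[x];  /* e.g. a V sample */
        }
        src1 += src1_stride;
        src2 += src2_stride;
        dst  += dst_stride;
    }
}
```

A NEON version would typically load a vector from each source and store both with an interleaving store, which is presumably where the gain in the table above comes from.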
* swscale: fix NEON hscale init | Josh de Kock | 2020-05-15
    The NEON hscale function only supports X8 filter sizes and should only be selected when these are being used. At the moment filterAlign is set to 8, but in the future, when extra NEON assembly for specific sizes is added, those sizes will need checks here too.

    The immediate use case for this change is making the hscale checkasm test easier and free of NEON-specific edge cases (x86 already has these guards).

    Signed-off-by: Josh de Kock <josh@itanimul.li>
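
The guard described in this commit can be sketched as a small predicate. This is an illustration only: `can_use_neon_hscale` and its parameters are hypothetical names, and the real init code may check additional conditions.

```c
#include <stdbool.h>

/* Hypothetical selector: pick the NEON hscale routine only when NEON is
 * available and the filter size is a positive multiple of 8. */
static bool can_use_neon_hscale(bool have_neon, int filter_size)
{
    return have_neon && filter_size > 0 && filter_size % 8 == 0;
}
```

Any other filter size would fall back to the C implementation, so a test harness can exercise arbitrary sizes without special-casing NEON.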
* swscale: aarch64: Don't clobber callee-saved registers v8-v15 | Martin Storsjö | 2020-04-21
    Signed-off-by: Martin Storsjö <martin@martin.st>
* swscale: aarch64: Avoid using the x18 register | Martin Storsjö | 2020-04-20
    x18 is a reserved platform register on Darwin and Windows. x8/w8 seems to be unused in this function though (the same goes for x10 and x14), so there's really no reason to use x18 here - just change the uses of x18/w18 into x8/w8 instead without any further rewrites.

    Signed-off-by: Martin Storsjö <martin@martin.st>
* swscale/aarch64: use multiply accumulate and shift-right narrow | Sebastian Pop | 2020-01-04
    This patch rewrites the innermost loop of ff_yuv2planeX_8_neon to avoid zips and horizontal adds by using fused multiply adds. The patch also uses ld1r to load one element and replicate it across all lanes of the vector. The patch also improves the clipping code by removing the shift right instructions and performing the shift with the shift-right narrow instructions.

    I see an 8% difference on an m6g instance with neoverse-n1 CPUs:

    $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
    before: t:0.014015 avg:0.014096 max:0.015018 min:0.013971
    after:  t:0.012985 avg:0.013013 max:0.013996 min:0.012818

    Tested with `make check` on aarch64-linux.

    Signed-off-by: Sebastian Pop <spop@amazon.com>
    Reviewed-by: Clément Bœsch <u@pkh.me>
    Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
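
As a rough scalar model of what the rewritten loop computes, the sketch below accumulates filter taps per output pixel, then shifts and clips to 8 bits. It assumes the conventions of the C fallback as I understand them (12-bit coefficients, 15-bit intermediates, a final shift by 19, dither scaled by 12 bits); the names `yuv2planex_8_ref` and `clip_uint8` are illustrative.

```c
#include <stdint.h>

/* Clamp to the 0..255 range. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar model of a vertical scale to 8-bit output:
 * each output pixel is a multiply-accumulate over filter_size input rows,
 * followed by a right shift and a clip. */
static void yuv2planex_8_ref(const int16_t *filter, int filter_size,
                             const int16_t *const *src, uint8_t *dest,
                             int dst_w, const uint8_t *dither, int offset)
{
    for (int i = 0; i < dst_w; i++) {
        int val = dither[(i + offset) & 7] << 12;   /* assumed dither scaling */
        for (int j = 0; j < filter_size; j++)
            val += src[j][i] * filter[j];           /* multiply-accumulate */
        /* Negative sums clip to 0; handled first to avoid shifting them. */
        dest[i] = val < 0 ? 0 : clip_uint8(val >> 19);
    }
}
```

In the NEON version the inner `val +=` maps naturally onto multiply-accumulate instructions, and the shift plus narrowing to bytes onto the shift-right narrow instructions the commit mentions.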
* swscale/aarch64: use multiply accumulate and increase vector factor to 4 | Sebastian Pop | 2019-12-17
    This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate and bumps the vectorization factor from 2 to 4.

    The speedup is 25% on Graviton1 A1 instances based on A-72 CPUs:

    $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
    before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214
    after:  t:0.032168 avg:0.032215 max:0.033081 min:0.032146

    The speedup is 39% on Graviton2 m6g instances based on Neoverse-N1 CPUs:

    $ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
    before: t:0.019446 avg:0.019423 max:0.019493 min:0.019181
    after:  t:0.014015 avg:0.014096 max:0.015018 min:0.013971

    Tested with `make check` on aarch64-linux.

    Signed-off-by: Sebastian Pop <spop@amazon.com>
    Reviewed-by: Jean-Baptiste Kempf <jb@videolan.org>
    Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
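
For context, a scalar sketch of the horizontal scale that this routine vectorizes is shown below. It assumes the layout of the C fallback as I understand it (one run of `filter_size` 14-bit coefficients per output pixel, result shifted right by 7 and saturated to 15 bits); the name `hscale_8_to_15_ref` is illustrative.

```c
#include <stdint.h>

/* Hypothetical scalar model: each output sample is a dot product of
 * filter_size source bytes, starting at filter_pos[i], with that output's
 * own coefficients. Assumes an arithmetic right shift for negative sums,
 * which is what common compilers provide. */
static void hscale_8_to_15_ref(int16_t *dst, int dst_w, const uint8_t *src,
                               const int16_t *filter,
                               const int32_t *filter_pos, int filter_size)
{
    for (int i = 0; i < dst_w; i++) {
        int val = 0;
        for (int j = 0; j < filter_size; j++)
            val += src[filter_pos[i] + j] * filter[filter_size * i + j];
        val >>= 7;
        dst[i] = (int16_t)(val > 32767 ? 32767 : val);  /* keep within 15 bits */
    }
}
```

Raising the vectorization factor from 2 to 4 would correspond to computing four such `dst[i]` values per loop iteration instead of two.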
* sws/aarch64: add ff_yuv2planeX_8_neon | Clément Bœsch | 2016-04-11
* sws/aarch64/yuv2rgb: honor iOS calling convention | Clément Bœsch | 2016-04-08
    y_offset and y_coeff being successive 32-bit integers, they are packed into 8 bytes instead of 2x8 bytes.

    See https://developer.apple.com/library/ios/documentation/Xcode/Conceptual/iPhoneOSABIReference/Articles/ARM64FunctionCallingConventions.html

    > iOS diverges from Procedure Call Standard for the ARM 64-bit
    > Architecture in several ways
    [...]
    > In the generic procedure call standard, all function arguments passed
    > on the stack consume slots in multiples of 8 bytes. In iOS, this
    > requirement is dropped, and values consume only the space required.
    [...]
    > Padding is still inserted on the stack to satisfy arguments’ alignment
    > requirements.
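
The difference between the two conventions can be illustrated by computing stack offsets for a list of scalar arguments. This is a simplified sketch: `stack_arg_offsets` is a hypothetical helper, and it only models scalar arguments of 1, 2, 4 or 8 bytes whose alignment equals their size.

```c
#include <stddef.h>

/* Round v up to a multiple of a. */
static size_t align_up(size_t v, size_t a)
{
    return (v + a - 1) / a * a;
}

/* Hypothetical helper: fill offsets[] with the byte offset of each
 * stack-passed scalar argument and return the total bytes used.
 * apple_packing == 0: every argument occupies a multiple of 8 bytes.
 * apple_packing == 1: arguments use only their own size, with padding
 *                     inserted only to satisfy natural alignment. */
static size_t stack_arg_offsets(const size_t *sizes, int n,
                                int apple_packing, size_t *offsets)
{
    size_t pos = 0;
    for (int i = 0; i < n; i++) {
        size_t align = apple_packing ? sizes[i] : 8;
        size_t slot  = apple_packing ? sizes[i] : align_up(sizes[i], 8);
        pos = align_up(pos, align);
        offsets[i] = pos;
        pos += slot;
    }
    return pos;
}
```

For two successive 32-bit integers such as y_offset and y_coeff, this gives offsets 0 and 8 under the generic rule and 0 and 4 under the iOS rule, which is why the assembly has to load them differently.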
* sws/aarch64: restore ff_hscale_8_to_15_neon() | Clément Bœsch | 2016-04-05
    Fix final scaling and required filter alignment. Pass FATE.
* sws/aarch64: disable ff_hscale_8_to_15_neon temporarily | Clément Bœsch | 2016-04-01
    Looks broken.
* sws/aarch64: add ff_hscale_8_to_15_neon | Clément Bœsch | 2016-03-31
    ./ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
    before: t:0.489726 avg:0.489883 max:0.491852 min:0.489482
    after:  t:0.256515 avg:0.256458 max:0.256999 min:0.253755
* sws/aarch64/yuv2rgb: save a few mul and add | Clément Bœsch | 2016-03-25
    27ms to 26ms with UHD 2160 input.
* sws/aarch64: add {nv12,nv21,yuv420p,yuv422p}_to_{argb,rgba,abgr,rgba}_neon | Clément Bœsch | 2016-03-01