libav.git

	Commit message	Author	Age
*	libavfilter/x86/vf_gblur: correct the order of loop step	Wu Jianhua	2021-09-18
	The problem occurred when the width of the processed block minus 1 is a multiple of the aligned number: the instruction jle .bscale_scalar would skip the Optimized Loop Step, leading to incorrect sampling when more than one step is specified. Move the Optimized Loop Step after .bscale_scalar to ensure the loop step is not skipped.
	Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
*	libavfilter/x86/vf_gblur: fixed the fate-test failed on MacOS	Wu Jianhua	2021-09-18
	Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
*	libavfilter/x86/vf_gblur: add localbuf and ff_horiz_slice_avx2/512()	Wu Jianhua	2021-08-29
	We introduced ff_horiz_slice_avx2/512(), implemented with a new algorithm. In a nutshell, the new algorithm does three things: gathering data from 8/16 rows, blurring the data, and scattering the data back to the image buffer. Here we used a customized 8x8/16x16 transpose to avoid the huge overhead of gather and scatter instructions; the transpose depends on a newly added temporary buffer called localbuf.
	Performance data:
	ff_horiz_slice_avx2(old): 109.89
	ff_horiz_slice_avx2(new): 666.67
	ff_horiz_slice_avx512: 1000
	Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
	Co-authored-by: Jin Jun <jun.i.jin@intel.com>
	Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
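The gather, blur, scatter idea in this commit message can be sketched in plain C. This is only an illustration under stated assumptions, not the AVX2/AVX-512 assembly from the commit: the names blur_row, blur_rows_blocked and LANES are hypothetical, the recursive forward/backward filter is assumed to match the scalar horizontal pass, and the transpose is written as ordinary loops rather than SIMD shuffles.

```c
#define LANES 8  /* rows handled together; 8 is assumed to mirror the AVX2 case */

/* Scalar reference (assumed shape of the horizontal pass): for each step,
 * scale the first sample, filter rightwards, scale the last sample,
 * then filter leftwards. */
static void blur_row(float *ptr, int width, int steps, float nu, float bscale)
{
    for (int step = 0; step < steps; step++) {
        ptr[0] *= bscale;
        for (int x = 1; x < width; x++)
            ptr[x] += nu * ptr[x - 1];
        ptr[width - 1] *= bscale;
        for (int x = width - 1; x > 0; x--)
            ptr[x - 1] += nu * ptr[x];
    }
}

/* Blocked variant: copy up to LANES rows into localbuf in transposed order,
 * so that the samples of all rows at the same x are contiguous, blur them
 * together, then copy the result back. localbuf must hold width * LANES floats. */
static void blur_rows_blocked(float *buffer, int width, int height, int steps,
                              float nu, float bscale, float *localbuf)
{
    for (int y0 = 0; y0 < height; y0 += LANES) {
        int rows = height - y0 < LANES ? height - y0 : LANES;

        /* Gather: image rows -> transposed local buffer. */
        for (int r = 0; r < rows; r++)
            for (int x = 0; x < width; x++)
                localbuf[x * LANES + r] = buffer[(y0 + r) * width + x];

        /* Blur: the inner loop over r is what one SIMD vector would cover. */
        for (int step = 0; step < steps; step++) {
            for (int r = 0; r < rows; r++)
                localbuf[r] *= bscale;
            for (int x = 1; x < width; x++)
                for (int r = 0; r < rows; r++)
                    localbuf[x * LANES + r] += nu * localbuf[(x - 1) * LANES + r];
            for (int r = 0; r < rows; r++)
                localbuf[(width - 1) * LANES + r] *= bscale;
            for (int x = width - 1; x > 0; x--)
                for (int r = 0; r < rows; r++)
                    localbuf[(x - 1) * LANES + r] += nu * localbuf[x * LANES + r];
        }

        /* Scatter: transposed local buffer -> image rows. */
        for (int r = 0; r < rows; r++)
            for (int x = 0; x < width; x++)
                buffer[(y0 + r) * width + x] = localbuf[x * LANES + r];
    }
}
```

The point of the transposed layout is that the recursion runs along x while the independent rows sit side by side, so each filter update maps onto one vector operation instead of a gather or scatter per sample.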
*	libavfilter/x86/vf_gblur: add ff_verti_slice_avx2/512()	Wu Jianhua	2021-08-29
	The new vertical slice with AVX2/512 acceleration can significantly improve the performance of Gaussian Filter 2D.
	Performance data:
	ff_verti_slice_c: 32.57
	ff_verti_slice_avx2: 476.19
	ff_verti_slice_avx512: 833.33
	Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
	Co-authored-by: Jin Jun <jun.i.jin@intel.com>
	Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
*	libavfilter/x86/vf_gblur: add ff_postscale_slice_avx512()	Wu Jianhua	2021-08-29
	Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
	Co-authored-by: Jin Jun <jun.i.jin@intel.com>
	Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
*	x86/vf_gblur: fix reg name in UNIX64 prologue	James Almer	2021-02-17
	Signed-off-by: James Almer <jamrial@gmail.com>
*	x86/vf_gblur: fix postscale_slice prologue	James Almer	2021-02-17
	x86_32 ABI does not pass float arguments directly on xmm regs, and the Win64 ABI uses only the first four regs for this purpose.
	Signed-off-by: James Almer <jamrial@gmail.com>
*	avfilter/x86/vf_gblur: add postscale SIMD	Paul B Mahol	2021-02-16
*	avfilter/vf_gblur: fix heap-buffer overflow	Paul B Mahol	2019-10-16
	Fixes #8282
*	avfilter/vf_gblur: add x86 SIMD optimizations	Ruiling Song	2019-06-12
	The horizontal pass gets ~2x performance with the patch under single thread. Tested overall performance using the commands (avx2 enabled):
	./ffmpeg -i 1080p.mp4 -vf gblur -f null /dev/null
	./ffmpeg -i 1080p.mp4 -vf gblur=threads=1 -f null /dev/null
	For single thread, the fps improves from 43 to 60, about 40%. For multi-thread, the fps improves from 110 to 130, about 20%.
	Signed-off-by: Ruiling Song <ruiling.song@intel.com>
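The percentages quoted in this commit message follow from the fps figures. A minimal sketch of the arithmetic, where fps_gain_percent is a hypothetical helper introduced only for illustration:

```c
/* Relative improvement of `after` over `before`, in percent. */
static double fps_gain_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With the numbers above, fps_gain_percent(43, 60) is roughly 39.5 and fps_gain_percent(110, 130) is roughly 18.2, in line with the rounded "about 40%" and "about 20%".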