| Commit message (Collapse) | Author | Age |
| |
|
|
|
|
|
|
| |
Fixes compilation with old yasm.
Signed-off-by: James Almer <jamrial@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I spotted an interesting pattern that I didn't see before that leads to the implementation being faster.
The bit shifting table I was using before is no longer needed, and was able to remove quite a few lines.
I also add use of FMA on the AVX2 version.
f32 1920x1080 1 thread with prelut
c impl
1434012700 UNITS in lut3d->interp, 1 runs, 0 skips
1434035335 UNITS in lut3d->interp, 2 runs, 0 skips
1423615347 UNITS in lut3d->interp, 4 runs, 0 skips
1426268863 UNITS in lut3d->interp, 8 runs, 0 skips
sse2
905484420 UNITS in lut3d->interp, 1 runs, 0 skips
905659010 UNITS in lut3d->interp, 2 runs, 0 skips
915167140 UNITS in lut3d->interp, 4 runs, 0 skips
915834222 UNITS in lut3d->interp, 8 runs, 0 skips
avx
574794860 UNITS in lut3d->interp, 1 runs, 0 skips
581035090 UNITS in lut3d->interp, 2 runs, 0 skips
584116720 UNITS in lut3d->interp, 4 runs, 0 skips
581460290 UNITS in lut3d->interp, 8 runs, 0 skips
avx2
301698880 UNITS in lut3d->interp, 1 runs, 0 skips
301982880 UNITS in lut3d->interp, 2 runs, 0 skips
306962430 UNITS in lut3d->interp, 4 runs, 0 skips
305472025 UNITS in lut3d->interp, 8 runs, 0 skips
gbrap16 1920x1080 1 thread with prelut
c impl
1480894840 UNITS in lut3d->interp, 1 runs, 0 skips
1502922990 UNITS in lut3d->interp, 2 runs, 0 skips
1496114307 UNITS in lut3d->interp, 4 runs, 0 skips
1492554551 UNITS in lut3d->interp, 8 runs, 0 skips
sse2
980777180 UNITS in lut3d->interp, 1 runs, 0 skips
986121520 UNITS in lut3d->interp, 2 runs, 0 skips
986489840 UNITS in lut3d->interp, 4 runs, 0 skips
998832248 UNITS in lut3d->interp, 8 runs, 0 skips
avx
622212360 UNITS in lut3d->interp, 1 runs, 0 skips
622981160 UNITS in lut3d->interp, 2 runs, 0 skips
645396315 UNITS in lut3d->interp, 4 runs, 0 skips
641057075 UNITS in lut3d->interp, 8 runs, 0 skips
avx2
321336400 UNITS in lut3d->interp, 1 runs, 0 skips
321268920 UNITS in lut3d->interp, 2 runs, 0 skips
323459895 UNITS in lut3d->interp, 4 runs, 0 skips
324949967 UNITS in lut3d->interp, 8 runs, 0 skips
|
|
|
|
| |
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
|
|
|
|
|
|
|
|
|
|
|
| |
The problem was caused by if the width of the processed block
minus 1 is a multiple of the aligned number the instruction
jle .bscale_scalar would skip the Optimized Loop Step, which
will lead to an incorrect sampling when specifying steps more
than 1. Move the Optimized Loop Step after .bscale_scalar to
ensure the loop step is enabled.
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
|
|
|
|
| |
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We introduced a ff_horiz_slice_avx2/512() implemented on a new algorithm.
In a nutshell, the new algorithm does three things, gathering data from
8/16 rows, blurring data, and scattering data back to the image buffer.
Here we used a customized transpose 8x8/16x16 to avoid the huge overhead
brought by gather and scatter instructions, which is dependent on the
temporary buffer called localbuf added newly.
Performance data:
ff_horiz_slice_avx2(old): 109.89
ff_horiz_slice_avx2(new): 666.67
ff_horiz_slice_avx512: 1000
Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new vertical slice with AVX2/512 acceleration can significantly
improve the performance of Gaussian Filter 2D.
Performance data:
ff_verti_slice_c: 32.57
ff_verti_slice_avx2: 476.19
ff_verti_slice_avx512: 833.33
Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
|
|
|
|
|
|
| |
Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
|
| |
|
|
|
|
| |
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
|
|
|
|
| |
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
|
|
|
|
|
| |
x86_32 ABI does not pass float arguments directly on xmm regs, and the Win64
ABI uses only the first four regs for this purpose.
Signed-off-by: James Almer <jamrial@gmail.com>
|
| |
|
|
|
|
| |
Based on patch by Xu Jun <xujunzz@sjtu.edu.cn>
|
| |
|
| |
|
| |
|
|
|
|
| |
This reverts commit 2a9b934675b9e2d3850b46f8a618c19b03f02551.
|
|
|
|
| |
Signed-off-by: Limin Wang <lance.lmwang@gmail.com>
|
|
|
|
|
|
| |
Finishes fixing ticket #8771
Signed-off-by: James Almer <jamrial@gmail.com>
|
| |
|
| |
|
|
|
|
|
|
|
| |
This fixes the tests filter-refcmp-ssim-yuv and filter-refcmp-ssim-rgb
on i386 after breaking in fcc0424c933742c8fc852371e985d16b6eb4bfe9.
Signed-off-by: Martin Storsjö <martin@martin.st>
|
|
|
|
| |
Use doubles for accumulating floats.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Fixes crashes in command lines such as:
ffmpeg -f lavfi -i testsrc2=704x576:r=50,interlace,pad=720:576:8 -f null none
Related to ticket #6491.
Signed-off-by: Marton Balint <cus@passwd.hu>
|
| |
|
|
|
|
|
| |
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
|
|
| |
Signed-off-by: James Almer <jamrial@gmail.com>
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
Fixes #8282
|
|
|
|
|
|
|
|
| |
They are not allowed outside of functions. Fixes the warning
"ISO C does not allow extra ‘;’ outside of a function [-Wpedantic]"
when compiling with GCC and -pedantic.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@gmail.com>
|
|
|
|
| |
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
|
|
| |
Signed-off-by: Ting Fu <ting.fu@intel.com>
|
|
|
|
| |
Signed-off-by: Ting Fu <ting.fu@intel.com>
|
| |
|
|
|
|
| |
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
|
|
| |
Signed-off-by: James Almer <jamrial@gmail.com>
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
Tested using a simple command (apply edge enhance):
./ffmpeg_g -i ~/Downloads/bbb_sunflower_1080p_30fps_normal.mp4 \
-vf convolution="0 0 0 -1 1 0 0 0 0:0 0 0 -1 1 0 0 0 0:0 0 0 -1 1 0 0 0 0:0 0 0 -1 1 0 0 0 0:5:1:1:1:0:128:128:128" \
-an -vframes 1000 -f null /dev/null
The fps increase from 151 to 270 on my local machine.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
|
|
|
|
|
|
| |
Fixes compilation on x86_32
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The horizontal pass get ~2x performance with the patch
under single thread.
Tested overall performance using the command(avx2 enabled):
./ffmpeg -i 1080p.mp4 -vf gblur -f null /dev/null
./ffmpeg -i 1080p.mp4 -vf gblur=threads=1 -f null /dev/null
For single thread, the fps improves from 43 to 60, about 40%.
For multi-thread, the fps improves from 110 to 130, about 20%.
Signed-off-by: Ruiling Song <ruiling.song@intel.com>
|
| |
|
|
|
|
|
|
| |
Fixes compilation with old yasm versions.
Signed-off-by: James Almer <jamrial@gmail.com>
|