| Commit message (Collapse) | Author | Age |
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
Fixes "redundant redeclaration" warnings.
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
|
|
| |
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
|
|
|
|
|
|
|
| |
Also a slight change to the ssse3 code, which prevents a theoretical
overflow in the sharp filter.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
|
|
|
|
| |
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
|
|
|
|
|
|
|
| |
Roughly 25% faster MC than ssse3 for blocksizes 32 and 64.
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tos3k-vp9-b10000.webm on a Core i5-4200U @1.6GHz
1219 decicycles in ff_vp9_ipred_dc_32x32_ssse3, 131070 runs, 2 skips
439 decicycles in ff_vp9_ipred_dc_32x32_avx2, 131070 runs, 2 skips
3570 decicycles in ff_vp9_ipred_dc_top_32x32_ssse3, 4096 runs, 0 skips
2494 decicycles in ff_vp9_ipred_dc_top_32x32_avx2, 4096 runs, 0 skips
1419 decicycles in ff_vp9_ipred_dc_left_32x32_ssse3, 16384 runs, 0 skips
717 decicycles in ff_vp9_ipred_dc_left_32x32_avx2, 16384 runs, 0 skips
2737 decicycles in ff_vp9_ipred_tm_32x32_avx, 1024 runs, 0 skips
2088 decicycles in ff_vp9_ipred_tm_32x32_avx2, 1024 runs, 0 skips
3090 decicycles in ff_vp9_ipred_v_32x32_avx, 512 runs, 0 skips
2226 decicycles in ff_vp9_ipred_v_32x32_avx2, 512 runs, 0 skips
1565 decicycles in ff_vp9_ipred_h_32x32_avx, 1024 runs, 0 skips
922 decicycles in ff_vp9_ipred_h_32x32_avx2, 1024 runs, 0 skips
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
|
| |
|
|
|
|
|
| |
Partially based on h264_intrapred. (I hope to eventually merge these
two intrapred implementations back together.)
|
| |
|
|
|
|
|
| |
5.40s → 5.30s overall decode time with -threads 1 on ped1080p.webm
(i7 920, ssse3)
|
|
|
|
|
|
| |
Similar gains as the ssse3 version once again
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
9680 decicycles in loop_filter_v_88_16_c, 4193765 runs, 539 skips
9233 decicycles in loop_filter_h_88_16_c, 4193751 runs, 553 skips
1929 decicycles in ff_vp9_loop_filter_v_88_16_ssse3, 4194118 runs, 186 skips
2738 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193861 runs, 443 skips
5.978 → 5.417 overall decode time on ped1080p.webm (-threads 1)
Adding SSE2 support should be relatively trivial (just a matter of
changing the pshufb [mask_mix] with something else), patch welcome.
|
| |
|
|
|
|
|
|
|
|
| |
Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
idct_idct: 66 -> 67 cycles (noise measurement)
idct_iadst: 199 -> 79 cycles
iadst_idct: 165 -> 70 cycles
iadst_iadst: 183 -> 82 cycles
|
|
|
|
|
|
|
|
| |
Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
idct_idct: 133 -> 135 cycles (noise measurement)
idct_iadst: 900 -> 241 cycles
iadst_idct: 864 -> 215 cycles
iadst_iadst: 973 -> 310 cycles
|
|
|
|
|
|
|
|
| |
pavgb is an sse integer instruction, so the mmxext flag is enough
Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
|
|
|
|
|
|
| |
Similar gains in performance as the SSSE3 version
Signed-off-by: James Almer <jamrial@gmail.com>
|
|
|
|
|
|
|
|
| |
Sample timings on ped1080p.webm (of the ssse3 functions):
iadst_idct: 4672 -> 1175 cycles
idct_iadst: 4736 -> 1263 cycles
iadst_iadst: 4924 -> 1438 cycles
Total decoding time changed from 6.565s to 6.413s.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
16662 decicycles in loop_filter_h_16_16_c, 8387355 runs, 1253 skips
17510 decicycles in loop_filter_v_16_16_c, 8387516 runs, 1092 skips
4941 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 8387887 runs, 721 skips
3899 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 8387980 runs, 628 skips
Overall decode time goes from:
./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 8.10s user 0.02s system 99% cpu 8.126 total
to:
./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null - 6.15s user 0.04s system 99% cpu 6.199 total
(46 to 61 fps)
|
|\
| |
| |
| |
| |
| |
| | |
* commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b':
x86: avcodec: Add a bunch of missing #includes for av_cold
Merged-by: Michael Niedermayer <michaelni@gmx.at>
|
| | |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Originally written by Ronald S. Bultje <rsbultje@gmail.com> and
Clément Bœsch <u@pkh.me>
Further contributions by:
Anton Khirnov <anton@khirnov.net>
Diego Biurrun <diego@biurrun.de>
Luca Barbato <lu_zero@gentoo.org>
Martin Storsjö <martin@martin.st>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
|
|
|
|
|
|
|
|
|
|
| |
Sub-IDCTs will follow later. ped1080.webm goes from 9.295s to 8.191s
(13.5% faster). The IDCT itself goes from 4372 (intra) or 4337 (inter)
to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra)
or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC
form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra)
or 2392 (inter) for all 32x32s together, i.e. about ~7x faster (all
tests done on ped1080p.webm).
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Cycle counts for large MCs (old -> new on ped1080p.webm, mx!=0&&my!=0):
16x8: 876 -> 870 (0.7%)
16x16: 1444 -> 1435 (0.7%)
16x32: 2784 -> 2748 (1.3%)
32x16: 2455 -> 2349 (4.5%)
32x32: 4641 -> 4084 (13.6%)
32x64: 9200 -> 7834 (17.4%)
64x32: 8980 -> 7197 (24.8%)
64x64: 17330 -> 13796 (25.6%)
Total decoding time goes from 9.326sec to 9.182sec.
|
|
|
|
|
|
|
| |
Currently only dc-only and full 16x16. Other subforms will follow in the
near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3
seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes
from ~4050 to ~745 cycles.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Original fix by one of these developers:
Anton Khirnov <anton@khirnov.net>
Diego Biurrun <diego@biurrun.de>
Luca Barbato <lu_zero@gentoo.org>
Martin Storsjö <martin@martin.st>
See 97962b2 / 72ca830
Personnal guess is Diego Biurrun.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1789 decicycles in idct_idct_4x4_add_c, 262136 runs, 8 skips
1839 decicycles in idct_idct_4x4_add_c, 524270 runs, 18 skips
1864 decicycles in idct_idct_4x4_add_c, 1048548 runs, 28 skips
529 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 262138 runs, 6 skips
516 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 524282 runs, 6 skips
474 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 1048565 runs, 11 skips
(~3.9x faster)
7726 decicycles in idct_idct_8x8_add_c, 1048433 runs, 143 skips
7732 decicycles in idct_idct_8x8_add_c, 2096882 runs, 270 skips
7731 decicycles in idct_idct_8x8_add_c, 4193772 runs, 532 skips
1145 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 1048549 runs, 27 skips
1137 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 2097097 runs, 55 skips
1086 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 4194188 runs, 116 skips
(~7.1x faster)
Overall decode time before commit:
16.48s user 0.03s system 99% cpu 16.526 total
16.54s user 0.01s system 99% cpu 16.566 total
16.46s user 0.03s system 99% cpu 16.511 total
Overall decode time after commit:
16.34s user 0.02s system 99% cpu 16.378 total
16.28s user 0.02s system 99% cpu 16.315 total
16.32s user 0.03s system 99% cpu 16.366 total
Tested on i7 920 with 40s 1080p footage.
|
|
|
|
| |
Decoding time of ped1080p.webm goes from 11.3sec to 11.1sec.
|
|
Decoding time of ped1080p.webm goes from 20.7sec to 11.3sec.
|