path: root/libavcodec/x86/vp9dsp_init.c
Commit message (Author, Age)
* vp9/x86: make filter_16_h work on 32-bit. (Ronald S. Bultje, 2014-12-27)
|
* vp9/x86: make filter_48/84/88_h work on 32-bit. (Ronald S. Bultje, 2014-12-27)
|
* vp9/x86: make filter_44_h work on 32-bit. (Ronald S. Bultje, 2014-12-27)
|
* vp9/x86: make filter_16_v work on 32-bit. (Ronald S. Bultje, 2014-12-27)
|
* vp9/x86: make filter_48/84_v work on 32-bit. (Ronald S. Bultje, 2014-12-27)
|
* vp9/x86: make filter_88_v work on 32-bit. (Ronald S. Bultje, 2014-12-27)
|
* vp9/x86: make filter_44_v work on 32-bit. (Ronald S. Bultje, 2014-12-27)
|
* x86/vp9: remove duplicate function prototypes (James Almer, 2014-12-23)
|   Fixes "redundant redeclaration" warnings.
|
|   Signed-off-by: James Almer <jamrial@gmail.com>
* vp9/x86: intra prediction sse2/32bit support. (Ronald S. Bultje, 2014-12-19)
|   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* vp9/x86: sse2 MC assembly. (Ronald S. Bultje, 2014-12-15)
|   Also a slight change to the ssse3 code, which prevents a theoretical overflow in the sharp filter.
|
|   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* vp9/x86: 32bit and sse2 support for vp9 inverse transform assembly (Ronald S. Bultje, 2014-12-15)
|   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* x86/vp9: add AVX and AVX2 MC (James Almer, 2014-09-22)
|   Roughly 25% faster MC than ssse3 for blocksizes 32 and 64.
|
|   Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
|   Signed-off-by: James Almer <jamrial@gmail.com>
* x86/vp9: initial AVX2 intra_pred (James Almer, 2014-06-08)
|   tos3k-vp9-b10000.webm on a Core i5-4200U @1.6GHz
|
|   1219 decicycles in ff_vp9_ipred_dc_32x32_ssse3, 131070 runs, 2 skips
|   439 decicycles in ff_vp9_ipred_dc_32x32_avx2, 131070 runs, 2 skips
|
|   3570 decicycles in ff_vp9_ipred_dc_top_32x32_ssse3, 4096 runs, 0 skips
|   2494 decicycles in ff_vp9_ipred_dc_top_32x32_avx2, 4096 runs, 0 skips
|
|   1419 decicycles in ff_vp9_ipred_dc_left_32x32_ssse3, 16384 runs, 0 skips
|   717 decicycles in ff_vp9_ipred_dc_left_32x32_avx2, 16384 runs, 0 skips
|
|   2737 decicycles in ff_vp9_ipred_tm_32x32_avx, 1024 runs, 0 skips
|   2088 decicycles in ff_vp9_ipred_tm_32x32_avx2, 1024 runs, 0 skips
|
|   3090 decicycles in ff_vp9_ipred_v_32x32_avx, 512 runs, 0 skips
|   2226 decicycles in ff_vp9_ipred_v_32x32_avx2, 512 runs, 0 skips
|
|   1565 decicycles in ff_vp9_ipred_h_32x32_avx, 1024 runs, 0 skips
|   922 decicycles in ff_vp9_ipred_h_32x32_avx2, 1024 runs, 0 skips
|
|   Signed-off-by: James Almer <jamrial@gmail.com>
|   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* x86/vp9mc: add vp9 namespace. (Clément Bœsch, 2014-03-29)
|
* vp9/x86: intra prediction SIMD. (Ronald S. Bultje, 2014-02-17)
|   Partially based on h264_intrapred. (I hope to eventually merge these two intrapred implementations back together.)
* x86/vp9lpf: add ff_vp9_loop_filter_[vh]_44_16_{sse2,ssse3,avx}. (Clément Bœsch, 2014-02-05)
|
* x86/vp9lpf: add ff_vp9_loop_filter_h_{48,84}_16_{sse2,ssse3,avx}(). (Clément Bœsch, 2014-01-30)
|   5.40s → 5.30s overall decode time with -threads 1 on ped1080p.webm (i7 920, ssse3)
* x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_sse2() (James Almer, 2014-01-28)
|   Similar gains as the ssse3 version once again
|
|   Signed-off-by: James Almer <jamrial@gmail.com>
* x86/vp9lpf: add ff_vp9_loop_filter_[vh]_88_16_{ssse3,avx}. (Clément Bœsch, 2014-01-28)
|   9680 decicycles in loop_filter_v_88_16_c, 4193765 runs, 539 skips
|   9233 decicycles in loop_filter_h_88_16_c, 4193751 runs, 553 skips
|   1929 decicycles in ff_vp9_loop_filter_v_88_16_ssse3, 4194118 runs, 186 skips
|   2738 decicycles in ff_vp9_loop_filter_h_88_16_ssse3, 4193861 runs, 443 skips
|
|   5.978 → 5.417 overall decode time on ped1080p.webm (-threads 1)
|
|   Adding SSE2 support should be relatively trivial (just a matter of changing the pshufb [mask_mix] with something else), patch welcome.
* vp9/x86: iwht4x4 (lossless) mmx. (Ronald S. Bultje, 2014-01-24)
|
* vp9/x86: 4x4 iadst SIMD (ssse3) variants. (Ronald S. Bultje, 2014-01-24)
|   Cycle measurements for intra itxfm_4x4_add on ped1080p.webm:
|   idct_idct: 66 -> 67 cycles (noise measurement)
|   idct_iadst: 199 -> 79 cycles
|   iadst_idct: 165 -> 70 cycles
|   iadst_iadst: 183 -> 82 cycles
* vp9/x86: 8x8 iadst SIMD (ssse3/avx) variants. (Ronald S. Bultje, 2014-01-24)
|   Cycle measurements for intra itxfm_8x8_add on ped1080p.webm:
|   idct_idct: 133 -> 135 cycles (noise measurement)
|   idct_iadst: 900 -> 241 cycles
|   iadst_idct: 864 -> 215 cycles
|   iadst_iadst: 973 -> 310 cycles
* vp9/x86: rename ff_avg[48]_sse to ff_avg[48]_mmxext (James Almer, 2014-01-18)
|   pavgb is an sse integer instruction, so the mmxext flag is enough
|
|   Signed-off-by: James Almer <jamrial@gmail.com>
|   Reviewed-by: "Ronald S. Bultje" <rsbultje@gmail.com>
|   Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
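For readers unfamiliar with `pavgb`, the instruction named in the commit above: it averages two vectors of unsigned bytes lane by lane, rounding halves up. The sketch below models only that arithmetic in Python. The names `pavgb_byte` and `avg_row` are hypothetical, and this is not a transcription of the `ff_avg` assembly.

```python
def pavgb_byte(a: int, b: int) -> int:
    """Rounded-up average of two unsigned bytes, as pavgb computes per lane."""
    if not (0 <= a <= 255 and 0 <= b <= 255):
        raise ValueError("operands must be unsigned 8-bit values")
    # The +1 makes halves round up; the intermediate sum needs 9 bits.
    return (a + b + 1) >> 1


def avg_row(dst: bytes, src: bytes) -> bytes:
    """Average two equal-length pixel rows byte by byte (hypothetical helper)."""
    if len(dst) != len(src):
        raise ValueError("rows must have the same length")
    return bytes(pavgb_byte(d, s) for d, s in zip(dst, src))
```

For example, `avg_row(bytes([0, 2]), bytes([1, 4]))` averages each pair with upward rounding, so neither result byte can exceed 255.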
* vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_sse2(). (James Almer, 2014-01-17)
|   Similar gains in performance as the SSSE3 version
|
|   Signed-off-by: James Almer <jamrial@gmail.com>
* vp9/x86: 16x16 iadst_idct, idct_iadst and iadst_iadst (ssse3+avx). (Ronald S. Bultje, 2014-01-16)
|   Sample timings on ped1080p.webm (of the ssse3 functions):
|   iadst_idct: 4672 -> 1175 cycles
|   idct_iadst: 4736 -> 1263 cycles
|   iadst_iadst: 4924 -> 1438 cycles
|
|   Total decoding time changed from 6.565s to 6.413s.
* vp9/x86: add AVX for itxfm and lpf. (Clément Bœsch, 2014-01-15)
|   4412 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 4193462 runs, 842 skips
|   3600 decicycles in ff_vp9_loop_filter_h_16_16_avx, 4193621 runs, 683 skips
|
|   3010 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 4193528 runs, 776 skips
|   2678 decicycles in ff_vp9_loop_filter_v_16_16_avx, 4193742 runs, 562 skips
|
|   23025 decicycles in ff_vp9_idct_idct_32x32_add_ssse3, 2096871 runs, 281 skips
|   19943 decicycles in ff_vp9_idct_idct_32x32_add_avx, 2096815 runs, 337 skips
|
|   4675 decicycles in ff_vp9_idct_idct_16x16_add_ssse3, 4194018 runs, 286 skips
|   3980 decicycles in ff_vp9_idct_idct_16x16_add_avx, 4194022 runs, 282 skips
|
|   967 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 16776972 runs, 244 skips
|   887 decicycles in ff_vp9_idct_idct_8x8_add_avx, 16777002 runs, 214 skips
* vp9/x86: add ff_vp9_loop_filter_[vh]_16_16_ssse3(). (Clément Bœsch, 2014-01-12)
|   16662 decicycles in loop_filter_h_16_16_c, 8387355 runs, 1253 skips
|   17510 decicycles in loop_filter_v_16_16_c, 8387516 runs, 1092 skips
|   4941 decicycles in ff_vp9_loop_filter_h_16_16_ssse3, 8387887 runs, 721 skips
|   3899 decicycles in ff_vp9_loop_filter_v_16_16_ssse3, 8387980 runs, 628 skips
|
|   Overall decode time goes from:
|   ./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null -
|   8.10s user 0.02s system 99% cpu 8.126 total
|   to:
|   ./ffmpeg -v 0 -nostats -threads 1 -i ~/samples/vp9/ped1080p.webm -f null -
|   6.15s user 0.04s system 99% cpu 6.199 total
|   (46 to 61 fps)
* Merge commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b' (Michael Niedermayer, 2014-01-09)
|\
| |   * commit 'b0be1ae792ac8bbfb0fc7b9b9cb39eaf0feb489b':
| |     x86: avcodec: Add a bunch of missing #includes for av_cold
| |
| |   Merged-by: Michael Niedermayer <michaelni@gmx.at>
| * x86: avcodec: Add a bunch of missing #includes for av_cold (Diego Biurrun, 2014-01-09)
| |
| * lavc: VP9 decoder (Ronald S. Bultje, 2013-11-15)
| |   Originally written by Ronald S. Bultje <rsbultje@gmail.com> and Clément Bœsch <u@pkh.me>
| |
| |   Further contributions by:
| |   Anton Khirnov <anton@khirnov.net>
| |   Diego Biurrun <diego@biurrun.de>
| |   Luca Barbato <lu_zero@gentoo.org>
| |   Martin Storsjö <martin@martin.st>
| |
| |   Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
| |   Signed-off-by: Anton Khirnov <anton@khirnov.net>
* vp9/x86: idct_32x32_add_ssse3. (Ronald S. Bultje, 2014-01-07)
|   Sub-IDCTs will follow later. ped1080p.webm goes from 9.295s to 8.191s (13.5% faster).
|
|   The IDCT itself goes from 4372 (intra) or 4337 (inter) to 403 (intra) or 329 (inter) cycles for the DC-only form, 23755 (intra) or 23723 (inter) to 3497 (intra) or 3607 (inter) cycles for the no-DC form, which averages from 23393 (intra) or 16612 (inter) to 3449 (intra) or 2392 (inter) for all 32x32s together, i.e. about 7x faster (all tests done on ped1080p.webm).
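A "percent faster" figure can mean two different things, so it is worth checking which one the 13.5% above corresponds to. The sketch below computes both readings from the two decode times quoted in the commit; the function names are hypothetical. The quoted figure appears to match the throughput gain (old time divided by new time), not the share of wall-clock time saved.

```python
def throughput_gain_percent(old_seconds: float, new_seconds: float) -> float:
    """How much more work per second the new code does, in percent."""
    return (old_seconds / new_seconds - 1.0) * 100.0


def time_saved_percent(old_seconds: float, new_seconds: float) -> float:
    """How much shorter the new run is relative to the old run, in percent."""
    return (old_seconds - new_seconds) / old_seconds * 100.0


# Decode times quoted in the commit message, in seconds.
OLD_SECONDS, NEW_SECONDS = 9.295, 8.191

gain = round(throughput_gain_percent(OLD_SECONDS, NEW_SECONDS), 1)
saved = round(time_saved_percent(OLD_SECONDS, NEW_SECONDS), 1)
print(f"throughput gain: {gain}%, time saved: {saved}%")
```

The two readings diverge more as the speedup grows, so stating which one is meant avoids ambiguity when comparing commits.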
* vp9/x86: 16px MC functions (64bit only). (Ronald S. Bultje, 2013-12-26)
|   Cycle counts for large MCs (old -> new on ped1080p.webm, mx!=0&&my!=0):
|   16x8: 876 -> 870 (0.7%)
|   16x16: 1444 -> 1435 (0.7%)
|   16x32: 2784 -> 2748 (1.3%)
|   32x16: 2455 -> 2349 (4.5%)
|   32x32: 4641 -> 4084 (13.6%)
|   32x64: 9200 -> 7834 (17.4%)
|   64x32: 8980 -> 7197 (24.8%)
|   64x64: 17330 -> 13796 (25.6%)
|
|   Total decoding time goes from 9.326sec to 9.182sec.
* vp9/x86: idct_add_16x16_ssse3. (Ronald S. Bultje, 2013-12-14)
|   Currently only dc-only and full 16x16. Other subforms will follow in the near future. Total decoding time of ped1080p.webm goes from 9.7 to 9.3 seconds. DC-only goes from 957 -> 131 cycles, and the full IDCT goes from ~4050 to ~745 cycles.
* avcodec/x86/vp9dsp: use EXTERNAL_* macros. (Clément Bœsch, 2013-11-16)
|   Original fix by one of these developers:
|   Anton Khirnov <anton@khirnov.net>
|   Diego Biurrun <diego@biurrun.de>
|   Luca Barbato <lu_zero@gentoo.org>
|   Martin Storsjö <martin@martin.st>
|
|   See 97962b2 / 72ca830
|
|   Personal guess is Diego Biurrun.
* avcodec/vp9: add ff_vp9_idct_idct_{4x4,8x8}_ssse3(). (Clément Bœsch, 2013-11-05)
|   1789 decicycles in idct_idct_4x4_add_c, 262136 runs, 8 skips
|   1839 decicycles in idct_idct_4x4_add_c, 524270 runs, 18 skips
|   1864 decicycles in idct_idct_4x4_add_c, 1048548 runs, 28 skips
|
|   529 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 262138 runs, 6 skips
|   516 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 524282 runs, 6 skips
|   474 decicycles in ff_vp9_idct_idct_4x4_add_ssse3, 1048565 runs, 11 skips
|
|   (~3.9x faster)
|
|   7726 decicycles in idct_idct_8x8_add_c, 1048433 runs, 143 skips
|   7732 decicycles in idct_idct_8x8_add_c, 2096882 runs, 270 skips
|   7731 decicycles in idct_idct_8x8_add_c, 4193772 runs, 532 skips
|
|   1145 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 1048549 runs, 27 skips
|   1137 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 2097097 runs, 55 skips
|   1086 decicycles in ff_vp9_idct_idct_8x8_add_ssse3, 4194188 runs, 116 skips
|
|   (~7.1x faster)
|
|   Overall decode time before commit:
|   16.48s user 0.03s system 99% cpu 16.526 total
|   16.54s user 0.01s system 99% cpu 16.566 total
|   16.46s user 0.03s system 99% cpu 16.511 total
|
|   Overall decode time after commit:
|   16.34s user 0.02s system 99% cpu 16.378 total
|   16.28s user 0.02s system 99% cpu 16.315 total
|   16.32s user 0.03s system 99% cpu 16.366 total
|
|   Tested on i7 920 with 40s 1080p footage.
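The `~3.9x` and `~7.1x` figures above can be reproduced from the decicycle counts. The sketch below divides each C measurement by the SSSE3 measurement reported in the same position; the pairing by position and the names `MEASUREMENTS` and `speedups` are assumptions made for illustration. Under that pairing, the quoted figures appear to correspond to the last (longest-running) measurement of each block size.

```python
# Decicycle counts copied from the commit message: (C version, SSSE3 version),
# listed in the order they were reported.
MEASUREMENTS = {
    "4x4": [(1789, 529), (1839, 516), (1864, 474)],
    "8x8": [(7726, 1145), (7732, 1137), (7731, 1086)],
}


def speedups(pairs):
    """Return the C/SSSE3 cycle ratio for each (c, simd) pair, rounded to 0.1."""
    return [round(c / simd, 1) for c, simd in pairs]


for size, pairs in MEASUREMENTS.items():
    ratios = speedups(pairs)
    print(f"{size}: speedups {ratios}, last run {ratios[-1]}x")
```

Because the ratio is unit-free, it does not matter that the counts are in decicycles rather than cycles.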
* Full-pixel MC functions. (Ronald S. Bultje, 2013-10-02)
|   Decoding time of ped1080p.webm goes from 11.3sec to 11.1sec.
* VP9 MC (ssse3) optimizations. (Ronald S. Bultje, 2013-10-02)
    Decoding time of ped1080p.webm goes from 20.7sec to 11.3sec.