summaryrefslogtreecommitdiff
path: root/libavcodec
Commit message (Collapse)AuthorAge
* lavc: use av_cpu_max_align() instead of hardcoding alignment requirementsAnton Khirnov2017-02-11
|
* arm: vp9lpf: Use orrs instead of orr+cmpMartin Storsjö2017-02-11
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* arm/aarch64: vp9lpf: Calculate !hev directlyMartin Storsjö2017-02-11
| | | | | | | | | | | | | | | | | | | | Previously we first calculated hev, and then negated it. Since we were able to schedule the negation in the middle of another calculation, we don't see any gain in all cases. Before: Cortex A7 A8 A9 A53 A53/AArch64 vp9_loop_filter_v_4_8_neon: 147.0 129.0 115.8 89.0 88.7 vp9_loop_filter_v_8_8_neon: 242.0 198.5 174.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 500.0 419.5 382.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 971.2 825.5 731.5 579.0 453.0 After: vp9_loop_filter_v_4_8_neon: 143.0 127.7 114.8 88.0 87.7 vp9_loop_filter_v_8_8_neon: 241.0 197.2 173.7 140.0 136.7 vp9_loop_filter_v_16_8_neon: 497.0 419.5 379.7 293.0 275.7 vp9_loop_filter_v_16_16_neon: 965.2 818.7 731.4 579.0 452.0 Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrollingMartin Storsjö2017-02-11
| | | | | | | | | | | | | This work is sponsored by, and copyright, Google. Before: Cortex A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 180.2 vp9_inv_dct_dct_32x32_sub1_add_neon: 475.3 Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrollingMartin Storsjö2017-02-11
| | | | | | | | | | | | | This work is sponsored by, and copyright, Google. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 273.0 189.5 211.7 235.8 vp9_inv_dct_dct_32x32_sub1_add_neon: 752.0 459.2 862.2 553.9 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 226.5 145.0 225.1 171.8 vp9_inv_dct_dct_32x32_sub1_add_neon: 721.2 415.7 727.6 475.0 Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filterMartin Storsjö2017-02-11
| | | | | | No measured speedup on a Cortex A53, but other cores might benefit. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filterMartin Storsjö2017-02-11
| | | | | | | | | Before: Cortex A7 A8 A9 A53 vp9_put_8tap_smooth_4h_neon: 378.1 273.2 340.7 229.5 After: vp9_put_8tap_smooth_4h_neon: 352.1 222.2 290.5 229.5 Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: vp9mc: Simplify the extmla macro parametersMartin Storsjö2017-02-11
| | | | | | | | | | | | Fold the field lengths into the macro. This makes the macro invocations much more readable, when the lines are shorter. This also makes it easier to use only half the registers within the macro. Signed-off-by: Martin Storsjö <martin@martin.st>
* utvideodec: Add a missing includeMartin Storsjö2017-02-10
| | | | | | This was missing from 77c23704c76, fixing building. Signed-off-by: Martin Storsjö <martin@martin.st>
* nvenc: make gpu indices independent of supported capabilitiesTimo Rothenpieler2017-02-09
| | | | | | Do not allocate a CUDA context for every available gpu. Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
* avcodec: Mark some codecs with threadsafe init as suchDerek Buitenhuis2017-02-09
| | | | | Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com> Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
* aarch64: vp9itxfm: Fix incorrect vertical alignmentMartin Storsjö2017-02-09
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: vp9itxfm: Update a comment to refer to a register with a different nameMartin Storsjö2017-02-09
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: vp9itxfm: Use the right lane sizes in 8x8 for improved readabilityMartin Storsjö2017-02-09
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: vp9itxfm: Use a single lane ld1 instead of ld1r where possibleMartin Storsjö2017-02-09
| | | | | | | | | The ld1r is a leftover from the arm version, where this trick is beneficial on some cores. Use a single-lane load where we don't need the semantics of ld1r. Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 ↵Martin Storsjö2017-02-09
| | | | | | function Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Share instructions for loading idct coeffs in the 8x8 functionMartin Storsjö2017-02-09
| | | | Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32Martin Storsjö2017-02-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 14740 bytes to 24292 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7 vp9_inv_dct_dct_16x16_sub2_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub8_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub12_add_neon: 1387.4 vp9_inv_dct_dct_16x16_sub16_add_neon: 1387.6 vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 5198.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 5198.6 vp9_inv_dct_dct_32x32_sub8_add_neon: 5196.3 vp9_inv_dct_dct_32x32_sub12_add_neon: 6183.4 vp9_inv_dct_dct_32x32_sub16_add_neon: 6174.3 vp9_inv_dct_dct_32x32_sub20_add_neon: 7151.4 vp9_inv_dct_dct_32x32_sub24_add_neon: 7145.3 vp9_inv_dct_dct_32x32_sub28_add_neon: 8119.3 vp9_inv_dct_dct_32x32_sub32_add_neon: 8118.7 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 236.7 vp9_inv_dct_dct_16x16_sub2_add_neon: 640.8 vp9_inv_dct_dct_16x16_sub4_add_neon: 639.0 vp9_inv_dct_dct_16x16_sub8_add_neon: 842.0 vp9_inv_dct_dct_16x16_sub12_add_neon: 1388.3 vp9_inv_dct_dct_16x16_sub16_add_neon: 1389.3 vp9_inv_dct_dct_32x32_sub1_add_neon: 554.1 vp9_inv_dct_dct_32x32_sub2_add_neon: 3685.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 3685.1 vp9_inv_dct_dct_32x32_sub8_add_neon: 3684.4 vp9_inv_dct_dct_32x32_sub12_add_neon: 5312.2 vp9_inv_dct_dct_32x32_sub16_add_neon: 5315.4 vp9_inv_dct_dct_32x32_sub20_add_neon: 7154.9 vp9_inv_dct_dct_32x32_sub24_add_neon: 7154.5 vp9_inv_dct_dct_32x32_sub28_add_neon: 8126.6 vp9_inv_dct_dct_32x32_sub32_add_neon: 8127.2 Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possibleMartin Storsjö2017-02-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. This avoids loading and calculating coefficients that we know will be zero, and avoids filling the temp buffer with zeros in places where we know the second pass won't read. This gives a pretty substantial speedup for the smaller subpartitions. The code size increases from 12388 bytes to 19784 bytes. The idct16/32_end macros are moved above the individual functions; the instructions themselves are unchanged, but since new functions are added at the same place where the code is moved from, the diff looks rather messy. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub1_add_neon: 273.0 189.5 212.0 235.8 vp9_inv_dct_dct_16x16_sub2_add_neon: 2102.1 1521.7 1736.2 1265.8 vp9_inv_dct_dct_16x16_sub4_add_neon: 2104.5 1533.0 1736.6 1265.5 vp9_inv_dct_dct_16x16_sub8_add_neon: 2484.8 1828.7 2014.4 1506.5 vp9_inv_dct_dct_16x16_sub12_add_neon: 2851.2 2117.8 2294.8 1753.2 vp9_inv_dct_dct_16x16_sub16_add_neon: 3239.4 2408.3 2543.5 1994.9 vp9_inv_dct_dct_32x32_sub1_add_neon: 758.3 456.7 864.5 553.9 vp9_inv_dct_dct_32x32_sub2_add_neon: 10776.7 7949.8 8567.7 6819.7 vp9_inv_dct_dct_32x32_sub4_add_neon: 10865.6 8131.5 8589.6 6816.3 vp9_inv_dct_dct_32x32_sub8_add_neon: 12053.9 9271.3 9387.7 7564.0 vp9_inv_dct_dct_32x32_sub12_add_neon: 13328.3 10463.2 10217.0 8321.3 vp9_inv_dct_dct_32x32_sub16_add_neon: 14176.4 11509.5 11018.7 9062.3 vp9_inv_dct_dct_32x32_sub20_add_neon: 15301.5 12999.9 11855.1 9828.2 vp9_inv_dct_dct_32x32_sub24_add_neon: 16482.7 14931.5 12650.1 10575.0 vp9_inv_dct_dct_32x32_sub28_add_neon: 17589.5 15811.9 13482.8 11333.4 vp9_inv_dct_dct_32x32_sub32_add_neon: 18696.2 17049.2 14355.6 12089.7 After: vp9_inv_dct_dct_16x16_sub1_add_neon: 273.0 189.5 211.7 235.8 vp9_inv_dct_dct_16x16_sub2_add_neon: 1203.5 998.2 1035.3 763.0 vp9_inv_dct_dct_16x16_sub4_add_neon: 1203.5 998.1 1035.5 760.8 vp9_inv_dct_dct_16x16_sub8_add_neon: 1926.1 1610.6 1722.1 1271.7 vp9_inv_dct_dct_16x16_sub12_add_neon: 2873.2 2129.7 2285.1 1757.3 vp9_inv_dct_dct_16x16_sub16_add_neon: 3221.4 2520.3 2557.6 2002.1 vp9_inv_dct_dct_32x32_sub1_add_neon: 753.0 457.5 866.6 554.6 vp9_inv_dct_dct_32x32_sub2_add_neon: 7554.6 5652.4 6048.4 4920.2 vp9_inv_dct_dct_32x32_sub4_add_neon: 7549.9 5685.0 6046.9 4925.7 vp9_inv_dct_dct_32x32_sub8_add_neon: 8336.9 6704.5 6604.0 5478.0 vp9_inv_dct_dct_32x32_sub12_add_neon: 10914.0 9777.2 9240.4 7416.9 vp9_inv_dct_dct_32x32_sub16_add_neon: 11859.2 11223.3 9966.3 8095.1 vp9_inv_dct_dct_32x32_sub20_add_neon: 15237.1 13029.4 11838.3 9829.4 vp9_inv_dct_dct_32x32_sub24_add_neon: 16293.2 14379.8 12644.9 10572.0 vp9_inv_dct_dct_32x32_sub28_add_neon: 17424.3 15734.7 13473.0 11326.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.3 17457.0 14298.6 12080.0 Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 ↵Martin Storsjö2017-02-09
| | | | | | | | | function This allows reusing the macro for a separate implementation of the pass2 function. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Move the load_add_store macro out from the itxfm16 pass2 functionMartin Storsjö2017-02-09
| | | | | | | This allows reusing the macro for a separate implementation of the pass2 function. Signed-off-by: Martin Storsjö <martin@martin.st>
* aarch64: vp9itxfm: Make the larger core transforms standalone functionsMartin Storsjö2017-02-09
| | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/aarch64/vp9itxfm_neon.o from 19496 to 14740 bytes. This gives a small slowdown of a couple of tens of cycles, but makes it more feasible to add more optimized versions of these transforms. Before: vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7 vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.2 vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0 vp9_inv_dct_dct_32x32_sub32_add_neon: 8095.7 After: vp9_inv_dct_dct_16x16_sub4_add_neon: 1051.0 vp9_inv_dct_dct_16x16_sub16_add_neon: 1390.1 vp9_inv_dct_dct_32x32_sub4_add_neon: 5199.9 vp9_inv_dct_dct_32x32_sub32_add_neon: 8125.8 Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Make the larger core transforms standalone functionsMartin Storsjö2017-02-09
| | | | | | | | | | | | | | | | | | | | | | | | | This work is sponsored by, and copyright, Google. This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from 15324 to 12388 bytes. This gives a small slowdown of a couple tens of cycles, up to around 150 cycles for the full case of the largest transform, but makes it more feasible to add more optimized versions of these transforms. Before: Cortex A7 A8 A9 A53 vp9_inv_dct_dct_16x16_sub4_add_neon: 2063.4 1516.0 1719.5 1245.1 vp9_inv_dct_dct_16x16_sub16_add_neon: 3279.3 2454.5 2525.2 1982.3 vp9_inv_dct_dct_32x32_sub4_add_neon: 10750.0 7955.4 8525.6 6754.2 vp9_inv_dct_dct_32x32_sub32_add_neon: 18574.0 17108.4 14216.7 12010.2 After: vp9_inv_dct_dct_16x16_sub4_add_neon: 2060.8 1608.5 1735.7 1262.0 vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.2 2443.5 2546.1 1999.5 vp9_inv_dct_dct_32x32_sub4_add_neon: 10682.0 8043.8 8581.3 6810.1 vp9_inv_dct_dct_32x32_sub32_add_neon: 18522.4 17277.4 14286.7 12087.9 Signed-off-by: Martin Storsjö <martin@martin.st>
* omx: Use the EOS flag to handle flushing at the endMartin Storsjö2017-02-08
| | | | | | | | This avoids having to count the number of frames sent to the codec and the number of output packets received; instead just wait until the encoder returns a buffer with the EOS flag set. Signed-off-by: Martin Storsjö <martin@martin.st>
* Use bitstream_init8() where appropriateDiego Biurrun2017-02-07
|
* wma: Convert to the new bitstream readerAlexandra Hájková2017-02-06
|
* aarch64: vp9itxfm: Restructure the idct32 store macrosMartin Storsjö2017-02-05
| | | | | | | | | This avoids concatenation, which can't be used if the whole macro is wrapped within another macro. This is also arguably more readable. Signed-off-by: Martin Storsjö <martin@martin.st>
* arm: vp9itxfm: Avoid .irp when it doesn't save any linesMartin Storsjö2017-02-05
| | | | | | This makes it more readable. Signed-off-by: Martin Storsjö <martin@martin.st>
* asm: Consistently uppercase SECTION markersDiego Biurrun2017-02-03
|
* svq3: Convert to the new bitstream readerAlexandra Hájková2017-02-02
|
* lavc: deprecate refcounted_frames fieldwm42017-02-01
| | | | | | | | | No deprecation guards, because the old decode API (for which this field is needed) doesn't have any either. This field should be removed together with the old decode calls. Signed-off-by: Anton Khirnov <anton@khirnov.net>
* Mark some arrays that never change as const.Anton Khirnov2017-02-01
|
* ffv1: Convert to the new bitstream readerAlexandra Hájková2017-01-31
|
* h261dec: Convert to the new bitstream readerAlexandra Hájková2017-01-31
|
* shorten: Convert to the new bitstream readerAlexandra Hájková2017-01-31
|
* ralf: Convert to the new bitstream readerAlexandra Hájková2017-01-31
|
* loco: Convert to the new bitstream readerAlexandra Hájková2017-01-31
|
* fic: Convert to the new bitstream readerAlexandra Hájková2017-01-31
|
* dirac: Convert to the new bitstream readerAlexandra Hájková2017-01-31
|
* cavs: Convert to the new bitstream readerAlexandra Hájková2017-01-31
|
* aic: Convert to the new bitstream readerAlexandra Hájková2017-01-31
|
* golomb: Convert to the new bitstream readerDiego Biurrun2017-01-31
|
* pgssubdec: reset rle_data_len/rle_remaining_len on allocation errorAndreas Cadhalpun2017-01-31
| | | | | | | | The code relies on their validity and otherwise can try to access a NULL object->rle pointer, causing segmentation faults. Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com> Signed-off-by: Diego Biurrun <diego@biurrun.de>
* vaapi_encode: Add VP8 supportMark Thompson2017-01-30
|
* vaapi_encode: Pass framerate parameters to driverMark Thompson2017-01-30
| | | | | | | Only do this when building for a recent VAAPI version - initial driver implementations were confused about the interpretation of the framerate field, but hopefully this will be consistent everywhere once 0.40.0 is released.
* vaapi_h264: Enable VBR modeMark Thompson2017-01-30
| | | | | | | | Default to using VBR when a target bitrate is set, unless the max rate is also set and matches the target. Changes to the Intel driver mean that min_qp is also respected in this case, so set a codec default to unset the value rather than using the current default inherited from the MPEG-4 part 2 encoder.
* vaapi_encode: Support VBR modeMark Thompson2017-01-30
| | | | | | This includes a backward-compatibility hack to choose CBR anyway on old drivers which have no CBR support, so that existing programs will continue to work their options now map to VBR.
* vaapi_encode: Add MPEG-2 supportMark Thompson2017-01-29
|
* tak: Convert to the new bitstream readerAlexandra Hájková2017-01-25
|
* magicyuv: Convert to the new bitstream readerDiego Biurrun2017-01-25
|