Faster CRC checks to speed up codecs
Written by Robert
Sunday, 29 April 2018 22:33
So, I kind of stumbled into this, but always being on the lookout for possible optimizations, I simply had to explore it... tl;dr: I accelerated checksum calculations and thus encoding times of LAME, FLAC, Ogg and Monkey's Audio using an optimized CRC algorithm. You will find the patches at the end of this post; they will be part of the next fre:ac 1.1 alpha release.

Calculating Xing/LAME header CRCs

Working on the LAME MP3 implementation of my SuperFast technology, I came across the need to calculate CRC checksums. Every MP3 created by LAME has a Xing or LAME VBR header at the beginning. It contains index points into the MP3 as well as information about duration and gapless playback. At the end of this header, there are two CRC checksums, one for the MP3 bitstream and one for the header itself. As the bitstream repacker used in SuperFast LAME changes the MP3's internal structure, the Xing/LAME header's CRC values need to be updated afterwards.

I started with a simple implementation of the CRC16 algorithm that I wrote for the smooth Class Library. This created a small delay at the end of each conversion when the CRC for the MP3 file is updated. Not a big deal for the usually small MP3s weighing in at 3-4 MB. With larger files, however, like when converting a whole album to a single output file, it became painful. The CRC calculation added a delay of half a second for a 60 MB file on my i7-6900K system; on slower systems the delay would be considerably longer.

Steps to optimize the calculation

The first thing I tried was using compiler optimizations for the CRC routines (GCC's -O3 instead of -Os). This brought the delay down to about a quarter second. Still too much for my taste, though.

I then started looking for optimized CRC algorithms and found Matt Stancliff's crcspeed repository. It is based on an algorithm developed by Intel that uses additional lookup tables to enable processing of multiple input bytes in a single step. There are different variants of this algorithm in circulation, processing different numbers of bytes in each step, but it is generally called slicing-by-X (where X is usually 2, 4, 8 or 16).

I updated my CRC implementation to use the slicing algorithm and did some measurements. The slicing-by-8 variant turned out to be roughly 10 times faster than my original version and 5 times faster than the GCC -O3 compiled one. There was very little additional speedup when using slicing-by-12 (which I found to be the fastest) or slicing-by-16, so I decided to stick with slicing-by-8 as a good compromise between speed and memory requirements.

Using the slicing-by-8 algorithm reduced the delay at the end of the 60 MB MP3 conversion to just a few tens of milliseconds. But I did not stop there...
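To illustrate the idea, here is a minimal sketch of a slicing-by-8 update loop, written for a reflected CRC-32 with precomputed tables. The function and table names are mine and only illustrative, and the CRC widths, polynomials and init/final conventions differ between the codecs mentioned here, so treat this as a sketch of the technique rather than a copy of any of the patches:

    #include <stddef.h>
    #include <stdint.h>

    /* Precomputed once at startup (see the table setup further below):
     * crc_table[0] is the classic byte-at-a-time table, crc_table[k][b]
     * holds the CRC of byte b followed by k zero bytes. */
    extern uint32_t crc_table[8][256];

    /* Process 8 input bytes per iteration instead of one. Bytes are
     * assembled explicitly, so the code behaves the same on little- and
     * big-endian CPUs. Initial value and final XOR are left to the caller,
     * as they differ between CRC variants. */
    uint32_t crc32_update_slice8(uint32_t crc, const uint8_t *data, size_t length)
    {
        while (length >= 8)
        {
            uint32_t lo = crc ^ ((uint32_t) data[0]       |
                                 (uint32_t) data[1] <<  8 |
                                 (uint32_t) data[2] << 16 |
                                 (uint32_t) data[3] << 24);
            uint32_t hi =        (uint32_t) data[4]       |
                                 (uint32_t) data[5] <<  8 |
                                 (uint32_t) data[6] << 16 |
                                 (uint32_t) data[7] << 24;

            /* Still one table lookup per byte, but the eight lookups are
             * independent of each other and share a single round of XORs. */
            crc = crc_table[7][ lo        & 0xFF] ^
                  crc_table[6][(lo >>  8) & 0xFF] ^
                  crc_table[5][(lo >> 16) & 0xFF] ^
                  crc_table[4][(lo >> 24) & 0xFF] ^
                  crc_table[3][ hi        & 0xFF] ^
                  crc_table[2][(hi >>  8) & 0xFF] ^
                  crc_table[1][(hi >> 16) & 0xFF] ^
                  crc_table[0][(hi >> 24) & 0xFF];

            data   += 8;
            length -= 8;
        }

        /* Handle the remaining 0..7 bytes with the classic algorithm. */
        while (length--)
            crc = (crc >> 8) ^ crc_table[0][(crc ^ *data++) & 0xFF];

        return crc;
    }

The speed-up comes from the main loop carrying only one dependency chain per eight bytes: the individual lookups do not depend on each other, so the CPU can overlap them instead of waiting for the updated CRC after every single byte.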
Looking further

So, if I have to calculate CRC checksums for the Xing/LAME header, LAME itself has to do the same. You just don't notice a delay, because the calculation is not done all at once at the end, but spread over the whole encoding process. But does LAME use an optimized CRC implementation? As it turned out, no.

I updated the LAME CRC routines with the slicing-by-8 algorithm and got a speed-up of only 0.5%. Not much, but I wondered whether other codecs (especially lossless ones, which generate more data) might benefit more. I looked further and found non-optimal CRC implementations in FLAC, Ogg (used for Opus, Vorbis and other codecs) and Monkey's Audio. Replacing them with the optimized algorithm yielded results similar to LAME for the lossy formats. The lossless formats, however, benefit more from the optimization and are sped up by about 5% due to the larger amount of data being generated. When using Ogg FLAC, the speed-up is roughly 10%, as CRCs are calculated for both the FLAC audio frames and the Ogg container pages.

So we get up to 5% speed-up in the usual case and around 10% improvement for the Ogg FLAC format, all by simply replacing the CRC algorithm with an optimized version.

Technical considerations

The original Intel algorithm and Matt Stancliff's version require separate implementations for big-endian and little-endian CPUs. I converted the algorithm to an endian-independent form, i.e. only one variant for all processors. I did not measure any significant speed difference after making the code endian-independent when compiling with optimizations turned on.

My implementation uses static lookup tables for LAME, FLAC and Ogg. This blows up the code size a bit and I would have preferred calculating the tables on the fly on first use. That is difficult to get right in a portable, thread-safe way in plain C, though, so on-the-fly initialization is only used for Monkey's Audio, which is written in C++ (allowing dynamic initialization of static data).
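For reference, this is roughly how the extended lookup tables can be derived from the ordinary byte-wise table. The reflected CRC-32 polynomial and the names are again only illustrative, not taken from the actual patches:

    #include <stdint.h>

    uint32_t crc_table[8][256];

    /* Fill the slice tables once. crc_table[0] is the usual byte-at-a-time
     * table; crc_table[k][b] is the CRC of byte b followed by k zero bytes.
     * The reflected CRC-32 polynomial 0xEDB88320 is only an example. */
    void crc_init_tables(void)
    {
        for (int b = 0; b < 256; b++)
        {
            uint32_t crc = (uint32_t) b;

            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;

            crc_table[0][b] = crc;
        }

        for (int k = 1; k < 8; k++)
            for (int b = 0; b < 256; b++)
                crc_table[k][b] = (crc_table[k - 1][b] >> 8) ^
                                  crc_table[0][crc_table[k - 1][b] & 0xFF];
    }

In C++, an initializer like this can be run from the constructor of a static object before main() starts, which is the kind of dynamic initialization of static data mentioned above; in plain C, the precomputed tables are simply emitted as static arrays instead.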
Speed gains

Here are some numbers showing the relative speed gains when encoding and decoding with different codecs (all used with default settings):

* Opus and Vorbis themselves are not optimized, but use the optimized Ogg container library.

The patches

Here are my patches to update the mentioned codecs' CRC calculations to the optimized slicing algorithm:
Update: The Monkey's Audio patch has been integrated into the official Monkey's Audio 4.34 release.

Update 2: The Ogg and FLAC patches have been merged into the upstream repositories and will be part of the next official releases. Here is a proof-of-concept FLAC build for Win64 for everyone to try out: flac-1.3.2-fastcrc-win64.zip.

The patched codecs will be used in the next fre:ac 1.1 alpha release and I will contact the maintainers of these projects to request integration of the patches into their official releases.