Just a bit playing around with compression algorithms
Bytepair-Encoding is a pretty powerful and simple way to achieve good compression. Basically you’re replacing pairs consisting of two characters with one character which is not used within the input string. I ran into two problems with Bytepair-encoding.
Firstly, even if the pair is repeated three times, the resulting „compressed“ string is longer than the original string. That’s due to the applied dictionary plus control char. You need to say „aa“ was replaced by „x“. So your dictionary will eat 3 bytes per replacement (aax). To separate the dictionary from your string, you’ll add one control byte (aax@compressed-string) so if you feed it with something like: aaabcabce (9 bytes) you’ll get aaxabyycz@xzze (14 bytes – aa replaced by x, ab replaced by y, yc replaced by z – resulting string xzze). Which means, Bytepair-encoding is only efficient if there are „many“ repeated pairs (or I missed some trick).
Secondly, you might hit the problem of finding replacement characters. One byte (8 Bit) can represent a number of up to 256, which is why there are only 256 ASCII characters. If you keep in mind that some of them are already used within the document, you might want to use two bytes to support up to 2^16 possible replacement characters instead of 2^8. That, would make the replacement of „pairs“ useless, though. There’s simple no benefit in replacing two characters with two characters. It might work using UTF-8 assuming that 1 character = 2 Bytes – Then you would replace two characters (four bytes) with one character (two bytes).
With my first implementation (ascii) I was able to reduce a text of 152089 bytes to just 93463 Bytes (dictionary included). My second implementation, which would just take 4 characters and replace them with two characters using a pool of unused characters out of two bytes (65535 characters) has shown that I was able to reduce 152089 bytes to 100192 bytes. Processing this one with my ASCII implementation shows a further reduction to 91890 Bytes.
|testfile||original size||compressed size||compression ratio||space saving|
Testfiles taken from The Canterbury Corpus. Sizes in Bytes. Space savings in percent.