Indicating the end of a raw data chunk in an RLE algorithm that can contain all byte values

Question

I'm writing an RLE algorithm in C# that can work on any file as input. The approach to encoding I'm taking is as follows:

An RLE packet contains 1 byte for the length and 1 byte for the value. For example, if the byte 0xFF appeared 3 times in a row, 0x03 0xFF would be written to the file.

If storing the data raw would be more efficient, I write 0x00 as a marker that raw data follows. This works because the length of an RLE packet can never be zero. For example, if I wanted to add the raw bytes 0x53 0x2C 0x01 after the packet above, the compressed file would look like this:

0x03 0xFF 0x00 0x53 0x2C 0x01

However, a problem arises when trying to switch back to RLE packets. I can't use a marker byte the way I did for switching to raw data, because any byte value from 0x00 to 0xFF can occur in the input data. The decoder would mistake a data byte for the marker and corrupt everything after it.

How can I indicate the switch back to RLE packets when every possible byte value may also appear as raw data?

Here is my code if it helps:

private static void RunLengthEncode(ref byte[] bytes)
{
    // Create a list to store the bytes
    List<byte> output = new List<byte>();
    
    byte runLengthByte;
    int runLengthCounter = 0;

    // Start the first run at the first byte; the loop counts it on its first iteration
    runLengthByte = bytes[0];

    // For each byte in the input array...
    for (int i = 0; i < bytes.Length; i++)
    {
        // Extend the run only while the byte matches and the count still fits in one byte
        if (runLengthByte == bytes[i] && runLengthCounter < 255)
        {
            runLengthCounter++;
        }
        else 
        {
            // RLE packets under 3 should be written as raw data to avoid increasing the file size
            if (runLengthCounter < 3)
            {
                // Add a 0x00 to indicate raw data
                output.Add(0x00);

                // Add the bytes that were skipped while counting the run length
                for (int j = i - runLengthCounter; j < i; j++)
                {
                    output.Add(bytes[j]);
                }
            }
            else
            {
                // Add 2 bytes, one for the number of bytes and one for the value
                output.Add((byte)runLengthCounter);
                output.Add(runLengthByte);
            }

            runLengthCounter = 1;
            runLengthByte = bytes[i];
        }
            
        // Add the last bytes to the list when finishing
        if (i == bytes.Length - 1)
        {
            // Add 2 bytes, one for the number of bytes and one for the value
            output.Add((byte)runLengthCounter);
            output.Add(runLengthByte);
        }
    }

    // Set the bytes to the RLE encoded data
    bytes = output.ToArray();
}

Also, if you want to comment that RLE isn't very efficient for binary data, I know it isn't. This is a project I'm doing to implement many kinds of compression to learn about them, not for an actual product.

Any help would be appreciated! Thanks!

Answer 1

There are many ways to encode run lengths unambiguously. One simple way, described from the decoder's side: if you see two equal bytes in a row, then the next byte is a count of repeats of that byte after those first two, i.e. 0..255 additional repeats, so it encodes runs of 2..257. (There's no point in encoding runs of 0 or 1.) Every other byte is a literal, so raw data needs no marker or terminator at all.
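
As a rough illustration, the scheme above could be sketched in C# like this. The class and method names (`DoubledByteRle`, `Encode`, `Decode`) are made up for this example, and the choice to throw a `FormatException` on truncated input is an assumption, not part of the scheme itself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the "two equal bytes, then a count" scheme.
public static class DoubledByteRle
{
    // Single bytes are copied as-is. A run of 2..257 equal bytes is written as
    // the byte twice, followed by the number of additional repeats (0..255).
    public static byte[] Encode(byte[] input)
    {
        var output = new List<byte>();
        int i = 0;
        while (i < input.Length)
        {
            byte value = input[i];

            // Measure the run starting at i, capped at 257 (2 + 255).
            int run = 1;
            while (i + run < input.Length && input[i + run] == value && run < 257)
            {
                run++;
            }

            output.Add(value);
            if (run >= 2)
            {
                output.Add(value);
                output.Add((byte)(run - 2));
            }
            i += run;
        }
        return output.ToArray();
    }

    public static byte[] Decode(byte[] input)
    {
        var output = new List<byte>();
        int i = 0;
        while (i < input.Length)
        {
            byte value = input[i++];
            output.Add(value);

            // Two equal bytes in a row: the next byte is the extra repeat count.
            if (i < input.Length && input[i] == value)
            {
                output.Add(value);
                i++;
                if (i >= input.Length)
                {
                    // Assumed error handling for truncated input.
                    throw new FormatException("Missing count byte after a doubled byte.");
                }
                int extra = input[i++];
                for (int k = 0; k < extra; k++)
                {
                    output.Add(value);
                }
            }
        }
        return output.ToArray();
    }
}
```

With the bytes from the question, 0xFF 0xFF 0xFF 0x53 0x2C 0x01 encodes to 0xFF 0xFF 0x01 0x53 0x2C 0x01. The trade-off is that an isolated pair of equal bytes costs three bytes instead of two, so data with many short pairs can grow.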

solution1
0 2023-01-12 16:05:58