
Are both of these algorithms valid implementations of LZSS?

I am reverse engineering things and I often stumble upon various decompression algorithms. Most of the time, it's LZSS, just as Wikipedia describes it:

  1. Initialize dictionary of size 2^n
  2. While output is less than known output size:
    1. Read flag
    2. If the flag is set, output literal byte (and append it at the end of dictionary)
    3. If the flag is not set:
      1. Read length and look behind position
      2. Transcribe length bytes from the dictionary at look behind position to the output and at the end of dictionary.

The thing is that the implementations follow two schools of how to encode the flag. The first one treats the input as a sequence of bits:

  1. (...)
    1. Read flag as one bit
    2. If it's set, read literal byte as 8 unaligned bits
    3. If it's not set, read length and position as n and m unaligned bits

This involves lots of bit shift operations.
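To make the first school concrete, here is a minimal sketch of such a decoder. The details are assumptions for illustration, not taken from any particular format: bits are read MSB-first, a set flag means a literal, the length field is 4 bits (stored minus a minimum match of 3), the position field is 12 bits (stored as distance minus 1), and the output buffer doubles as the dictionary.

```python
class BitReader:
    """Reads bits MSB-first from a bytes object (bit order is an assumption)."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0  # current position, counted in bits

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def lzss_decode_bitstream(data: bytes, out_size: int,
                          len_bits: int = 4, dist_bits: int = 12,
                          min_match: int = 3) -> bytes:
    """Hypothetical bit-oriented LZSS decoder; field widths are assumed."""
    reader = BitReader(data)
    out = bytearray()
    while len(out) < out_size:
        if reader.read(1):
            # Flag set: the next 8 (unaligned) bits are a literal byte.
            out.append(reader.read(8))
        else:
            # Flag clear: read length, then look-behind distance.
            length = reader.read(len_bits) + min_match
            distance = reader.read(dist_bits) + 1
            # Copy byte by byte so overlapping matches work.
            for _ in range(length):
                out.append(out[-distance])
    return bytes(out[:out_size])
```

Every field goes through `BitReader.read`, which is where the shifting cost comes from: nothing after the first flag is guaranteed to start on a byte boundary.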

The other one saves a little CPU time by using bitwise operations only for flag storage, whereas literal bytes, length and position are derived from aligned input bytes. To achieve this, it breaks the linearity by fetching a few flags in advance. So the algorithm is modified like this:

  1. (...)
    1. Read 8 flags at once by reading one byte. For each of these 8 flags:
      1. If it's set, read literal as aligned byte
      2. If it's not set, read length and position as aligned bytes (deriving the specific values from the fetched bytes involves some bit operations, but it's nowhere near as expensive as the first version.)
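A minimal sketch of this second school, again under assumed details: the flag byte is consumed LSB-first, a set flag means a literal, and a match is two bytes holding a 12-bit distance (stored minus 1) and a 4-bit length (stored minus a minimum match of 3). Real formats differ in exactly these choices.

```python
def lzss_decode_bytewise(data: bytes, out_size: int, min_match: int = 3) -> bytes:
    """Hypothetical byte-aligned LZSS decoder; the match layout is assumed."""
    out = bytearray()
    pos = 0
    while len(out) < out_size:
        flags = data[pos]  # one byte carries the next 8 flags
        pos += 1
        for bit in range(8):
            if len(out) >= out_size:
                break  # remaining flag bits are padding
            if (flags >> bit) & 1:
                # Flag set: literal is a whole, aligned byte.
                out.append(data[pos])
                pos += 1
            else:
                # Flag clear: two aligned bytes hold distance and length.
                b0, b1 = data[pos], data[pos + 1]
                pos += 2
                distance = (b0 | ((b1 & 0xF0) << 4)) + 1
                length = (b1 & 0x0F) + min_match
                # Copy byte by byte so overlapping matches work.
                for _ in range(length):
                    out.append(out[-distance])
    return bytes(out[:out_size])
```

Only the flag test and the two-byte unpacking need bit operations; literals are plain byte reads.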

My question is: are these both valid LZSS implementations, or did I identify these algorithms wrong? Are there any known names for them?

They are effectively variants on LZSS, since both use one bit to decide on literal vs. match. More generally they are variants on LZ77.

Deflate is also a variant on LZ77, which does not use a whole bit for literal vs. match. Instead deflate has a single code for the combination of literals and lengths, so the code implicitly determines whether the next thing is a literal or a match. A length code is followed by a separate distance code.
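As an illustration of that combined code, here is a small sketch that classifies an already Huffman-decoded literal/length symbol. The symbol ranges and the base-length and extra-bit tables are the ones from RFC 1951; the function name and return shape are made up for this example.

```python
# Base match length and number of extra bits for symbols 257..285 (RFC 1951).
LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
               35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
LENGTH_EXTRA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0]


def classify_symbol(sym: int):
    """Hypothetical helper: tell what a literal/length symbol means."""
    if 0 <= sym < 256:
        return ("literal", sym)
    if sym == 256:
        return ("end_of_block", None)
    if sym <= 285:
        idx = sym - 257
        # A match: the decoder next reads LENGTH_EXTRA[idx] extra bits,
        # then a separate distance code.
        return ("match", (LENGTH_BASE[idx], LENGTH_EXTRA[idx]))
    raise ValueError(f"invalid literal/length symbol: {sym}")
```

There is no flag bit anywhere: the value of the decoded symbol alone says whether a literal or a match follows.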

lz4 (a specific algorithm, not a family) handles byte alignment in a different way, coding the number of literals, which is necessarily followed by a match (except in the last sequence of a block, which has only literals). The first byte (the token) holds the number of literals in its high four bits and part of the match length in its low four bits. The literals are byte aligned, as are the two-byte offset that follows the literals and any further bytes that extend the match length.
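For comparison, a minimal sketch of decoding a raw LZ4 block, following the published LZ4 block format as I understand it: a token byte with the literal count in the high nibble and the match length (minus 4) in the low nibble, optional length-extension bytes, the literals, then a two-byte little-endian offset. It does no bounds or validity checking and ignores the frame format, so treat it as an illustration only.

```python
def lz4_block_decode(data: bytes) -> bytes:
    """Illustrative LZ4 block decoder without any input validation."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        token = data[pos]
        pos += 1

        # High nibble: literal count; 15 means more length bytes follow.
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                extra = data[pos]
                pos += 1
                lit_len += extra
                if extra != 255:
                    break
        out += data[pos:pos + lit_len]
        pos += lit_len

        if pos >= len(data):
            break  # the last sequence carries literals only

        # Two-byte little-endian offset follows the literals.
        offset = data[pos] | (data[pos + 1] << 8)
        pos += 2

        # Low nibble: match length minus 4; 15 means more bytes follow.
        match_len = token & 0x0F
        if match_len == 15:
            while True:
                extra = data[pos]
                pos += 1
                match_len += extra
                if extra != 255:
                    break
        match_len += 4
        # Copy byte by byte so overlapping matches work.
        for _ in range(match_len):
            out.append(out[-offset])
    return bytes(out)
```

Everything here is read as whole bytes; the only bit operations are splitting the token into its two nibbles and assembling the offset.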
