
Creating a Non-greedy LZW algorithm

I'm writing an IB Extended Essay in Computer Science and am thinking of using a non-greedy implementation of the LZW algorithm. I found the following links:

  1. https://pdfs.semanticscholar.org/4e86/59917a0cbc2ac033aced4a48948943c42246.pdf

  2. http://theory.stanford.edu/~matias/papers/wae98.pdf

I have been operating under the assumption that the algorithm described in paper 1 and the LZW-FP in paper 2 are essentially the same. Either way, tracing the pseudocode in paper 1 has been a painful experience that has yielded nothing, and in the words of my teacher it "is incredibly difficult to understand." If anyone can figure out how to trace it, or happens to have studied the algorithm before and knows how it works, that would be a great help.

Note: I refer to what you call "paper 1" as Horspool 1995 and "paper 2" as Matias et al 1998. I only looked at the LZW algorithm in Horspool 1995, so if you were referring to the LZSS algorithm this won't help you much.

My understanding is that Horspool's algorithm is what the authors of Matias et al 1998 call "LZW-FPA", which is different from what they call "LZW-FP"; the difference has to do with the way the algorithm decides which substrings to add to the dictionary. Since LZW-FP adds exactly the same substrings to the dictionary as LZW would add, LZW-FP cannot produce a longer compressed sequence for any string. LZW-FPA (and Horspool's algorithm) adds the successor string of the greedy match at each output cycle. That's not the same substring (because the greedy match doesn't start at the same point as it would in LZW), and therefore it is theoretically possible that it will produce a longer compressed sequence than LZW.

Horspool's algorithm is actually quite simple, but it suffers from the fact that there are several silly errors in the provided pseudocode. Implementing the algorithm is a good way of detecting and fixing these errors; I put an annotated version of the pseudocode below.

LZW-like algorithms decompose the input into a sequence of blocks. The compressor maintains a dictionary of available blocks (with associated codewords). Initially, the dictionary contains all single-character strings. It then steps through the input, at each point finding the longest prefix at that point which is in its dictionary. Having found that block, it outputs its codeword, and adds to the dictionary the block with the next input character appended. (Since the block found was the longest prefix in the dictionary, the block plus the next character cannot be in the dictionary.) It then advances over the block, and continues at the next input point (which is just before the last character of the block it just added to the dictionary).
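As a rough illustration of the greedy loop just described, here is a minimal Python sketch. The function name `lzw_compress` is mine, and seeding the dictionary with only the distinct symbols that occur in the input (rather than a fixed 256-entry byte alphabet) is an assumption made to keep the example small; it emits a list of integer codes rather than a packed bit stream.

```python
def lzw_compress(data):
    """Greedy LZW: at each point, emit the code of the longest dictionary match."""
    # Assumption: the initial dictionary holds the distinct symbols of the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(data)))}
    codes = []
    pos = 0
    while pos < len(data):
        # Extend the match one symbol at a time. This works because every
        # prefix of a dictionary entry is itself a dictionary entry.
        length = 1
        while pos + length < len(data) and data[pos:pos + length + 1] in dictionary:
            length += 1
        block = data[pos:pos + length]
        codes.append(dictionary[block])
        # Add the block extended by the next input symbol, if there is one.
        if pos + length < len(data):
            dictionary[block + data[pos + length]] = len(dictionary)
        pos += length
    return codes


print(lzw_compress("abababab"))  # [0, 1, 2, 4, 1]
```

With `a` = 0 and `b` = 1, the input `abababab` is parsed as `a | b | ab | aba | b`, and the entries `ab`, `ba`, `aba`, `abab` are added along the way as codes 2 to 5.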

Horspool's modification also finds the longest prefix at each point, and also adds that prefix extended by one character into the dictionary. But it does not immediately output that block. Instead, it considers prefixes of the greedy match, and for each one works out what the next greedy match would be. That gives it a candidate extent of two blocks; it chooses the extent with the best advance. In order to avoid using up too much time in this search, the algorithm is parameterised by the number of prefixes it will test, on the assumption that much shorter prefixes are unlikely to yield longer extents. (And Horspool provides some evidence for this heuristic, although you might want to verify that with your own experimentation.)

In Horspool's pseudocode, α is what I call the "candidate match" -- that is, the greedy match found at the previous step -- and βj is the greedy successor match for the input point after the j-th prefix of α. (Counting from the end, so β0 is precisely the greedy successor match of α, with the result that setting K to 0 will yield the LZW algorithm. I think Horspool mentions this fact somewhere.) L is just the length of α. The algorithm will end up using some prefix of α, possibly (usually) all of it.

Here's Horspool's pseudocode from Figure 2 with my annotations:

initialize dictionary D with all strings of length 1;
set α = the string in D that matches the first
        symbol of the input;
set L = length(α);
while more than L symbols of input remain do
begin
    // The new string α++head(β0) must be added to D here, rather
    // than where Horspool adds it. Otherwise, it is not available for the
    // search for a successor match. Of course, head(β0) is not meaningful here
    // because β0 doesn't exist yet, but it's just the symbol following α in
    // the input.
    for j := 0 to max(L-1,K) do
        // The above should be min(L - 1, K), not max.
        // (Otherwise, K would be almost irrelevant.)
        find βj, the longest string in D that matches
            the input starting L-j symbols ahead;
    add the new string α++head(β0) to D;
    // See above; the new string must be added before the search
    set j = value of j in range 0 to max(L-1,K)
            such that L - j + length(βj) is a maximum;
    // Again, min rather than max
    output the index in D of the string prefix(α,j);
    // Here Horspool forgets that j is the number of characters removed
    // from the end of α, not the number of characters in the desired prefix.
    // So j should be replaced with L - j
    advance j symbols through the input;
    // Again, the advance should be L - j, not j
    set α = βj;
    set L = length(α);
end;
output the index in D of string α;
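For anyone who would rather trace running code, the annotations above can be folded into a short Python sketch of the encoder. This is my reading of the corrected pseudocode, not code from Horspool's paper. The names `longest_match` and `horspool_compress` are hypothetical, ties between equally good extents are broken in favour of the smallest j (an assumption, since the pseudocode does not say), and the initial dictionary is seeded from the symbols of the input. The dictionary is returned only so that the parse can be checked; how a decoder rebuilds it is not shown here.

```python
def longest_match(data, start, dictionary):
    """Longest dictionary entry that is a prefix of data[start:]."""
    length = 1
    while start + length < len(data) and data[start:start + length + 1] in dictionary:
        length += 1
    return data[start:start + length]


def horspool_compress(data, k):
    """Encoder sketch following the annotated pseudocode; k is Horspool's K."""
    if not data:
        return [], {}
    # Assumption: the initial dictionary holds the distinct symbols of the input.
    dictionary = {ch: i for i, ch in enumerate(sorted(set(data)))}
    codes = []
    pos = 0
    alpha = longest_match(data, 0, dictionary)  # the candidate match
    while pos + len(alpha) < len(data):
        L = len(alpha)
        # Correction 1: add alpha plus the following symbol BEFORE the search.
        dictionary[alpha + data[pos + L]] = len(dictionary)
        # Correction 2: j ranges over 0..min(L-1, K), not max.
        best_j, best_beta, best_extent = 0, "", -1
        for j in range(min(L - 1, k) + 1):
            beta = longest_match(data, pos + L - j, dictionary)
            extent = L - j + len(beta)
            if extent > best_extent:  # strict '>' keeps the smallest j on ties
                best_j, best_beta, best_extent = j, beta, extent
        # Correction 3: output and advance by L - j, not j.
        codes.append(dictionary[alpha[:L - best_j]])
        pos += L - best_j
        alpha = best_beta
    codes.append(dictionary[alpha])
    return codes, dictionary


codes, table = horspool_compress("abababab", 0)
print(codes)  # [0, 1, 2, 4, 1], the same parse as greedy LZW
```

Calling it with `k = 0` restricts the search to β0, which reproduces the greedy LZW parse; larger values of `k` let the encoder give up the tail of α when that leads to a longer two-block extent.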
