How does zip(*) generate n-grams?

I am reviewing some notes on n-grams, and I came across a couple of interesting functions. First there's this one to generate bigrams:

def bigrams(word):
    return sorted(list(set(''.join(bigram)
                           for bigram in zip(word,word[1:]))))

def bigram_print(word):
    print("The bigrams of", word, "are:")
    print(bigrams(word))

bigram_print("ababa")
bigram_print("babab")

After doing some reading and experimenting with Python on my own, I understand why this works. However, I am very puzzled by the use of zip(*[word[i:] for i in range(n)]) in the function below. I understand that * is an unpacking operator (as explained here), but I am getting tripped up by how it works in combination with the list comprehension. Can anyone explain?

def ngrams(word, n):
    return sorted(list(set(''.join(ngram)
                           for ngram in zip(*[word[i:]
                                              for i in range(n)]))))

def ngram_print(word, n):
    print("The {}-grams of {} are:".format(n, word))
    print(ngrams(word, n))

for n in [2, 3, 4]:
    ngram_print("ababa", n)
    ngram_print("babab", n)
    print()

If you break down

zip(*[word[i:] for i in range(n)])

the inner part is the list comprehension:

[word[i:] for i in range(n)]

which is equivalent to:

[word[0:], word[1:], word[2:], ... word[n-1:]]

Each of these is a string that starts at a different position in word.
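As a small illustration of that step, here is what the comprehension produces for one of the question's inputs. The choice of word = "ababa" and n = 3 is just an example.

```python
word = "ababa"
n = 3

# One slice per starting offset 0 .. n-1
slices = [word[i:] for i in range(n)]
print(slices)  # ['ababa', 'baba', 'aba']
```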

Now, if you apply the unpacking * operator to it:

*[word[0:], word[1:], word[2:], ... word[n-1:]]

each of the slices word[0:], word[1:], etc. is passed to zip() as a separate argument.

So zip is called like this:

zip(word[0:], word[1:], word[2:], ... word[n-1:])

which, according to how zip works, creates n-tuples, with each entry coming from the corresponding argument:

[(word[0:][0], word[1:][0], ...),
 (word[0:][1], word[1:][1], ...),
 ...]

If you map the indexes, you'll see that these values correspond to the n-gram definitions for word: word[i:][k] is word[i+k], so the k-th tuple holds word[k], word[k+1], ..., word[k+n-1].
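The index mapping can be checked directly. The sketch below uses word = "ababa" and n = 3 as an example and verifies that the unpacked call matches the hand-written call, and that each tuple is the same as the slice word[k:k+n].

```python
word = "ababa"
n = 3

slices = [word[i:] for i in range(n)]      # ['ababa', 'baba', 'aba']

unpacked = list(zip(*slices))              # the form used in ngrams()
explicit = list(zip(word[0:], word[1:], word[2:]))  # written out by hand for n = 3
assert unpacked == explicit

# zip stops at the shortest argument (word[n-1:]), so there are len(word) - n + 1 tuples
assert len(unpacked) == len(word) - n + 1

# The k-th tuple holds word[k], word[k+1], ..., word[k+n-1], i.e. the slice word[k:k+n]
for k, ngram in enumerate(unpacked):
    assert ''.join(ngram) == word[k:k + n]

print(unpacked)  # [('a', 'b', 'a'), ('b', 'a', 'b'), ('a', 'b', 'a')]
```

Because zip stops as soon as its shortest argument runs out, no incomplete n-grams are produced at the end of the word.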

The following example should explain how this works. I have added code and a visual representation of it.

Intuition

The core idea is to zip together multiple versions of the same list, where each version starts one element later than the previous one.

Let's say L is a list of words/elements ['A', 'B', 'C', 'D'].

Then what happens here is that L, L[1:], and L[2:] get zipped: the first elements of each (which are the 1st, 2nd, and 3rd elements of L) are grouped together, then the second elements are grouped together, and so on.

Visually this can be shown as:

[image]

The statement we are concerned with:

  zip (   *    [L[i:] for i in range(n)])
#|___||_______||________________________|     
#  |      |                  |
# zip  unpack    versions of L with the first 0 to n-1 elements skipped

Code example

l = ['A','B','C','D']

print('original list: '.ljust(27),l)
print('list skipping 1st element: ',l[1:])
print('list skipping 2 elements: '.ljust(27),l[2:])
print('bi-gram: '.ljust(27), list(zip(l,l[1:])))
print('tri-gram: '.ljust(27), list(zip(l,l[1:],l[2:])))

Output:

original list:              ['A', 'B', 'C', 'D']
list skipping 1st element:  ['B', 'C', 'D']
list skipping 2 elements:   ['C', 'D']
bi-gram:                    [('A', 'B'), ('B', 'C'), ('C', 'D')]
tri-gram:                   [('A', 'B', 'C'), ('B', 'C', 'D')]

As you can see, you are basically zipping the list with a copy of itself that skips the first element. This pairs up (A, B), (B, C), ... to form the bigrams.

The * operator is for unpacking. As i changes to skip more elements, the comprehension builds the list [l[0:], l[1:], l[2:], ...]. This list is passed to zip() with *, which unpacks it so that each slice becomes a separate argument.

zip(*[word[i:] for i in range(n)])  # where word is the list of words
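To see what the * changes, it may help to compare the call with and without it. In the sketch below, show_args is a hypothetical helper (not part of the original answer) that just returns the positional arguments it received.

```python
l = ['A', 'B', 'C', 'D']
lists = [l[i:] for i in range(2)]   # [['A', 'B', 'C', 'D'], ['B', 'C', 'D']]

def show_args(*args):
    """Hypothetical helper: return the positional arguments it received."""
    return args

# Without *, the outer list arrives as ONE argument
assert show_args(lists) == (lists,)
# With *, each inner list becomes its own argument
assert show_args(*lists) == (l[0:], l[1:])

without_star = list(zip(lists))    # 1-tuples, each wrapping a whole slice
with_star = list(zip(*lists))      # the bigrams

print(without_star)  # [(['A', 'B', 'C', 'D'],), (['B', 'C', 'D'],)]
print(with_star)     # [('A', 'B'), ('B', 'C'), ('C', 'D')]
```

Without the *, zip only sees a single iterable (the outer list), so it has nothing to pair elements with.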

Alternative to the list comprehension

The above list comprehension is equivalent to:

n = 3
lists = []
for i in range(n):
    print(l[i:])        # comment this out if not needed
    lists.append(l[i:])
    
out = list(zip(*lists))
print(out)

Output:

['A', 'B', 'C', 'D']
['B', 'C', 'D']
['C', 'D']

[('A', 'B', 'C'), ('B', 'C', 'D')]
