
How does zip(*) generate n-grams?

I am reviewing some notes on n-grams, and I came across a couple of interesting functions. First, there's this one to generate bigrams:

def bigrams(word):
    return sorted(list(set(''.join(bigram)
                           for bigram in zip(word,word[1:]))))

def bigram_print(word):
    print("The bigrams of", word, "are:")
    print(bigrams(word))

bigram_print("ababa")
bigram_print("babab")

After doing some reading and playing with Python on my own, I understand why this works. However, when looking at the function below, I am puzzled by the use of zip(*[word[i:] for i in range(n)]). I understand that * is an unpacking operator (as explained here), but I am getting tripped up by how it works in combination with the list comprehension. Can anyone explain?

def ngrams(word, n):
    return sorted(list(set(''.join(ngram)
                           for ngram in zip(*[word[i:]
                                              for i in range(n)]))))

def ngram_print(word, n):
    print("The {}-grams of {} are:".format(n, word))
    print(ngrams(word, n))

for n in [2, 3, 4]:
    ngram_print("ababa", n)
    ngram_print("babab", n)
    print()

If you break down

zip(*[word[i:] for i in range(n)])

You get:

[word[i:] for i in range(n)]

Which is equivalent to:

[word[0:], word[1:], word[2:], ... word[n-1:]]

Each of these is a slice of word that starts one position later than the previous one.

Now, if you apply the unpacking * operator to it:

*[word[0:], word[1:], word[2:], ... word[n-1:]]

You get each of the slices word[0:], word[1:], etc. passed to zip() as a separate argument.

So, zip is getting called like this:

zip(word[0:], word[1:], word[2:], ... word[n-1:])
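A quick way to convince yourself that the starred call and the spelled-out call are the same thing is to compare them directly. This is only a small sketch, using the word "ababa" from the question and assuming n = 3:

```python
word = "ababa"   # same word as in the question
n = 3            # assumed value, chosen for illustration

slices = [word[i:] for i in range(n)]
print(slices)    # ['ababa', 'baba', 'aba']

# Unpacking the list passes each slice as its own positional argument...
starred = list(zip(*slices))
# ...which is the same as writing the arguments out by hand.
explicit = list(zip(word[0:], word[1:], word[2:]))

print(starred == explicit)   # True
print(starred)               # [('a', 'b', 'a'), ('b', 'a', 'b'), ('a', 'b', 'a')]
```

Without the *, zip would receive a single argument (the list of slices) and would simply wrap each slice in a 1-tuple.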

Which, according to how zip works, creates n-tuples, with each entry coming from the corresponding argument:

[(word[0:][0], word[1:][0], ...),
 (word[0:][1], word[1:][1], ...),
 ...]

If you map the indexes, you'll see that the k-th tuple is (word[k], word[k+1], ..., word[k+n-1]), which is exactly the n-gram of word that starts at position k.
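The index mapping can also be checked mechanically. The sketch below compares the zip-based version with plain slicing; both function names are hypothetical and only exist for this comparison:

```python
def ngrams_by_zip(word, n):
    # Hypothetical helper: the zip(*...) idiom from the question,
    # without the set()/sorted() de-duplication.
    return [''.join(t) for t in zip(*[word[i:] for i in range(n)])]

def ngrams_by_slicing(word, n):
    # Hypothetical helper: take the window word[k:k+n] for every valid start k.
    return [word[k:k + n] for k in range(len(word) - n + 1)]

word = "ababa"
for n in range(1, len(word) + 2):
    # Both approaches produce the same n-grams in the same order.
    print(n, ngrams_by_zip(word, n) == ngrams_by_slicing(word, n))

print(ngrams_by_zip(word, 3))   # ['aba', 'bab', 'aba']
```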

The following example should explain how this works. I have added code and a visual representation of it.

Intuition

The core idea is to zip together multiple copies of the same list, where each copy starts one element later than the previous one.

Let's say L is a list of words/elements ['A', 'B', 'C', 'D'].

Then what happens here is that L, L[1:], and L[2:] get zipped. The first elements of each of these (which are the 1st, 2nd, and 3rd elements of L) are grouped together, then the second elements are grouped together, and so on.

Visually this can be shown as:

[image: enter image description here]

The statement in question:

  zip (   *    [L[i:] for i in range(n)])
#|___||_______||________________________|     
#  |      |                  |
# zip  unpack    copies of L with the first 0 to n-1 elements skipped

Code example

l = ['A','B','C','D']

print('original list: '.ljust(27),l)
print('list skipping 1st element: ',l[1:])
print('list skipping 2 elements: '.ljust(27),l[2:])
print('bi-gram: '.ljust(27), list(zip(l,l[1:])))
print('tri-gram: '.ljust(27), list(zip(l,l[1:],l[2:])))

Output:

original list:              ['A', 'B', 'C', 'D']
list skipping 1st element:  ['B', 'C', 'D']
list skipping 2 elements:   ['C', 'D']
bi-gram:                    [('A', 'B'), ('B', 'C'), ('C', 'D')]
tri-gram:                   [('A', 'B', 'C'), ('B', 'C', 'D')]

As you can see, you are zipping the list with a copy of itself that skips the first element. This pairs up (A, B), (B, C), ... to form the bigrams.
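One detail worth spelling out: the shifted copies have different lengths, and zip stops as soon as the shortest one is exhausted. That is why no incomplete n-grams appear at the end. A small sketch using the same list (the variable names shifted and trigrams are only illustrative):

```python
l = ['A', 'B', 'C', 'D']
n = 3

shifted = [l[i:] for i in range(n)]
print([len(s) for s in shifted])   # [4, 3, 2]

# zip stops at the shortest input, so only complete 3-tuples are produced.
trigrams = list(zip(*shifted))
print(trigrams)                    # [('A', 'B', 'C'), ('B', 'C', 'D')]
print(len(trigrams) == len(l) - n + 1)   # True
```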

The * operator is for unpacking. As i changes, the slice l[i:] skips more elements, so the comprehension builds the list [l[0:], l[1:], l[2:], ...]. The * then unpacks that list so that each slice is passed to zip() as a separate argument.

zip(*[word[i:] for i in range(n)])  # where word is the list of words

Alternative to the list comprehension

The list comprehension above is equivalent to:

n = 3
lists = []
for i in range(n):
    print(l[i:])        # comment this out if not needed
    lists.append(l[i:])

out = list(zip(*lists))
print(out)

Output:

['A', 'B', 'C', 'D']
['B', 'C', 'D']
['C', 'D']

[('A', 'B', 'C'), ('B', 'C', 'D')]
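If it helps, the loop can be wrapped in a function so that it works for any n. This is a minimal sketch; ngrams_from_list is a hypothetical name and not part of the code above:

```python
def ngrams_from_list(items, n):
    """Return the n-grams of items as a list of tuples (hypothetical helper)."""
    shifted = []
    for i in range(n):
        shifted.append(items[i:])   # copy of items with the first i elements skipped
    return list(zip(*shifted))      # each shifted copy becomes one argument to zip

l = ['A', 'B', 'C', 'D']
print(ngrams_from_list(l, 2))   # [('A', 'B'), ('B', 'C'), ('C', 'D')]
print(ngrams_from_list(l, 3))   # [('A', 'B', 'C'), ('B', 'C', 'D')]
print(ngrams_from_list(l, 5))   # [] because the last shifted copy is empty
```

When n is larger than the list, at least one shifted copy is empty, so zip yields nothing and the result is an empty list.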
