I am reviewing some notes on n-grams, and I came across a couple of interesting functions. First there's this one to generate bigrams:
def bigrams(word):
    return sorted(list(set(''.join(bigram)
                           for bigram in zip(word, word[1:]))))

def bigram_print(word):
    print("The bigrams of", word, "are:")
    print(bigrams(word))

bigram_print("ababa")
bigram_print("babab")
After doing some reading and playing on my own with Python I understand why this works. However, when looking at the ngrams function below, I am very puzzled by the use of zip(*[word[i:] for i in range(n)]) here. I understand that * is an unpacking operator, but I am really getting tripped up by how it works in combination with the list comprehension. Can anyone explain?
def ngrams(word, n):
    return sorted(list(set(''.join(ngram)
                           for ngram in zip(*[word[i:]
                                              for i in range(n)]))))

def ngram_print(word, n):
    print("The {}-grams of {} are:".format(n, word))
    print(ngrams(word, n))

for n in [2, 3, 4]:
    ngram_print("ababa", n)
    ngram_print("babab", n)
    print()
If you break down
zip(*[word[i:] for i in range(n)])
the inner list comprehension is:
[word[i:] for i in range(n)]
which is equivalent to:
[word[0:], word[1:], word[2:], ... word[n-1:]]
Each of these is a string that starts at a different position in word.
Now, if you apply the unpacking operator * to it:
*[word[0:], word[1:], word[2:], ... word[n-1:]]
each of the strings word[0:], word[1:], etc. is passed to zip() as a separate argument.
So zip is called like this:
zip(word[0:], word[1:], word[2:], ... word[n-1:])
which, according to how zip works, creates n-tuples, with each entry coming from the corresponding argument:
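[(word[0:][0], word[1:][0], ...),
 (word[0:][1], word[1:][1], ...),
 ...]
If you map the indexes (word[i:][k] is word[i+k]), you'll see that the k-th tuple is (word[k], word[k+1], ..., word[k+n-1]), which is exactly the n-gram of word starting at position k. Because zip stops at its shortest argument, word[n-1:], you get len(word) - n + 1 tuples.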
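To make the index mapping concrete, here is a minimal sketch that traces the steps for word = "ababa" and n = 3. The variable names slices, unpacked, and explicit are illustrative only and do not come from the original functions.

```python
word = "ababa"
n = 3

# The n slices, each starting one character later than the previous one.
slices = [word[i:] for i in range(n)]
print(slices)  # ['ababa', 'baba', 'aba']

# Unpacking with * is the same as passing each slice as its own argument.
unpacked = list(zip(*slices))
explicit = list(zip(slices[0], slices[1], slices[2]))
print(unpacked == explicit)  # True

# zip stops at the shortest slice, so there are len(word) - n + 1 tuples.
print(unpacked)  # [('a', 'b', 'a'), ('b', 'a', 'b'), ('a', 'b', 'a')]

# Tuple k holds word[k], ..., word[k+n-1], i.e. the n-gram starting at k.
for k, gram in enumerate(unpacked):
    assert ''.join(gram) == word[k:k + n]
```

The set() call in ngrams then removes the duplicate 'aba', and sorted() orders what is left.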
The following example should explain how this works. I have added code and a visual representation of it.
The core idea is to zip together multiple versions of the same list, each starting one element later than the previous one.
Let's say L is a list of words/elements ['A', 'B', 'C', 'D'].
Then L, L[1:], and L[2:] get zipped, which means the first elements of each (the 1st, 2nd, and 3rd elements of L) are grouped together, then the second elements are grouped together, and so on.
Visually, the statement in question breaks down as:
zip  (    *    [L[i:] for i in range(n)])
#|___||_______||________________________|
#  |      |                |
# zip   unpack   versions of L with the first 0 to n-1 elements skipped
l = ['A','B','C','D']
print('original list: '.ljust(27),l)
print('list skipping 1st element: ',l[1:])
print('list skipping 2 elements: '.ljust(27),l[2:])
print('bi-gram: '.ljust(27), list(zip(l,l[1:])))
print('tri-gram: '.ljust(27), list(zip(l,l[1:],l[2:])))
original list:              ['A', 'B', 'C', 'D']
list skipping 1st element:  ['B', 'C', 'D']
list skipping 2 elements:   ['C', 'D']
bi-gram:                    [('A', 'B'), ('B', 'C'), ('C', 'D')]
tri-gram:                   [('A', 'B', 'C'), ('B', 'C', 'D')]
As you can see, you are basically zipping the list with a copy of itself that skips the first element. This pairs each element with its successor, giving (A, B), (B, C), ... for bigrams.
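As a rough check of that claim, the sketch below compares the zipped pairs with the same pairs built from explicit indexes. The names pairs and by_index are made up for this illustration.

```python
l = ['A', 'B', 'C', 'D']

# zip(l, l[1:]) pairs each element with its successor.
pairs = list(zip(l, l[1:]))

# The same pairs, written with explicit indexes for comparison.
by_index = [(l[k], l[k + 1]) for k in range(len(l) - 1)]

print(pairs)              # [('A', 'B'), ('B', 'C'), ('C', 'D')]
print(pairs == by_index)  # True
```

The last element 'D' has no successor, which is why zip (stopping at the shorter input) yields len(l) - 1 pairs.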
The * operator is for unpacking. As i increases, each slice skips one more element, so the comprehension builds the list [l[0:], l[1:], l[2:], ...]. The * then unpacks this list so that each slice is passed to zip() as a separate argument.
zip(*[word[i:] for i in range(n)])  # where word is the list of words
The above is equivalent to:
n = 3
lists = []
for i in range(n):
    print(l[i:])  # comment this out if not needed
    lists.append(l[i:])
out = list(zip(*lists))
print(out)
['A', 'B', 'C', 'D']
['B', 'C', 'D']
['C', 'D']
[('A', 'B', 'C'), ('B', 'C', 'D')]
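To see why the * matters, it may help to compare the call with and without it. This is a small sketch using the same example list; the names slices, without_star, and with_star are illustrative.

```python
l = ['A', 'B', 'C', 'D']
slices = [l[i:] for i in range(3)]

# Without *, zip receives ONE argument (the outer list), so each
# tuple wraps a whole slice instead of pairing up elements.
without_star = list(zip(slices))
print(without_star)
# [(['A', 'B', 'C', 'D'],), (['B', 'C', 'D'],), (['C', 'D'],)]

# With *, zip receives three separate arguments and pairs them element-wise.
with_star = list(zip(*slices))
print(with_star)
# [('A', 'B', 'C'), ('B', 'C', 'D')]
```

Only the unpacked call produces the trigrams; the version without * just wraps each slice in a 1-tuple.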