Python: need help understanding the difference between these two pieces of code
What is the difference between the following two pieces of code?
distances = ((jaccard_distance(set(nltk.ngrams(entry, gram_number)),
                               set(nltk.ngrams(word, gram_number))), word)
             for word in spellings)
and
for word in spellings:
    distances = ((jaccard_distance(set(nltk.ngrams(entry, gram_number)),
                                   set(nltk.ngrams(word, gram_number))), word))
What exactly is the difference? Thanks in advance for your help.
First, compute the Jaccard distance between two bags of words, i.e. between the sets of unique words in two sentences.
>>> from nltk.metrics import jaccard_distance
>>> from nltk import ngrams
>>> sent1 = "This is a foo bar sentence".split()
>>> sent2 = "A bar bar black sheep have you a sentence".split()
>>> set(sent1) # The set of unique words in sent1
set(['a', 'bar', 'sentence', 'This', 'is', 'foo'])
>>> set(sent2) # The set of unique words in sent2
set(['A', 'sheep', 'bar', 'sentence', 'black', 'a', 'have', 'you'])
>>> jaccard_distance(set(sent1), set(sent2))
0.7272727272727273
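The figure above can be checked by hand: the Jaccard distance is the number of items found in only one of the two sets, divided by the number of items in their union. Below is a minimal sketch in plain Python, without NLTK. The helper name jaccard is made up for illustration and is not NLTK's own implementation, and treating two empty sets as identical is an assumption.

```python
def jaccard(a, b):
    """Jaccard distance between two sets: (|a | b| - |a & b|) / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return (len(union) - len(a & b)) / len(union)

sent1 = "This is a foo bar sentence".split()
sent2 = "A bar bar black sheep have you a sentence".split()

shared = set(sent1) & set(sent2)    # words that occur in both sentences
combined = set(sent1) | set(sent2)  # words that occur in either sentence

print(sorted(shared), len(combined))
print(jaccard(set(sent1), set(sent2)))
```

The two sentences share 3 of 11 unique words ('a', 'bar' and 'sentence'; 'A' and 'a' are different strings), so the distance is 8/11.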
Now, if we use bags of n-grams instead:
>>> list(ngrams(sent2, 3)) # list of tri-grams in sent2.
[('A', 'bar', 'bar'), ('bar', 'bar', 'black'), ('bar', 'black', 'sheep'), ('black', 'sheep', 'have'), ('sheep', 'have', 'you'), ('have', 'you', 'a'), ('you', 'a', 'sentence')]
>>> set(list(ngrams(sent2, 3))) # unique set of tri-grams in sent2.
set([('A', 'bar', 'bar'), ('have', 'you', 'a'), ('you', 'a', 'sentence'), ('sheep', 'have', 'you'), ('black', 'sheep', 'have'), ('bar', 'black', 'sheep'), ('bar', 'bar', 'black')])
>>> set(ngrams(sent2, 3))
set([('A', 'bar', 'bar'), ('have', 'you', 'a'), ('you', 'a', 'sentence'), ('sheep', 'have', 'you'), ('black', 'sheep', 'have'), ('bar', 'black', 'sheep'), ('bar', 'bar', 'black')])
>>> set(ngrams(sent1, 3))
set([('This', 'is', 'a'), ('a', 'foo', 'bar'), ('is', 'a', 'foo'), ('foo', 'bar', 'sentence')])
>>> jaccard_distance(set(ngrams(sent1,3)), set(ngrams(sent2, 3)))
1.0
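What does a Jaccard distance of 1.0 mean?
It means the two sequences being compared are completely different: in this case, the two sentences do not share a single trigram.
Previously, we split each sentence string into a list of strings, so comparing the two sequences meant comparing the words (or n-grams of words) in the sentences.
Now, if we iterate over two words instead of two sentences, each word is treated as a sequence of characters, i.e.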
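To see why the distance is exactly 1.0, it helps to look at the intersection of the two trigram sets directly. This is a rough sketch in plain Python; simple_ngrams is a hypothetical stand-in written for this example, not NLTK's ngrams function.

```python
def simple_ngrams(seq, n):
    """Return the n-grams of seq (a string or a list) as a list of tuples."""
    # Zip the sequence with copies of itself shifted by 1, 2, ..., n-1 items.
    return list(zip(*(seq[i:] for i in range(n))))

sent1 = "This is a foo bar sentence".split()
sent2 = "A bar bar black sheep have you a sentence".split()

tri1 = set(simple_ngrams(sent1, 3))
tri2 = set(simple_ngrams(sent2, 3))
shared = tri1 & tri2  # trigrams that occur in both sentences

print(len(tri1), len(tri2), shared)
```

The intersection is empty, so every trigram in the union belongs to only one sentence and the distance is 11/11 = 1.0.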
>>> word1 = 'Supercalifragilisticexpialidocious'
>>> word2 = 'Honorificabilitudinitatibus'
>>> list(word1) # The list of characters in the word
['S', 'u', 'p', 'e', 'r', 'c', 'a', 'l', 'i', 'f', 'r', 'a', 'g', 'i', 'l', 'i', 's', 't', 'i', 'c', 'e', 'x', 'p', 'i', 'a', 'l', 'i', 'd', 'o', 'c', 'i', 'o', 'u', 's']
>>> set(list(word1)) # The set of unique characters in the word
set(['a', 'c', 'e', 'd', 'g', 'f', 'i', 's', 'l', 'o', 'p', 'S', 'r', 'u', 't', 'x'])
>>> set(ngrams(word1, 3)) # The set of unique character trigrams in the word.
set([('c', 'a', 'l'), ('S', 'u', 'p'), ('t', 'i', 'c'), ('d', 'o', 'c'), ('f', 'r', 'a'), ('i', 'f', 'r'), ('r', 'a', 'g'), ('i', 's', 't'), ('s', 't', 'i'), ('x', 'p', 'i'), ('u', 'p', 'e'), ('o', 'u', 's'), ('i', 'c', 'e'), ('l', 'i', 'f'), ('p', 'e', 'r'), ('o', 'c', 'i'), ('g', 'i', 'l'), ('l', 'i', 'd'), ('i', 'l', 'i'), ('c', 'i', 'o'), ('r', 'c', 'a'), ('l', 'i', 's'), ('a', 'g', 'i'), ('p', 'i', 'a'), ('i', 'o', 'u'), ('e', 'x', 'p'), ('i', 'a', 'l'), ('c', 'e', 'x'), ('a', 'l', 'i'), ('i', 'd', 'o'), ('e', 'r', 'c')])
And get the Jaccard distance between them:
>>> jaccard_distance(set(ngrams(word1, 3)), set(ngrams(word2, 3)))
0.9818181818181818
Now, on to the OP's question:
distances = ((jaccard_distance(set(nltk.ngrams(entry, gram_number)),
                               set(nltk.ngrams(word, gram_number))), word)
             for word in spellings)
versus
for word in spellings:
    distances = ((jaccard_distance(set(nltk.ngrams(entry, gram_number)),
                                   set(nltk.ngrams(word, gram_number))), word))
The first thing you can do is simplify the code.
Instead of typing nltk.ngrams(...) over and over, you can do this:
>>> from nltk import ngrams
>>> list(ngrams('foobar', 3))
[('f', 'o', 'o'), ('o', 'o', 'b'), ('o', 'b', 'a'), ('b', 'a', 'r')]
And if you only use n-grams of order 2 or 3, i.e. bigrams or trigrams, you can do the following:
>>> from nltk import bigrams, trigrams
>>> list(bigrams('foobar'))
[('f', 'o'), ('o', 'o'), ('o', 'b'), ('b', 'a'), ('a', 'r')]
>>> list(trigrams('foobar'))
[('f', 'o', 'o'), ('o', 'o', 'b'), ('o', 'b', 'a'), ('b', 'a', 'r')]
And if you want to get fancy and create a custom function for the n-gram order you need, you can try functools.partial:
>>> from functools import partial
>>> from nltk import ngrams
>>> octagram = partial(ngrams, n=8)
>>> word = 'Supercalifragilisticexpialidocious'
>>> octagram(word)
<generator object ngrams at 0x10cafff00>
>>> list(octagram(word))
[('S', 'u', 'p', 'e', 'r', 'c', 'a', 'l'), ('u', 'p', 'e', 'r', 'c', 'a', 'l', 'i'), ('p', 'e', 'r', 'c', 'a', 'l', 'i', 'f'), ('e', 'r', 'c', 'a', 'l', 'i', 'f', 'r'), ('r', 'c', 'a', 'l', 'i', 'f', 'r', 'a'), ('c', 'a', 'l', 'i', 'f', 'r', 'a', 'g'), ('a', 'l', 'i', 'f', 'r', 'a', 'g', 'i'), ('l', 'i', 'f', 'r', 'a', 'g', 'i', 'l'), ('i', 'f', 'r', 'a', 'g', 'i', 'l', 'i'), ('f', 'r', 'a', 'g', 'i', 'l', 'i', 's'), ('r', 'a', 'g', 'i', 'l', 'i', 's', 't'), ('a', 'g', 'i', 'l', 'i', 's', 't', 'i'), ('g', 'i', 'l', 'i', 's', 't', 'i', 'c'), ('i', 'l', 'i', 's', 't', 'i', 'c', 'e'), ('l', 'i', 's', 't', 'i', 'c', 'e', 'x'), ('i', 's', 't', 'i', 'c', 'e', 'x', 'p'), ('s', 't', 'i', 'c', 'e', 'x', 'p', 'i'), ('t', 'i', 'c', 'e', 'x', 'p', 'i', 'a'), ('i', 'c', 'e', 'x', 'p', 'i', 'a', 'l'), ('c', 'e', 'x', 'p', 'i', 'a', 'l', 'i'), ('e', 'x', 'p', 'i', 'a', 'l', 'i', 'd'), ('x', 'p', 'i', 'a', 'l', 'i', 'd', 'o'), ('p', 'i', 'a', 'l', 'i', 'd', 'o', 'c'), ('i', 'a', 'l', 'i', 'd', 'o', 'c', 'i'), ('a', 'l', 'i', 'd', 'o', 'c', 'i', 'o'), ('l', 'i', 'd', 'o', 'c', 'i', 'o', 'u'), ('i', 'd', 'o', 'c', 'i', 'o', 'u', 's')]
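As a side note on what partial does here: it only pre-fills the n argument, and the wrapped function is still reachable. The sketch below uses a hypothetical stand-in (simple_ngrams) instead of NLTK's ngrams so that it runs without NLTK installed.

```python
from functools import partial

def simple_ngrams(seq, n):
    """Hypothetical stand-in for nltk.ngrams: n-grams of seq as tuples."""
    return list(zip(*(seq[i:] for i in range(n))))

# Pre-fill n=3; the resulting callable only needs the sequence.
trigrams_of = partial(simple_ngrams, n=3)

same = trigrams_of('foobar') == simple_ngrams('foobar', 3)
overridden = trigrams_of('foobar', n=2)  # a keyword given at call time wins

print(same, trigrams_of.keywords, overridden[:2])
```

A plain def wrapper does the same job; partial just saves the boilerplate when all you want is a fixed n.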
Instead of rewriting set(nltk.ngrams(word, gram_number)) every time, you can define a helper and simply call uco(word):
>>> from nltk import ngrams
>>> def unique_character_octagrams(text, n=8):
...     return set(ngrams(text, n))
...
>>> uco = unique_character_octagrams
>>> uco(word1)
set([('e', 'x', 'p', 'i', 'a', 'l', 'i', 'd'), ('S', 'u', 'p', 'e', 'r', 'c', 'a', 'l'), ('i', 'c', 'e', 'x', 'p', 'i', 'a', 'l'), ('a', 'g', 'i', 'l', 'i', 's', 't', 'i'), ('t', 'i', 'c', 'e', 'x', 'p', 'i', 'a'), ('i', 'l', 'i', 's', 't', 'i', 'c', 'e'), ('i', 'd', 'o', 'c', 'i', 'o', 'u', 's'), ('c', 'e', 'x', 'p', 'i', 'a', 'l', 'i'), ('l', 'i', 's', 't', 'i', 'c', 'e', 'x'), ('f', 'r', 'a', 'g', 'i', 'l', 'i', 's'), ('l', 'i', 'f', 'r', 'a', 'g', 'i', 'l'), ('i', 'f', 'r', 'a', 'g', 'i', 'l', 'i'), ('p', 'i', 'a', 'l', 'i', 'd', 'o', 'c'), ('a', 'l', 'i', 'f', 'r', 'a', 'g', 'i'), ('x', 'p', 'i', 'a', 'l', 'i', 'd', 'o'), ('e', 'r', 'c', 'a', 'l', 'i', 'f', 'r'), ('l', 'i', 'd', 'o', 'c', 'i', 'o', 'u'), ('g', 'i', 'l', 'i', 's', 't', 'i', 'c'), ('i', 's', 't', 'i', 'c', 'e', 'x', 'p'), ('r', 'c', 'a', 'l', 'i', 'f', 'r', 'a'), ('r', 'a', 'g', 'i', 'l', 'i', 's', 't'), ('i', 'a', 'l', 'i', 'd', 'o', 'c', 'i'), ('p', 'e', 'r', 'c', 'a', 'l', 'i', 'f'), ('a', 'l', 'i', 'd', 'o', 'c', 'i', 'o'), ('u', 'p', 'e', 'r', 'c', 'a', 'l', 'i'), ('c', 'a', 'l', 'i', 'f', 'r', 'a', 'g'), ('s', 't', 'i', 'c', 'e', 'x', 'p', 'i')])
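In the OP, you used for word in spellings to iterate over the spellings, but it is not clear what spellings is. It would be better to include a sample of the spellings input in the OP, so that answerers don't have to guess in the dark what spellings actually is.
Judging from the loop and the use of the Jaccard distance, spellings looks like a list of words, so a better variable name would be list_of_words, and the iteration then reads clearly without any comment, e.g. for word in list_of_words.
Also, the entry variable is ambiguous. From its usage, it is most likely the query you want to run against the list of words, so a possible variable name is query_word.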
from nltk import ngrams
from nltk.metrics import jaccard_distance

def unique_character_trigrams(text, n=3):
    # Set of unique character n-grams (trigrams by default) in text.
    return set(ngrams(text, n))

uct = unique_character_trigrams
list_of_words = ['Supercalifragilisticexpialidocious', 'Honorificabilitudinitatibus']
query_word = 'Antidisestablishmentarianism'
for word in list_of_words:
    # Compare the query against each candidate word in turn.
    d = jaccard_distance(uct(query_word), uct(word))
    print("Comparing {} vs {}\nJaccard = {}\n".format(query_word, word, d))
[out]:
Comparing Antidisestablishmentarianism vs Supercalifragilisticexpialidocious
Jaccard = 0.982142857143
Comparing Antidisestablishmentarianism vs Honorificabilitudinitatibus
Jaccard = 1.0
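Now, really getting back to the OP's question. Let us treat:
spellings as x, a list of numbers
entry as y, a fixed number
word as num, one number from the list
jaccard_distance as f, a simple subtraction function
In the first case, the syntax that puts the loop inline inside parentheses is a generator expression (a list comprehension would use square brackets instead). The output is a generator, so you have to materialize it with list, and each element it yields is an output of f: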
>>> x = [10, 20, 30] # A list of numbers.
>>> y = 3 # A number to compare against the list.
>>> f = lambda x, y: x - y # A simple function to do x - y
>>> f(10,3)
7
>>> f(20,3)
17
>>> result = (f(num,y) for num in x)
>>> result
<generator object <genexpr> at 0x10cafff00>
>>> list(result)
[7, 17, 27]
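One thing to watch out for with the parenthesized form: it is a generator expression, which is lazy and can be consumed only once, whereas the square-bracket form (a list comprehension) builds the whole list immediately. A small sketch with the same toy values:

```python
x = [10, 20, 30]
y = 3

def f(a, b):
    return a - b

gen = (f(num, y) for num in x)  # generator expression: lazy, single-use
lst = [f(num, y) for num in x]  # list comprehension: built right away

first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left to yield

print(first_pass, second_pass, lst)
```

So if distances has to be read more than once, it is safer to build a list up front.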
In the second case, which is the more traditional way of iterating, you get a single integer output at each iteration of the loop:
>>> for num in x:
...     result = f(num, y)
...     print(type(result), result)
...
(<type 'int'>, 7)
(<type 'int'>, 17)
(<type 'int'>, 27)
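Note that the loop rebinds result on every pass, so only the last value is left once the loop finishes. If the goal is to keep every value, as the inline form does, one common fix is to append to a list. A minimal sketch with the same toy values:

```python
x = [10, 20, 30]
y = 3

def f(a, b):
    return a - b

# Plain assignment: each iteration throws away the previous value.
result = None
for num in x:
    result = f(num, y)

# Appending: every value is kept, like list(f(num, y) for num in x).
results = []
for num in x:
    results.append(f(num, y))

print(result, results)
```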
In case 1:
distances is a generator that yields one (distance, word) tuple for every word in spellings, for example:
(0.1111111111111111, 'hello')
(0.2222222222222222, 'world')
(0.5, 'program')
(0.2727272727272727, 'computer')
(0.0, 'spell')
In case 2:
distances is overwritten on every iteration, so after the loop it holds only the last value:
(0.0, 'spell')
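Putting the two cases side by side on character trigrams makes the difference concrete. This is only a sketch: entry and spellings below are made-up sample data (the OP never shows them), trigram_set and jaccard are plain-Python stand-ins for the NLTK calls, and using min to pick the closest word is one plausible use of the (distance, word) tuples, not something the OP states.

```python
def trigram_set(text, n=3):
    """Set of unique character n-grams of text (stand-in for set(nltk.ngrams(...)))."""
    return set(zip(*(text[i:] for i in range(n))))

def jaccard(a, b):
    """Jaccard distance between two sets (stand-in for nltk's jaccard_distance)."""
    union = a | b
    return (len(union) - len(a & b)) / len(union) if union else 0.0

entry = "speling"                                    # hypothetical misspelt query
spellings = ["hello", "world", "spelling", "spell"]  # hypothetical candidates
gram_number = 3

# Case 1: a generator of (distance, word) tuples, one per candidate.
distances = ((jaccard(trigram_set(entry, gram_number),
                      trigram_set(word, gram_number)), word)
             for word in spellings)
closest = min(distances)  # tuples compare by distance first

# Case 2: a single tuple that is overwritten on every iteration.
for word in spellings:
    last = (jaccard(trigram_set(entry, gram_number),
                    trigram_set(word, gram_number)), word)

print(closest)
print(last)
```

With this sample data, case 1 lets you pick 'spelling' as the closest candidate, while case 2 is left with whatever word came last in the list ('spell'), regardless of its distance.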