How do I find a specific bigram using nltk in Python?

Question

I am currently working with nltk.book in Python and would like to find the frequency of a specific bigram. I know there is the bigrams() function, which gives you the bigrams in a text, as in this code:

    >>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
    [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
    >>>

But what if I am searching for only one specific bigram, such as "wish for"? I couldn't find anything about that in the nltk documentation so far.

Answer 1

If you have the bigrams as a list of tuples, you can test for a specific one with the in operator:

    >>> bgrms = [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
    >>> ('more', 'is') in bgrms
    True
    >>> ('wish', 'for') in bgrms
    False
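The same membership test also works on pairs built straight from a token list. The following is a minimal sketch, assuming the text is available as a plain Python list of tokens. It uses zip(tokens, tokens[1:]) in place of nltk's bigrams() so that it runs without nltk installed; for a list, both yield the same consecutive pairs. The helper name has_bigram and the sample tokens are hypothetical, chosen only for illustration.

```python
def has_bigram(tokens, first, second):
    """Return True if `first` is immediately followed by `second` in `tokens`."""
    # zip pairs each token with its successor, i.e. it yields the bigrams.
    return (first, second) in zip(tokens, tokens[1:])


# Hypothetical sample tokens, for illustration only.
tokens = ['i', 'wish', 'for', 'more', 'than', 'is', 'said']

print(has_bigram(tokens, 'wish', 'for'))   # the words are adjacent, in this order
print(has_bigram(tokens, 'for', 'wish'))   # order matters for a bigram
```

Note that this only reports whether the bigram occurs at all, not how often.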

Then, if you're looking for the frequency of specific bigrams, it might be helpful to build a Counter:

    from nltk import bigrams
    from collections import Counter

    # Build every consecutive word pair from the token list
    bgrms = list(bigrams(['more', 'is', 'said', 'than', 'wish', 'for', 'wish', 'for']))

    # Count how often each bigram occurs
    bgrm_counter = Counter(bgrms)

    # Query the Counter for a specific bigram; a bigram that never occurs gives 0
    print(bgrm_counter[("wish", "for")])

Output:

    2

Finally, if you want to understand this frequency relative to the total number of bigrams in the text, you can divide the count by the length of the bigram list:

    # Divide the bigram's count by the total number of bigrams in `bgrms`
    print(bgrm_counter[("wish", "for")] / len(bgrms))

Output:

    0.2857142857142857
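The two steps above can be wrapped in one small function. This is only a sketch, assuming a plain list of tokens as input. The name bigram_stats is hypothetical, and zip(tokens, tokens[1:]) again stands in for nltk's bigrams() so the example has no third-party dependency. Indexing the Counter with square brackets returns 0 for a bigram that never occurs, whereas .get() would return None and make the division fail.

```python
from collections import Counter


def bigram_stats(tokens, bigram):
    """Return (count, relative_frequency) of `bigram` among all bigrams in `tokens`."""
    pairs = list(zip(tokens, tokens[1:]))   # all consecutive word pairs
    counts = Counter(pairs)
    count = counts[bigram]                  # 0 if the bigram never occurs
    # Guard against texts with fewer than two tokens (no bigrams at all).
    relative = count / len(pairs) if pairs else 0.0
    return count, relative


tokens = ['more', 'is', 'said', 'than', 'wish', 'for', 'wish', 'for']

count, relative = bigram_stats(tokens, ('wish', 'for'))
print(count, relative)

# A bigram that is absent from the text yields a count of 0 instead of an error.
print(bigram_stats(tokens, ('done', 'deal')))
```

With the token list from the answer, this gives the same count and ratio as the Counter example.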
