NLTK count frequency of sub phrase
For this sentence: "I see a tall tree outside. A man is under the tall tree"
How do I count the frequency of "tall tree"? I can use bigrams with a frequency distribution, such as:
bgs = nltk.bigrams(tokens)
fdist1 = nltk.FreqDist(bgs)
pairs = fdist1.most_common(500)
but all I need is to count one specific sub-phrase.
@uday1889's answer has some flaws:
>>> string = "I see a tall tree outside. A man is under the tall tree"
>>> string.count("tall tree")
2
>>> string = "I see a stall tree outside. A man is under the tall trees"
>>> string.count("tall tree")
2
>>> string = "I would like to install treehouses at my yard"
>>> string.count("tall tree")
1
A cheap hack would be to pad the substring with spaces in str.count():
>>> string = "I would like to install treehouses at my yard"
>>> string.count("tall tree")
1
>>> string.count(" tall tree ")
0
>>> string = "I see a stall tree outside. A man is under the tall trees"
>>> string.count(" tall tree ")
0
>>> string = "I see a tall tree outside. A man is under the tall tree"
>>> string.count(" tall tree ")
1
But as you can see, there are some problems when the substring is at the start or end of a sentence or next to punctuation. Tokenizing the text and comparing bigrams avoids these problems:
>>> from nltk.util import ngrams
>>> from nltk import word_tokenize
>>> string = "I see a tall tree outside. A man is under the tall tree"
>>> len([i for i in ngrams(word_tokenize(string),n=2) if i==('tall', 'tree')])
2
>>> string = "I would like to install treehouses at my yard"
>>> len([i for i in ngrams(word_tokenize(string),n=2) if i==('tall', 'tree')])
0
The count() method should do it:
string = "I see a tall tree outside. A man is under the tall tree"
string.count("tall tree")
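If the simplicity of str.count() is appealing but matches inside longer words (such as "install treehouses") are not wanted, one possible variation is a regular expression with word boundaries. This is only a sketch; `count_whole_phrase` is a hypothetical name, not something from NLTK or the answers here.

```python
import re

def count_whole_phrase(text, phrase):
    # \b anchors the match at word boundaries, so "stall tree" and
    # "tall trees" are not counted; re.escape protects any regex
    # metacharacters that might appear in the phrase.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text))
```

Unlike padding the substring with spaces, the word-boundary pattern still matches at the start or end of the string and next to punctuation.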