How do I ignore certain words when counting word frequencies in a text?

Question

How can I ignore words such as "a" and "the" when counting how often each word occurs in a text?

import collections

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'phrase': pd.Series('The large distance between cities. The small distance. The')})
# Split the text into word tokens (no lowercasing, no stop-word removal).
f = CountVectorizer().build_tokenizer()(str(df['phrase']))

# Most frequent token and its count.
result = collections.Counter(f).most_common(1)

print(result)

The result is the word "the", but I would like to get "distance" as the most frequent word.

Answer 1

It is best to avoid counting those entries in the first place, like this:

ignore = {'the','a','if','in','it','of','or'}
# Tokens from build_tokenizer() keep their case, so compare in lowercase.
result = collections.Counter(x for x in f if x.lower() not in ignore).most_common(1)
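
The filtering idea above can also be sketched without scikit-learn. This is a minimal, self-contained version that assumes a plain string as input and uses a regular expression for words of two or more characters, similar to the CountVectorizer default. The names text, tokens and counts are illustrative and do not come from the original post. Tokens are lowercased before the lookup so that "The" matches the entry 'the' in the ignore set.

```python
import collections
import re

text = 'The large distance between cities. The small distance. The'
ignore = {'the', 'a', 'if', 'in', 'it', 'of', 'or'}

# Lowercase first, then keep runs of two or more word characters.
tokens = re.findall(r'\b\w\w+\b', text.lower())

# Skip ignored words while counting instead of removing them afterwards.
counts = collections.Counter(t for t in tokens if t not in ignore)
result = counts.most_common(1)
print(result)  # [('distance', 2)]
```

Filtering inside the generator expression means the ignored words never enter the Counter, so most_common() needs no extra clean-up step.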

Answer 2

An alternative is to use the stop_words parameter of CountVectorizer.
These are words that you are not interested in, and the analyzer will discard them.

f = CountVectorizer(stop_words={'the','a','if','in','it','of','or'}).build_analyzer()(str(df['phrase']))
result = collections.Counter(f).most_common(1)
print(result)
[(u'distance', 1)]

Note that the tokenizer does not perform preprocessing (lowercasing, accent stripping) or remove stop words, so you need to use the analyzer here.
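
The difference between the tokenizer and the analyzer can be seen side by side in the sketch below. It assumes the text is passed as a plain string rather than as str(df['phrase']), because stringifying a pandas Series also puts the index and the "Name: ..., dtype: ..." footer into the text. The variable names are illustrative, and the stop words are given as a list.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer

text = 'The large distance between cities. The small distance. The'
stop_words = ['the', 'a', 'if', 'in', 'it', 'of', 'or']
vectorizer = CountVectorizer(stop_words=stop_words)

# Tokenizer: splits into words only; case is kept and stop words remain.
tokens = vectorizer.build_tokenizer()(text)
tokenizer_top = Counter(tokens).most_common(1)

# Analyzer: lowercases, tokenizes, then drops the stop words.
analyzed = vectorizer.build_analyzer()(text)
analyzer_top = Counter(analyzed).most_common(1)

print(tokenizer_top)  # [('The', 3)]
print(analyzer_top)   # [('distance', 2)]
```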

You can also use stop_words='english' to remove English stop words automatically (see sklearn.feature_extraction.text.ENGLISH_STOP_WORDS for the full list).
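
As a rough illustration of the built-in list, the sketch below applies stop_words='english' to the same sentence, again assuming a plain string as input. The built-in list removes more than just "the", so only the top entry is shown here.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

text = 'The large distance between cities. The small distance. The'

# The analyzer lowercases the text and drops every built-in English stop word.
analyzer = CountVectorizer(stop_words='english').build_analyzer()
result = Counter(analyzer(text)).most_common(1)

print(result)                       # [('distance', 2)]
print('the' in ENGLISH_STOP_WORDS)  # True
```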

Solution 1
5, accepted, 2015-09-24 18:34:03

Solution 2
3, 2015-09-25 10:07:43
