How to use NLTK to find the frequency distribution of specific words in a csv file
I am just starting out with Python and NLTK, and I am trying to read records from a csv file and determine the frequency of specific words across all records. I can do something like this:
import csv
import nltk

with f:
    reader = csv.reader(f)
    # Skip the header
    next(reader)
    for row in reader:
        note = row[4]
        tokens = note.split()
        # Calculate the row's frequency distribution
        freq = nltk.FreqDist(tokens)
        for key, val in freq.items():
            print(str(key) + ':' + str(val))
    # Plot the results
    freq.plot(20, cumulative=False)
I am not sure how to modify this so that the frequency is computed across all records and only the words I am interested in are included. Apologies if this is a really simple question.
You can define the counter outside the loop with freq_all = nltk.FreqDist(), then update it on each row with freq_all.update(tokens):
with f:
    reader = csv.reader(f)
    # Skip the header
    next(reader)
    freq_all = nltk.FreqDist()
    for row in reader:
        note = row[4]
        tokens = note.split()
        # Calculate the per-row frequency distribution
        freq = nltk.FreqDist(tokens)
        freq_all.update(tokens)
        for key, val in freq.items():
            print(str(key) + ':' + str(val))
        # Plot the row's results
        freq.plot(20, cumulative=False)
    # Plot the overall results
    freq_all.plot(20, cumulative=False)
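To also restrict the counts to "only the words that I am interested in", you can filter the tokens before updating the counter. A minimal sketch of that idea, where words_of_interest and the inline sample rows are placeholders you would replace with your own word list and csv reader:

```python
import nltk

# Hypothetical set of target words; replace with your own list.
words_of_interest = {'pain', 'fever', 'cough'}

# Stand-in for rows read from the csv file (header plus two records,
# with the note text in column index 4, as in the question).
rows = [
    ['id', 'a', 'b', 'c', 'note'],
    ['1', '', '', '', 'mild fever and cough'],
    ['2', '', '', '', 'cough persists no fever'],
]

freq_all = nltk.FreqDist()
for row in rows[1:]:          # skip the header
    tokens = row[4].lower().split()
    # Keep only the words of interest before updating the counter.
    freq_all.update(t for t in tokens if t in words_of_interest)

for word in sorted(words_of_interest):
    print(word + ':' + str(freq_all[word]))
```

Because FreqDist is a subclass of collections.Counter, looking up a word that never occurred simply returns 0, so every word of interest can be printed even if it was absent from all records.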