How to count the frequency of words in a text using nltk

I have a Python script that reads a text and applies preprocessing functions in order to analyze it.
The problem is that I want to count the frequency of words, but the script crashes and displays the error below.

File "F:\AIenv\textAnalysis\setup.py", line 208, in tag_and_save
    file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
TypeError: tuple indices must be integers or slices, not str

I am trying to count the frequency and then write it to a text file.

def get_freq(tagged):
    freqs = FreqDist(tagged)
    for word, freq in freqs.items():
        print(word, freq)
    result = word,freq
    return result

def tag_and_save(tagger,text,path):
    clt = clean_text(text)
    tagged_data = tagger.tag(clt)

    freq_tagged_data = get_freq(tagged_data)
    file = open(path,"w",encoding = "UTF8")
    for word,tag in tagged_data:
        file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
    file.close()

I expect output like this:

('*****/DTNN') 3


Based on the answer, I changed the function get_freq() to:

def get_freq(tagged):
    freq_dist = {}
    freqs = FreqDist(tagged)
    freq_dist = [(word, freq) for word ,freq in freqs.items()]
    return freq_dist

But now it displays the error below:

File "F:\AIenv\textAnalysis\setup.py", line 217, in tag_and_save
    file.write(word+"/"+tag+" (frequency="+str(freq_tagged_data[word])+")\n")
TypeError: list indices must be integers or slices, not str

How can I fix this error, and what should I do?

Maybe this will help.

import nltk

text = "An an valley indeed so no wonder future nature vanity. Debating all she mistaken indulged believed provided declared. He many kept on draw lain song as same. Whether at dearest certain spirits is entered in to. Rich fine bred real use too many good. She compliment unaffected expression favourable any. Unknown chiefly showing to conduct no."
# Split on whitespace; punctuation stays attached to words (e.g. "vanity.")
tokens = text.split()
# FreqDist maps each distinct token to the number of times it occurs
freqs = nltk.FreqDist(tokens)
# Convert to a list of (token, count) pairs
blah_list = list(freqs.items())
print(blah_list)

This snippet counts the word frequency.
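Note that splitting on whitespace counts "An" and "an" as different tokens and keeps trailing punctuation attached. If that is not wanted, one option is to normalize the text first. The sketch below assumes a simple regular expression is good enough as a tokenizer (a proper tokenizer such as nltk.word_tokenize may be a better fit for real text), uses a shortened sample string, and falls back to collections.Counter when nltk is not installed.

```python
import re

try:
    from nltk import FreqDist
except ImportError:
    # FreqDist subclasses collections.Counter, so Counter works as a stand-in
    from collections import Counter as FreqDist

# Shortened sample text, for illustration only
text = "An an valley indeed so no wonder future nature vanity. Rich fine bred real use too many good."

# Lowercase the text, then keep runs of letters and apostrophes as tokens
tokens = re.findall(r"[a-z']+", text.lower())
freqs = FreqDist(tokens)

# most_common(n) returns the n most frequent (token, count) pairs
top = freqs.most_common(1)
print(top)
```

With this normalization, "An" and "an" are counted together, and "vanity." is stored as "vanity".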

Edit: Code is now working.
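As for the error in the question: both versions of get_freq() return a sequence (a tuple in the first version, a list in the second), and neither can be indexed with a string such as word. One possible fix, sketched here, is to return the FreqDist itself, which behaves like a dictionary, and to look up the same (word, tag) pairs that were counted. The tagged_data list is made-up sample data standing in for the output of tagger.tag(clt), write_frequencies is a hypothetical name, and an in-memory buffer replaces the output file.

```python
import io

try:
    from nltk import FreqDist
except ImportError:
    # FreqDist subclasses collections.Counter, so Counter works as a stand-in
    from collections import Counter as FreqDist


def get_freq(tagged):
    # Return the dict-like FreqDist itself, keyed by (word, tag) pairs
    return FreqDist(tagged)


def write_frequencies(tagged_data, file):
    # Hypothetical helper: writes one line per distinct (word, tag) pair
    freq_tagged_data = get_freq(tagged_data)
    for (word, tag), freq in freq_tagged_data.items():
        file.write(f"{word}/{tag} (frequency={freq})\n")


# Made-up sample standing in for tagger.tag(clt)
tagged_data = [("cat", "NN"), ("sat", "VBD"), ("cat", "NN")]

buffer = io.StringIO()  # stands in for open(path, "w", encoding="UTF8")
write_frequencies(tagged_data, buffer)
output = buffer.getvalue()
print(output)
```

Iterating over freq_tagged_data.items() writes each distinct pair once. To keep one line per occurrence as in the original loop, iterate over tagged_data instead and look up freq_tagged_data[(word, tag)].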
