Removing stopwords and other tasks on Python

So, I have been given a .txt file (name: newtext) containing a novel and a .txt file (name: stopwords) containing a list of stopwords. I have to work on these two (without importing any other processing tools such as NLTK etc.) and I need to perform these tasks:

  • identify words
  • remove stop words using the list of “stop words”
  • determine the frequency of occurrence of each word after the stop-word removal
  • print out the top ten most frequent ones, together with their frequency counts

I am really lost.

I'll give you some hints:

  1. Once you have opened the txt file with the text, you can iterate over the text word by word.
  2. You can store the stop words, which you read from the stop_word.txt file, in a list.
  3. During the iteration, if the word you are currently looking at is not a stop word, save this word to a new string. (This removes all stop words.)
  4. After you have created your new string without stopwords (don't forget to add a space between the words you're adding to the new string), you can split the new string and count all occurrences like this:


word_count = {}  # word -> number of occurrences
for word in new_words.split(" "):
    if word not in word_count:
        word_count[word] = 1
    else:
        word_count[word] += 1

for word in word_count.keys():
    print(f"Number of occurrences of {word} was {word_count[word]}.")
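The four hints above can be put together as one small function. This is only a sketch: the function name and the sample sentence are mine, and the lowercasing is an extra assumption so that "The" and "the" count as the same word.

```python
def count_non_stop_words(text, stop_words):
    """Hints 1-3: iterate word by word, skipping stop words,
    and build a new string; hint 4: split it and count occurrences."""
    new_words = ""
    for word in text.split():
        if word.lower() not in stop_words:
            new_words += word.lower() + " "

    word_count = {}
    for word in new_words.split():
        word_count[word] = word_count.get(word, 0) + 1
    return word_count

counts = count_non_stop_words("The cat sat on the mat", ["the", "on"])
print(counts)  # → {'cat': 1, 'sat': 1, 'mat': 1}
```

The final `split()` (with no argument) quietly drops the trailing space left over from building the string.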

Anyhow, I thought of adding an answer for this. It's bare logic, which I did not test. Hopefully it will come in handy!

with open("newtext.txt", "r") as f:
    novel = f.read()
with open("stopwords.txt", "r") as f:
    s_words = [x.strip() for x in f]

# identify all words in the novel (split() with no argument
# also handles newlines, tabs and repeated spaces)
all_words = novel.split()

# remove stop words using the list of “stop words”
no_stop_words = [x for x in all_words if x not in s_words]

# determine frequencies of occurrence for each word after the stop-word removal
# (fine for one novel, though count() inside the comprehension rescans the list per word)
frequencies = {word: no_stop_words.count(word) for word in no_stop_words}

# Print out the top ten most frequent ones, together with their frequency counts
for word, frequency in sorted(frequencies.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(word, frequency)
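If the standard library is allowed under the "no other processing tools" rule (as I read it, only packages like NLTK are off-limits), `collections.Counter` collapses the counting and top-ten steps. The sample sentence below just stands in for the novel, since the actual files aren't available here:

```python
from collections import Counter

# stand-ins for newtext.txt and stopwords.txt
novel = "the quick brown fox jumps over the lazy dog the fox"
s_words = ["the", "over"]

# same filtering as above, then let Counter do the frequency work
no_stop_words = [w for w in novel.split() if w not in s_words]
top_ten = Counter(no_stop_words).most_common(10)

for word, frequency in top_ten:
    print(word, frequency)
```

`most_common(10)` returns the ten highest-count (word, count) pairs already sorted in descending order, replacing the manual `sorted(..., reverse=True)[:10]`.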
