
Removing all stopwords defined in a file from a text in another file (Python)

I have two text files:

  1. Stopwords.txt --> contains stop words, one per line
  2. text.txt --> a big document file

I'm trying to remove all occurrences of stopwords (any word in the stopwords.txt file) from the text.txt file without using NLTK (school assignment).

How would I go about doing this? This is my code so far.

import re

with open('text.txt', 'r') as f, open('stopwords.txt','r') as st:
    f_content = f.read()
    #splitting text.txt by non alphanumeric characters
    processed = re.split('[^a-zA-Z]', f_content)

    st_content = st.read()
    #splitting stopwords.txt by new line
    st_list = re.split('\n', st_content)
    #print(st_list) to check it was working

    #what I'm trying to do is: traverse through the text. If stopword appears, 
    #remove it. otherwise keep it. 
    for word in st_list:
        f_content = f_content.replace(word, "")
        print(f_content) 

but when I run the code, it takes forever to output anything, and when it does it just prints the entire text file. (I'm new to Python, so let me know if I'm doing something fundamentally wrong!)
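As an aside on why the loop above is slow and over-deletes (this sketch is mine, not from the question): printing the entire document on every iteration dominates the runtime, and str.replace also strips stopwords that occur inside other words (removing "the" mangles "them"). A word-boundary regex avoids both problems:

```python
import re

def remove_stopwords(text, stopwords):
    """Delete every whole-word occurrence of any stopword from text."""
    # re.escape guards against stopwords containing regex metacharacters;
    # \b anchors the match at word boundaries so 'the' never matches 'them'.
    pattern = re.compile(
        r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b',
        flags=re.IGNORECASE,
    )
    return pattern.sub('', text)

# Tiny inline sample standing in for text.txt / stopwords.txt:
print(remove_stopwords("The cat sat on a mat; they left.", {'the', 'a'}))
```

Print the result once, after all replacements, instead of inside the loop.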

Here is what I use when I need to remove English stop words. I usually also use the corpus from nltk instead of my own file for stop words.

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
ps = PorterStemmer()

## Remove stop words ('text' is assumed to already be a list of word tokens)
stops = set(stopwords.words("english"))
text = [ps.stem(w) for w in text if w not in stops and len(w) >= 3]
text = list(set(text))  # remove duplicates (note: loses word order)
text = " ".join(text)

For your special case I would do something like:

stops = list_of_words_from_file

Let me know if I answered your question; I am not sure if the problem is the read from the file or the stemming.
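A minimal sketch of that step (the one-word-per-line format is from the question; the function name is my own): load the stopword file into a set once, so each membership test is O(1):

```python
def load_stopwords(path):
    """Read one stopword per line into a lowercase set for O(1) lookups."""
    with open(path) as fh:
        return {line.strip().lower() for line in fh if line.strip()}

# Usage with the file from the question:
# stops = load_stopwords('stopwords.txt')
```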

Edit: To remove all stopwords defined in a file from a text in another file, we can use str.replace():

for word in st_list:
    f_content = f_content.replace(word, "")  # replace() needs the replacement string as a second argument

Since you are facing performance issues, I would suggest using the subprocess library (available in both Python 2 and Python 3) to call the sed Linux command.
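A hedged sketch of that idea (the file names follow the question and the sed answer below; the function name and check=True are my own choices), assuming GNU sed is on the PATH:

```python
import subprocess

def run_sed(script_path, input_path, output_path):
    """Equivalent of the shell command: sed -f script input > output"""
    with open(output_path, 'w') as out:
        # check=True raises CalledProcessError if sed exits non-zero
        subprocess.run(['sed', '-f', script_path, input_path],
                       stdout=out, check=True)

# Usage with the names from the answers:
# run_sed('stopwords.sed', 'text.txt', 'output_file.txt')
```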

I know Python is really good for this kind of thing (and many others), but if you have a really big text.txt, I would try the old, ugly, and powerful command-line sed.

Try something like:

sed -f stopwords.sed text.txt > output_file.txt

For the stopwords.sed file, each stopword must be on its own line, using the format below:

s|\<xxxxx\>||g

Where 'xxxxx' would be the stopword itself.

s|\<the\>||g

The line above would remove all occurrences of 'the' (without single quotes).
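Generating that stopwords.sed from the stopwords.txt in the question can be scripted; this is my own sketch, with a hypothetical function name:

```python
def build_sed_script(stopwords_path, sed_path):
    r"""Write one sed command  s|\<word\>||g  per stopword."""
    with open(stopwords_path) as src, open(sed_path, 'w') as dst:
        for line in src:
            word = line.strip()
            if word:
                dst.write('s|\\<%s\\>||g\n' % word)

# Usage: build_sed_script('stopwords.txt', 'stopwords.sed')
```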

Worth a try.

I think this kind of worked... but it's incredibly slow, so if anyone has any suggestions on how to make it more efficient, I'd really appreciate it!

import re
from stemming.porter2 import stem as PT


with open('text.txt', 'r') as f, open('stopwords.txt','r') as st:

    f_content = f.read()
    processed = re.split('[^a-zA-Z]', f_content)
    processed = [x.lower() for x in processed]
    processed = [PT(x) for x in processed]
    #print(processed)

    st_content = st.read()
    st_list = set(st_content.split())

    clean_text = [x for x in processed if x not in st_list]
    print(clean_text)
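One likely culprit (my own observation, not from the thread): re.split('[^a-zA-Z]', ...) produces an empty string for every run of consecutive non-letters, and each of those empties is then lower-cased and stemmed. Filtering empties and stopwords before stemming, and de-duplicating as you go, could look like this sketch (the identity default stands in for the third-party stemmer so the sketch is self-contained):

```python
import re

def clean(text, stopwords, stemmer=lambda w: w):
    """Tokenize, drop empties and stopwords first, then stem and dedupe.

    'stemmer' defaults to the identity function so this sketch runs
    without the stemming package; pass stemming.porter2.stem to match
    the code above.
    """
    # '+' collapses runs of non-letters so they yield one split, not many
    tokens = (t.lower() for t in re.split('[^a-zA-Z]+', text) if t)
    seen, out = set(), []
    for t in tokens:
        if t in stopwords:
            continue  # skip stopwords before paying for stemming
        s = stemmer(t)
        if s not in seen:  # dedupe while keeping first-seen order
            seen.add(s)
            out.append(s)
    return out

print(clean("The cat, the hat!", {'the'}))
```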
