
Removing all stopwords defined in a file from a text in another file (Python)

I have two text files:

  1. stopwords.txt --> contains stop words, one per line
  2. text.txt --> a big document file

I'm trying to remove all occurrences of stop words (any word in the stopwords.txt file) from the text.txt file without using NLTK (it's a school assignment).

How would I go about doing this? This is my code so far.

import re

with open('text.txt', 'r') as f, open('stopwords.txt','r') as st:
    f_content = f.read()
    #splitting text.txt on non-alphabetic characters
    processed = re.split('[^a-zA-Z]', f_content)

    st_content = st.read()
    #splitting stopwords.txt by new line
    st_list = re.split('\n', st_content)
    #print(st_list) to check it was working

    #what I'm trying to do is: traverse through the text. If stopword appears, 
    #remove it. otherwise keep it. 
    for word in st_list:
        f_content = f_content.replace(word, "")
        print(f_content) 

but when I run the code, it takes forever to output anything, and when it does, it just outputs the entire text file. (I'm new to Python, so let me know if I'm doing something fundamentally wrong!)

Here is what I use when I need to remove English stop words. I usually use the stopwords corpus from nltk rather than my own file.

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()

## Remove stop words (assumes `text` is already a list of word tokens)
stops = set(stopwords.words("english"))
text = [ps.stem(w) for w in text if w not in stops and len(w) >= 3]
text = list(set(text))  # remove duplicates
text = " ".join(text)

For your special case I would do something like:

with open('stopwords.txt') as st:
    stops = set(line.strip() for line in st)

Let me know if this answers your question; I'm not sure whether the problem is reading from the file or the stemming.

Edit: To remove all stopwords defined in a file from the text in another file, we can use str.replace():

for word in st_list:
    f_content = f_content.replace(word, "")
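
Note that str.replace matches raw substrings, so removing 'the' would also mangle 'there' and 'other'. A hedged alternative is a word-boundary regex, using the standard library's re.escape and \b (variable names carried over from the question's code):

import re

for word in st_list:
    if word:  # skip empty entries left by trailing newlines
        # \b anchors the match at word boundaries, so 'the' won't hit 'there'
        f_content = re.sub(r'\b' + re.escape(word) + r'\b', '', f_content)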

Since you are facing performance issues, I would suggest using the subprocess module (available in both Python 2 and Python 3) to call the sed Linux command.
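
A minimal sketch of that call in Python 3 (assuming sed is on the PATH and a stopwords.sed script in the format described below):

import subprocess

# write sed's output to a new file rather than editing text.txt in place
with open('output_file.txt', 'w') as out:
    subprocess.run(['sed', '-f', 'stopwords.sed', 'text.txt'],
                   stdout=out, check=True)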

I know Python is really good for this kind of thing (and many others), but if you have a really big text.txt, I would try the old, ugly, and powerful command-line 'sed'.

Try something like:

sed -f stopwords.sed text.txt > output_file.txt

For the stopwords.sed file, each stopword must be on its own line, in the format below:

s|\<xxxxx\>||g

Where 'xxxxx' is the stopword itself. For example:

s|\<the\>||g

The line above removes all occurrences of 'the' (without the single quotes).
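
If you don't want to write stopwords.sed by hand, a small hypothetical helper can generate it from the question's stopwords.txt:

# generate a sed script: one s|\<word\>||g command per stopword
with open('stopwords.txt') as src, open('stopwords.sed', 'w') as dst:
    for line in src:
        word = line.strip()
        if word:
            dst.write('s|\\<%s\\>||g\n' % word)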

Worth a try.

I think this kind of worked, but it's incredibly slow, so if anyone has suggestions on how to make it more efficient, I'd really appreciate it!

import re
from stemming.porter2 import stem as PT

with open('text.txt', 'r') as f, open('stopwords.txt', 'r') as st:
    f_content = f.read()
    #split on non-alphabetic characters, then lowercase and stem each token
    processed = re.split('[^a-zA-Z]', f_content)
    processed = [x.lower() for x in processed if x]  #drop empty tokens from consecutive delimiters
    processed = [PT(x) for x in processed]
    #print(processed)

    st_content = st.read()
    st_list = set(st_content.split())

    clean_text = [x for x in processed if x not in st_list]
    print(clean_text)
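
One hedged suggestion on the slowness: the pure-Python stemmer is the likely bottleneck, and a big document repeats the same words many times, so caching stem results means each distinct word is stemmed only once. A minimal sketch using functools.lru_cache from the standard library (file names carried over from the code above):

import re
from functools import lru_cache
from stemming.porter2 import stem

@lru_cache(maxsize=None)
def cached_stem(word):
    #repeated words hit the cache instead of re-running the stemmer
    return stem(word)

with open('text.txt', 'r') as f, open('stopwords.txt', 'r') as st:
    tokens = [t.lower() for t in re.split('[^a-zA-Z]', f.read()) if t]
    st_list = set(st.read().split())
    stemmed = (cached_stem(t) for t in tokens)
    clean_text = [w for w in stemmed if w not in st_list]
    print(clean_text)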
