
Reading, editing, then saving over a text (.txt) file as a list

I am new to Python, so I could use a lot of help here! My goal is to take an article, filter out all of the trash words, and eventually import the results into Excel so I can do some text analysis. As it stands, the articles are too long to paste into a single cell due to size limitations. I have the following code:

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

article = open(filename, 'w')

letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          str(article))

stop_words = set(stopwords.words('english')) 

# Tokenize the article: tokens
tokens = word_tokenize(letters_only)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove the stop words: filtered_sentence
filtered_sentence = [w for w in alpha_only if w not in stop_words]

article.write(str(filtered_sentence))

The problem I am having is that when I try to write the file, the code basically deletes all of the text and replaces it with nothing. If there is an easier way of prepping a file for machine learning, or of just stripping a file of stop words and saving over it, I would appreciate hearing it.

You didn't provide all of your code (read is not called anywhere), so to really help you we would need more context. Still, I will try to help with what you provided.

If you load your article from the web, I advise you to keep it as a plain string (i.e. without saving it to a file first), clean out what you don't want, then save it.

Otherwise, if you load it from a file, you may prefer to save the cleaned article to another file and then remove the original. That way you don't lose data if something goes wrong.
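A minimal sketch of that safer pattern (the paths are placeholders, and the cleaning step here is just an example):

import os, re

original = 'article.txt'                  # placeholder paths
cleaned_path = 'article_cleaned.txt'

with open(original, 'r') as f:            # read the original text first
    text = f.read()

cleaned = re.sub("[^a-zA-Z]", " ", text)  # example cleaning step

with open(cleaned_path, 'w') as f:        # write the cleaned text to the new file
    f.write(cleaned)

os.remove(original)                       # only remove the original once the copy is safely written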

Here, your code erases the content of the file because of the 'w' flag, and then writes nothing useful back:

'w' -> Truncate file to zero length or create text file for writing. The stream is positioned at the beginning of the file.
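So with your code, the article is emptied the moment you open it, before anything is read. The usual order is read first, then reopen for writing; a minimal sketch reusing your filename variable (cleaned_text stands in for whatever cleaned string you produce):

article = open(filename, 'r')   # 'r' opens for reading and leaves the contents intact
raw_text = article.read()
article.close()

# ... clean raw_text into cleaned_text here ...

article = open(filename, 'w')   # only now truncate the file and write the result
article.write(cleaned_text)
article.close()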

Also, filtered_sentence is a list of strings; you can't convert it to a single string like that:

article.write(str(filtered_sentence))

You should do the following instead:

article.write(" ".join(filtered_sentence))

You may also consider using the with statement, which closes the file automatically; you don't seem to close it anywhere.

with open(filename, 'w') as article:
    article.write(" ".join(filtered_sentence))

Since you added more context in the comments on my previous answer, here is a full rewrite.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textract import process
import urllib2, os, re

response = urllib2.urlopen('http://www.website.com/file.pdf') # Request the file from the internet
tmp_path = 'tmp_path/file.pdf'

with open(tmp_path, 'wb') as tmp_file: # Open a temporary pdf file
    tmp_file.write(response.read())    # Write the request content (aka the pdf file data)

text = process(tmp_path)               # Extract text from the pdf
text = re.sub("[^a-zA-Z]", " ", text)  # Replace all non-letter characters with spaces

os.remove(tmp_path)                    # Remove the temporary pdf file

words = word_tokenize(text)

# words = [t.lower() for t in words if t.isalpha()]
# Is the isalpha() check above still useful, given that the re.sub call already removed all non-letters?
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in words if w not in stop_words]

with open("path/to/your/article.txt", 'w') as article: # Open the destination file
    article.write(" ".join(filtered_sentence))         # Write all the words separated by spaces

/!\ I don't have any Python environment to test this (smartphones... meh), but it should work fine. If any error occurs, please report it and I will correct it.
