Reading, editing, then saving over a text (.txt) file as a list

I am new to Python, so I could use a lot of help here! My goal is to take an article and filter out all of the trash words, then eventually import the result into Excel so I can do some text analysis. As it stands, the articles are too long to copy into a single cell due to size limitations. I have the following code:

article = open(filename, 'w')

letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          str(article))

stop_words = set(stopwords.words('english')) 

# Tokenize the article: tokens
tokens = word_tokenize(letters_only)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

filtered_sentence = [w for w in alpha_only if not w in stop_words] 

filtered_sentence = [] 

for w in alpha_only: 
    if w not in stop_words: 
        filtered_sentence.append(w)

article.write(str(filtered_sentence))

The problem I am having is that when I try to write the file, the code basically deletes all of the text and writes nothing in its place. If there is an easier way of prepping a file for machine learning, and/or of just stripping a file of stop words and saving over it, I would appreciate it.

You didn't provide all of your code (read is not mentioned anywhere), so in order to help you we would need more context. I will still try to help with what you provided.

If you load your article from the web, I advise you to keep it as a plain string (i.e. without saving it to a file first), clean it of what you don't want, then save it.

Otherwise, if you load it from a file, you may prefer to save the cleaned article to another file, then remove the original one. This prevents you from losing data if something goes wrong halfway through.
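A minimal sketch of that approach (the paths here are hypothetical, and the cleaning step is a stand-in for your real one):

import os

src = "article.txt"           # hypothetical path to the original article
dst = "article_clean.txt"     # the cleaned copy is written alongside it

with open(src, "r") as f:     # read the original while it still has its content
    text = f.read()

cleaned = text.lower()        # stand-in for your real cleaning steps

with open(dst, "w") as f:     # write the cleaned text to a NEW file
    f.write(cleaned)

os.remove(src)                # delete the original only once the copy exists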

Here, your code erases the content of the file the moment it is opened, because of the 'w' flag, so there is nothing left in it to process:

'w' -> Truncate file to zero length or create text file for writing. The stream is positioned at the beginning of the file.
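There is a second problem hiding in the same snippet: str(article) does not give you the file's text, it gives you the repr of the file object itself (something like <open file '...', mode 'w' at 0x...>), so re.sub never sees the article at all. A minimal sketch of the intended order, reading before truncating (assuming filename points at your article):

# Read the contents first, while the file still has them
with open(filename, "r") as f:
    raw_text = f.read()       # raw_text is the whole article as one string

# ... clean raw_text here ...

# Only now reopen with 'w', which truncates, and write the cleaned text back
with open(filename, "w") as f:
    f.write(raw_text)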

Also, filtered_sentence is a list of strings; you can't convert it into a single clean string like that:

article.write(str(filtered_sentence))

You should do the following instead:

article.write(" ".join(filtered_sentence))
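For example, with a small list the difference looks like this:

filtered_sentence = ["quick", "brown", "fox"]

print(str(filtered_sentence))       # ['quick', 'brown', 'fox']
print(" ".join(filtered_sentence))  # quick brown fox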

You may also consider using the with statement, which closes the file automatically, something you don't seem to do:

with open(filename, 'w') as article:
    article.write(" ".join(filtered_sentence))

Since you added more context in the comments on my previous answer, I prefer to rewrite all of it.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textract import process
import urllib2, os, re

response = urllib2.urlopen('http://www.website.com/file.pdf') # Request the file from the internet
tmp_path = 'tmp_path/file.pdf'

with open(tmp_path, 'wb') as tmp_file: # Open the temporary pdf file
    tmp_file.write(response.read()) # Write the response content (aka the pdf file data)

text = process(tmp_path) # Extract the text from the pdf
text = re.sub("[^a-zA-Z]", " ", text) # Replace all non-alphabetical characters with spaces

os.remove(tmp_path) # Remove the temporary pdf file

words = word_tokenize(text)

#words = [t.lower() for t in words if t.isalpha()]
# Is the isalpha() check above still useful, given that the re.sub earlier
# already removed every non-alphabetical character? Lowercasing, however,
# may still matter: the NLTK stop words are all lowercase.
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in words if w not in stop_words]

with open("path/to/your/article.txt", 'w') as article: # Open the destination file
    article.write(" ".join(filtered_sentence)) # Write all the words separated by spaces

/!\ I don't have any Python environment to test it (smartphones... meh.), but it should work fine. If any error occurs, please report it and I will correct it.
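One caveat: urllib2 only exists on Python 2. If you are on Python 3, a rough (equally untested) port would use urllib.request instead, and textract.process returns bytes there, so the text needs decoding before re.sub:

from urllib.request import urlopen
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textract import process
import os, re

response = urlopen('http://www.website.com/file.pdf')
tmp_path = 'tmp_path/file.pdf'

with open(tmp_path, 'wb') as tmp_file:
    tmp_file.write(response.read())

text = process(tmp_path).decode('utf-8')  # textract returns bytes on Python 3
text = re.sub("[^a-zA-Z]", " ", text)
os.remove(tmp_path)

words = [w.lower() for w in word_tokenize(text)]  # lowercase so stop words match
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in words if w not in stop_words]

with open("path/to/your/article.txt", 'w') as article:
    article.write(" ".join(filtered_sentence))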
