Read, edit, and then save a text (.txt) file as a list

Question

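I'm new to Python, so I could use a lot of help here! My goal is to take an article, filter out all the junk words, and then eventually import what is left into Excel so I can do some text analysis. As it stands, the article is too long to copy into a single cell because of the size limit. I have the following code: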

article = open(filename, 'w')

letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          str(article))

stop_words = set(stopwords.words('english')) 

# Tokenize the article: tokens
tokens = word_tokenize(letters_only)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

filtered_sentence = [w for w in alpha_only if not w in stop_words] 

filtered_sentence = [] 

for w in alpha_only: 
    if w not in stop_words: 
        filtered_sentence.append(w)

article.write(str(filtered_sentence))

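The problem I'm running into is that when I try to write to the file, the code basically deletes all the text and writes nothing back over it. If there is an easier way to prepare a file for machine learning, and/or to strip the stop_words out of a file and save it, I would greatly appreciate it.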

Answer 1

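You haven't posted all of your code: nothing in it ever reads the article, so we would need more context to help you properly. I'll still try to help with what you have provided.

If you load your article from the web, I advise you to keep it as a plain string (that is, don't save it to a file first), clean out what you don't want, and only then save it.

Otherwise, if you load it from a file, you may prefer to save the cleaned article in another file and then remove the original one. That way you cannot lose data if something goes wrong halfway.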

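Here, your code deletes the file's content because of the 'w' flag, and then writes nothing useful back to the file:

'w' -> Truncate file to zero length or create text file for writing. The stream is positioned at the beginning of the file.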
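
To see the truncation in isolation, here is a minimal sketch. It works on a throwaway temporary file (the name demo.txt is only for illustration), so none of your own files are touched. It also shows why str(article) in the question can never contain the article text: calling str() on a file object returns the object's repr, not the file's contents.

```python
import os
import tempfile

# Create a scratch file with some content (hypothetical path, for illustration only).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")
with open(path, "w") as f:
    f.write("some article text")

size_before = os.path.getsize(path)

# Opening with 'w' truncates the file immediately, before anything is written.
article = open(path, "w")
size_after_open = os.path.getsize(path)

# str() on the file object gives its repr, not the file's contents.
as_string = str(article)
article.close()

print(size_before, size_after_open)
print(as_string)
```

If you need the text, open the file with 'r' and call .read() on it before you open anything for writing.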

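Also, filtered_sentence is a list of strings, so str() will not turn it into the text you want. It writes the list's repr, brackets, quotes and commas included: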

article.write(str(filtered_sentence))

You should do the following instead:

article.write(" ".join(filtered_sentence))
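
The difference between the two calls is easy to check in an interpreter. A quick sketch, using three made-up tokens in place of your real filtered_sentence:

```python
filtered_sentence = ["quick", "brown", "fox"]  # made-up sample tokens

# str() returns the list's repr: brackets, quotes and commas included.
as_repr = str(filtered_sentence)

# " ".join() concatenates the words with single spaces between them.
as_text = " ".join(filtered_sentence)

print(as_repr)  # ['quick', 'brown', 'fox']
print(as_text)  # quick brown fox
```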

You might also consider using a with statement, which closes the file automatically, something your code does not appear to do.

with open(filename, 'w') as article:
    article.write(" ".join(filtered_sentence))
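
Putting the points above together, here is a rough sketch of the read, clean, write-to-another-file flow. It is not a drop-in replacement for your code. To keep it runnable without NLTK, STOP_WORDS is a tiny made-up stand-in for set(stopwords.words('english')), str.split() stands in for word_tokenize, and clean_text, clean_file and the file names are hypothetical.

```python
import os
import re
import tempfile

# Stand-in for set(stopwords.words('english')): a tiny hypothetical list
# so that the sketch runs without NLTK installed.
STOP_WORDS = {"the", "a", "is", "of", "and", "over"}

def clean_text(text):
    """Keep letters only, lowercase everything, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()  # split() used here instead of word_tokenize
    return [w for w in words if w not in STOP_WORDS]

def clean_file(src_path, dst_path):
    """Read src_path, clean it, and write the result to a different file."""
    with open(src_path, "r") as src:   # 'r' reads without truncating
        text = src.read()
    words = clean_text(text)
    with open(dst_path, "w") as dst:   # 'w' is safe here: the source is untouched
        dst.write(" ".join(words))
    return words

# Usage with scratch files (paths are illustrative only).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "article.txt")
dst_path = os.path.join(tmp_dir, "article_clean.txt")
with open(src_path, "w") as f:
    f.write("The quick brown fox, aged 3, jumps over the lazy dog!")

result = clean_file(src_path, dst_path)
print(result)
```

Because the cleaned text goes to a second file, the original article is still on disk if the cleaning step turns out to be wrong.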

Answer 2

Since you added more context in the comments on the previous answer, I preferred to rewrite everything.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textract import process
from urllib.request import urlopen #Python 3; on Python 2 use urllib2.urlopen
import os, re

response = urlopen('http://www.website.com/file.pdf') #Request file from internet
tmp_path = 'tmp_path/file.pdf' #The tmp_path directory must already exist

with open(tmp_path, 'wb') as tmp_file: #Open the pdf file
    tmp_file.write(response.read()) #Write the request content (aka the pdf file data)

text = process(tmp_path).decode('utf-8') #Extract text from pdf (process returns bytes)
text = re.sub("[^a-zA-Z]", " ", text) #Replace every non-alphabetic character with a space

os.remove(tmp_path) #Remove the temp pdf file

words = [w.lower() for w in word_tokenize(text)] #Lowercase so that "The" matches the lowercase stop word list

#words = [w for w in words if w.isalpha()]
#Is the above line useful, as re.sub already removed all non-alphabetic characters?
stop_words = set(stopwords.words('english'))
filtered_sentence = [w for w in words if w not in stop_words]

with open("path/to/your/article.txt", 'w') as article: #Open destination file
    article.write(" ".join(filtered_sentence)) #Write all the words separated by a space

/!\ I don't have a Python environment to test this on (smartphone... meh), but it should work fine. If any error occurs, please report it and I will correct it.
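
Since the end goal is Excel and the whole article does not fit in one cell (Excel caps a cell at 32,767 characters), one option is to write one word per row to a CSV file, which Excel opens directly. This is only a sketch under assumptions: the three tokens are made-up sample data, and the words.csv name and the "word" header are illustrative choices, not something from your setup.

```python
import csv
import os
import tempfile

filtered_sentence = ["quick", "brown", "fox"]  # made-up sample tokens

# Illustrative scratch path; replace it with your own destination.
csv_path = os.path.join(tempfile.mkdtemp(), "words.csv")

# newline="" is what the csv module documentation recommends for files passed to csv.writer.
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word"])                          # header row
    writer.writerows([w] for w in filtered_sentence)   # one word per row

# Read the file back to check what was written.
with open(csv_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Each word then lands in its own cell in column A, which keeps every cell far below the size limit.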
