Text Pre-processing Error: [Errno 21] Is a directory
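I am trying to take all the files in a directory, run them through a series of def functions (Python 3), and write each processed file to a certain directory. Here is my code: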
import re
import glob
import sys
import string

# Create Stop_word Corpora
file1 = open("/home/file/corps/stopwords.txt", 'rt', encoding='latin-1')
line = file1.read()
theWords = line.split()
stop_words = sorted(set(theWords))  # Stop Word Corpora

# Gather txt files to be processed
folder_path = "/home/file"
file_pattern = "/*txt"
folder_contents = glob.glob(folder_path + file_pattern)

# Read in the Txt Files
for file in folder_contents:
    print("Checking", file)

words = []
for file in folder_contents:
    read_file = open(file, 'rt', encoding='latin-1').read()
    words.extend(read_file.split())

def to_lowercase(words):
    # Convert all characters to lowercase from list of tokenized words
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    # Remove punctuation from list of tokenized words
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':
        new_words.append(new_word)
    return new_words

def replace_numbers(words):
    # Replace all integer occurrences in list of tokenized words with textual representation
    new_words = []
    for word in words:
        new_word = re.sub(" \d+", " ", word)
    if new_word != '':
        new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    # Remove stop words from list of tokenized words
    new_words = []
    for word in words:
        if not word in stop_words:
            new_words.append(word)
    return new_words

def normalize(words):
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    return words

words = normalize(words)

# Write the new processed file to a different location
append_file = open("/home/file/Processed_Files", 'a')
append_file.write("\n".join(words))
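This is the error I keep getting:
I want the new text files, after they have been run through the def functions, to be sent to the directory above. So there should be 5 new files in the Processed_Files directory above.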
The traceback you provided is different from the error reported in the question title.
But your code does this twice:
for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
if new_word != '':
    new_words.append(new_word)
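If words is empty, the for word in words loop never executes, not even once. And if it never executes, nothing is ever assigned to new_word. So in that case, when the code reaches if new_word != '': you get a "new_word referenced before assignment" error. That is because your code is asking for the value of new_word, but it has never been assigned.
If you code it like this, the problem goes away: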
for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':
        new_words.append(new_word)
Which is what I suspect you meant anyway.
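To make the failure mode concrete, here is a minimal sketch that reproduces it. The function names buggy_remove_punctuation and fixed_remove_punctuation are made up for this illustration; only the loop body comes from the question.

```python
import re


def buggy_remove_punctuation(words):
    # Hypothetical name: the `if` sits outside the loop, as described above.
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':  # new_word is unbound when words is empty
        new_words.append(new_word)
    return new_words


def fixed_remove_punctuation(words):
    # Hypothetical name: the `if` is indented into the loop body.
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words


# An empty list triggers the "referenced before assignment" failure.
try:
    buggy_remove_punctuation([])
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

print(error_name)
# With a non-empty list the buggy version keeps only the last word.
print(buggy_remove_punctuation(["Hello,", "world!"]))
print(fixed_remove_punctuation(["Hello,", "world!"]))
```

Note that the misplaced if does not only fail on empty input: on non-empty input it silently keeps just the last word, which is easy to miss.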
I suggest 3 changes:
Create an empty list and add all the words to it
words = []
for file in folder_contents:
    read_file = open(file, 'rt', encoding='latin-1').read()
    words.extend(read_file.split())
Convert the list to a str correctly
append_file.write("\n".join(words))
Fix the incorrect indentation
words = normalize(words)
and
for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':
        new_words.append(new_word)
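The suggestions above do not touch the error in the title. On Linux, calling open() on a path that is itself a directory raises IsADirectoryError ([Errno 21]), which is what would happen if /home/file/Processed_Files is a directory. A possible sketch, under that assumption, is to process each input file on its own and write the result to a file inside the output directory. The helper process_folder and the simplified normalize below are hypothetical stand-ins, and the demo runs on a temporary directory rather than /home/file.

```python
import glob
import os
import tempfile


def normalize(words):
    # Stand-in for the question's normalize(); here it only lowercases.
    return [word.lower() for word in words]


def process_folder(folder_path, output_dir):
    # Hypothetical helper: writes one output file per input file.
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for path in sorted(glob.glob(os.path.join(folder_path, "*.txt"))):
        with open(path, "rt", encoding="latin-1") as src:
            words = normalize(src.read().split())
        # Join the directory with a file name so open() gets a file path.
        out_path = os.path.join(output_dir, os.path.basename(path))
        with open(out_path, "w", encoding="latin-1") as dst:
            dst.write("\n".join(words))
        written.append(out_path)
    return written


# Demo on a throwaway directory with two small sample files.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "Hello World"), ("b.txt", "Foo BAR")]:
        with open(os.path.join(tmp, name), "w", encoding="latin-1") as f:
            f.write(text)
    out_dir = os.path.join(tmp, "Processed_Files")
    outputs = process_folder(tmp, out_dir)
    names = [os.path.basename(p) for p in outputs]
    with open(outputs[0], encoding="latin-1") as f:
        first_content = f.read()

print(names)
print(first_content)
```

Keeping the word list local to each file, instead of one shared words list, is what yields one processed file per input file, which matches the 5 output files the question expects.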