Text Pre-processing Error: [Errno 21] Is a directory
I am trying to get all the files from my directory, run them through a series of functions (Python 3), and write each processed file to a certain directory. Below is my code:
import re
import glob
import sys
import string

#Create Stop_word Corpora
file1=open("/home/file/corps/stopwords.txt", 'rt', encoding='latin-1')
line= file1.read()
theWords=line.split()
stop_words=sorted(set(theWords)) # Stop Word Corpora

#Gather txt files to be processed
folder_path = "/home/file"
file_pattern = "/*txt"
folder_contents = glob.glob(folder_path + file_pattern)

#Read in the Txt Files
for file in folder_contents:
    print("Checking", file)
words= []
for file in folder_contents:
    read_file = open(file, 'rt', encoding='latin-1').read()
    words.extend(read_file.split())
def to_lowercase(words):
    #"""Convert all characters to lowercase from list of tokenized words"""
    new_words=[]
    for word in words:
        new_word=word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    #"""Remove punctuation from list of tokenized words"""
    new_words=[]
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':
        new_words.append(new_word)
    return new_words

def replace_numbers(words):
    #"""Replace all integer occurrences in list of tokenized words with textual representation"""
    new_words=[]
    for word in words:
        new_word= re.sub(" \d+", " ", word)
    if new_word !='':
        new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    #"""Remove stop words from list of tokenized words"""
    new_words=[]
    for word in words:
        if not word in stop_words:
            new_words.append(word)
    return new_words

def normalize(words):
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    return words
words = normalize(words)

# Write the new processed file to a different location
append_file=open("/home/file/Processed_Files",'a')
append_file.write("\n".join(words))
This is the error I keep receiving:
I want the new text files to be written to the directory above after they have been run through the functions. So there should be 5 new files in the Processed_Files directory above.
The traceback you present doesn't agree with the error reported in your question title.
But your code does this twice:
for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
if new_word != '':
    new_words.append(new_word)
If words is empty, then the for word in words loop never gets executed, even once. And if it doesn't get executed even once, then no value ever gets assigned to new_word. So, in that case, when your code does if new_word != '': you will get the error "new_word referenced before assignment". That is because your code is asking what is in new_word, but it is unassigned.
This problem will go away if you code it like this:
for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':
        new_words.append(new_word)
which I suspect is what you meant, anyway.
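To make the difference concrete, here is a minimal sketch that compares the two indentations side by side. The function names remove_punctuation_misindented and remove_punctuation_fixed are hypothetical and exist only for this illustration; the loop bodies are the ones discussed above.

```python
import re


def remove_punctuation_misindented(words):
    """Layout with the `if` outside the loop (the problematic version)."""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
    # Runs once, after the loop; new_word is unassigned if words is empty.
    if new_word != '':
        new_words.append(new_word)
    return new_words


def remove_punctuation_fixed(words):
    """Layout with the `if` inside the loop, so every word is checked."""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words


# The misindented version only ever looks at the last cleaned word.
misindented_result = remove_punctuation_misindented(["Hello,", "world!"])

# The fixed version keeps every non-empty cleaned word.
fixed_result = remove_punctuation_fixed(["Hello,", "world!", "..."])

# With an empty list, the misindented version fails on the unassigned name.
try:
    remove_punctuation_misindented([])
    empty_input_error = None
except UnboundLocalError as exc:
    empty_input_error = exc
```

Note that even when the list is not empty, the misindented layout silently drops every word except the last one, so it can go unnoticed until the output is inspected.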
I would suggest 3 changes:
1. Create an empty list and add all words to it:
words = []
for file in folder_contents:
    read_file = open(file, 'rt', encoding='latin-1').read()
    words.extend(read_file.split())
2. Correctly convert a list into a str:
append_file.write("\n".join(words))
3. Fix incorrect indentation:
words = normalize(words)
and
for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':
        new_words.append(new_word)
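As for the [Errno 21] in the title: that is the error open() raises when it is given the path of a directory rather than a file, and "/home/file/Processed_Files" in the question appears to be a directory. Below is a minimal sketch of writing one output file per input file. It assumes each output should keep the base name of its input (the question does not say), and process_folder and the simplified normalize are hypothetical stand-ins, not the asker's actual functions.

```python
import glob
import os
import tempfile


def normalize(words):
    """Stand-in for the question's normalize(); here it only lowercases."""
    return [word.lower() for word in words]


def process_folder(input_dir, output_dir):
    """Write one processed copy of each .txt file in input_dir to output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for in_path in sorted(glob.glob(os.path.join(input_dir, "*.txt"))):
        with open(in_path, "rt", encoding="latin-1") as in_file:
            words = in_file.read().split()
        # Build a file path inside the directory; opening the directory
        # itself is what raises "[Errno 21] Is a directory".
        out_path = os.path.join(output_dir, os.path.basename(in_path))
        with open(out_path, "w", encoding="latin-1") as out_file:
            out_file.write("\n".join(normalize(words)))
        written.append(out_path)
    return written


# Demonstration in a temporary directory rather than /home/file.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in [("a.txt", "Hello World"), ("b.txt", "Foo Bar")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="latin-1") as f:
            f.write(text)

    out_dir = os.path.join(tmp_dir, "Processed_Files")
    written_paths = process_folder(tmp_dir, out_dir)
    written_names = [os.path.basename(p) for p in written_paths]
    with open(written_paths[0], "rt", encoding="latin-1") as f:
        first_output = f.read()

    # Opening the directory itself reproduces the kind of error in the title.
    try:
        open(out_dir, "a")
        directory_error = None
    except OSError as exc:
        directory_error = exc
```

Processing each file inside the loop, instead of pooling all words into one list first, is what yields one output file per input file, which seems to be what the question asks for.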