Text Pre-processing Error: [Errno 21] Is a directory

I am trying to read all the text files in a directory, run them through a series of functions (Python 3), and write each processed file to a separate directory. Below is my code:

import re
import glob
import sys
import string

#Create Stop_word Corpora
file1=open("/home/file/corps/stopwords.txt", 'rt', encoding='latin-1')
line= file1.read()
theWords=line.split()
stop_words=sorted(set(theWords)) # Stop Word Corpora

#Gather txt files to be processed
folder_path = "/home/file"
file_pattern = "/*txt"
folder_contents = glob.glob(folder_path + file_pattern)

#Read in the Txt Files
for file in folder_contents:
    print("Checking", file)
words= []
for file in folder_contents:
    read_file = open(file, 'rt', encoding='latin-1').read()
    words.extend(read_file.split())

def to_lowercase(words):
#"""Convert all characters to lowercase from list of tokenized words"""
    new_words=[]
    for word in words:
        new_word=word.lower()
        new_words.append(new_word)
    return new_words
def remove_punctuation(words):
#"""Remove punctuation from list of tokenized words"""
    new_words=[]
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words
def replace_numbers(words):
#""""""Replace all interger occurrences in list of tokenized words with textual representation"
    new_words=[]
    for word in words:
        new_word= re.sub(" \d+", " ", word)
    if new_word !='':
        new_words.append(new_word)
    return new_words

def remove_stopwords(words):
#"""Remove stop words from list of tokenized words"""
    new_words=[]
    for word in words:
        if not word in stop_words:
            new_words.append(word)
    return new_words
def normalize(words):
    words = to_lowercase(words)

    words = remove_punctuation(words)

    words = replace_numbers(words)

    words = remove_stopwords(words)
    return words

words = normalize(words)

# Write the new procssed file to a different location
append_file=open("/home/file/Processed_Files",'a')
append_file.write("\n".join(words))

This is the error I keep receiving (the traceback was posted as an image):

I want the new text files to be written to the directory above after they have been run through the functions, so there should be 5 new files in the Processed_Files directory.

The traceback you present doesn't agree with the error reported in your question title.

But your code does this twice:

for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
if new_word != '':
    new_words.append(new_word)

If words is empty, the body of the for word in words loop never executes, not even once, so no value is ever assigned to new_word. In that case, when your code reaches if new_word != '': you get the error "new_word referenced before assignment" (an UnboundLocalError), because the code asks what is in new_word while it is still unassigned.

This problem will go away if you code it like this:

for word in words:
    new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':
        new_words.append(new_word)

which I suspect is what you meant, anyway.
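To make the difference concrete, here is a small sketch that runs both indentations side by side. The function names strip_punct_buggy and strip_punct_fixed are made up for this illustration; the loop bodies are the ones quoted above.

```python
import re

def strip_punct_buggy(words):
    """The `if` sits outside the loop, so it runs once, after the loop ends."""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
    if new_word != '':  # new_word is unassigned here when `words` is empty
        new_words.append(new_word)
    return new_words

def strip_punct_fixed(words):
    """The `if` sits inside the loop, so every word is checked and kept."""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

# An empty list triggers the error in the buggy version.
try:
    strip_punct_buggy([])
except UnboundLocalError as exc:
    print("buggy version failed:", exc)

# With a non-empty list the buggy version runs, but keeps only the last word.
print(strip_punct_buggy(["Hello,", "world!"]))
print(strip_punct_fixed(["Hello,", "world!", "..."]))
```

Note that the mis-indented version is wrong even when it does not crash: it silently drops every word except the last one.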

I would suggest 3 changes:

  1. Create an empty list and add all words to it

     words = []
     for file in folder_contents:
         read_file = open(file, 'rt', encoding='latin-1').read()
         words.extend(read_file.split())

  2. Correctly convert a list into a str

     append_file.write("\n".join(words))

  3. Fix incorrect indentation

     words = normalize(words)

    and

     for word in words:
         new_word = re.sub(r'[^\w\s]', '', word)
         if new_word != '':
             new_words.append(new_word)
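As for the error in the title: the question describes Processed_Files as a directory, and calling open() on a directory path is what raises [Errno 21] Is a directory on Linux. One way to get a separate output file per input file is to process the files one at a time and build an output path inside that directory. The following is only a sketch under assumptions: process_folder is a hypothetical helper, the output is written as UTF-8, and a simple lowercasing lambda stands in for the question's normalize function.

```python
import tempfile
from pathlib import Path

def process_folder(src_dir, out_dir, transform):
    """Apply `transform` to the words of each .txt file in src_dir and
    write the result to a file of the same name inside out_dir."""
    src = Path(src_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # make sure the directory exists
    written = []
    for path in sorted(src.glob("*.txt")):
        if not path.is_file():
            continue
        words = path.read_text(encoding="latin-1").split()
        target = out / path.name  # a file inside the directory, not the directory itself
        target.write_text("\n".join(transform(words)), encoding="utf-8")
        written.append(target)
    return written

# Demo in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "a.txt").write_text("Hello World", encoding="latin-1")
    (src / "b.txt").write_text("Foo Bar", encoding="latin-1")
    written = process_folder(src, src / "Processed_Files",
                             lambda words: [w.lower() for w in words])
    names = [p.name for p in written]
    first = written[0].read_text(encoding="utf-8")

print(names)
print(first)
```

With the question's own paths, the call would look like process_folder("/home/file", "/home/file/Processed_Files", normalize), assuming normalize is defined as in the question. Because the glob is not recursive, files already written to Processed_Files are not picked up again on a later run.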
