Remove all punctuation, spaces, and other non-letter characters, including digits, from a text file

Question

I downloaded a book from Project Gutenberg and saved it as a text file. I started with the code below as initial steps.

Book_name = 'Animals.txt'                       
fd = open(Book_name, encoding='utf8')        
Animals = fd.read()                            
print (type(Animals), len(Animals))
words = Animals.split()
print(type(words), len(words))
fd.close()

I read in the book I chose (the text file), then did the following:

def remove_punc(string):
    punc = '''!()-[]{};:'"\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''
    for ele in string:
        if ele in punc:
            string = string.replace(ele, "")
    return string


try:
    with open(filename,'r',encoding="utf-8") as f:
        data = f.read()
    with open(filename,"w+",encoding="utf-8") as f:
        f.write(remove_punc(data))
    print("Removed punctuations from the file", filename)

It didn't work, so I couldn't proceed with the rest.

2nd solution below:

Answer 1

Wouldn't it be easier like this?

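from string import digits

import nltk

# yourfile is assumed to hold the text of the book as a single string
tokenizer = nltk.RegexpTokenizer(r"\w+")
clean_text = tokenizer.tokenize(yourfile)   # list of word tokens, punctuation dropped
my_string = " ".join(clean_text)
# Python 3: str.translate takes a table; translate(None, digits) only worked on Python 2
newstring = my_string.translate(str.maketrans("", "", digits))
print(newstring)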

That is, instead of removing what you don't want, keep what you want. You get your list of words, join it into a string, then remove the digits from the string with the translate method.
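
The same "keep what you want" idea can be sketched with only the standard library, assuming plain ASCII letters are the goal. The function name letters_only_words and the sample string are made up for illustration; re.findall with [A-Za-z]+ returns each run of letters, so digits, punctuation, and whitespace never make it into the result.

```python
import re


def letters_only_words(text):
    """Return the runs of ASCII letters found in text (hypothetical helper)."""
    # [A-Za-z]+ matches one or more consecutive ASCII letters; everything
    # else (digits, punctuation, whitespace) acts as a separator.
    return re.findall(r"[A-Za-z]+", text)


sample = "Chapter 12: The fox's den, 3 miles away!"
words = letters_only_words(sample)
print(words)           # list of letter-only tokens
print("".join(words))  # the same tokens with no separators at all
```

Joining with " " instead of "" would keep the words separated, which is usually more convenient for later word counts.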

Answer 2

So if I understand you correctly, you want to remove literally every character except A-Z and a-z?

import re
pattern = re.compile('[^A-Za-z]')
data = ''
with open(filename,'r',encoding="utf-8") as f:
    data = pattern.sub('', f.read())
with open(filename,"w+",encoding="utf-8") as f:
    f.write(data)
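
To see what the pattern in this answer does, here is a small check on an in-memory string rather than a file. Note that [^A-Za-z] also strips accented and non-Latin letters. If those should survive, one alternative (an assumption about the goal, not part of the answer above) is to filter with str.isalpha. The sample text and variable names are illustrative.

```python
import re

pattern = re.compile(r"[^A-Za-z]")

# "\u00e9" is the precomposed letter e with an acute accent.
sample = "Caf\u00e9 No. 7,\nopen 24/7!"

# Everything that is not an ASCII letter is removed, including the accented e.
ascii_only = pattern.sub("", sample)

# Alternative: keep any character that Unicode classifies as a letter.
unicode_letters = "".join(ch for ch in sample if ch.isalpha())

print(ascii_only)
print(unicode_letters)
```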

Answer 3

You can use the translate() method. First prepare a translation table that removes punctuation. Then apply it directly to your input data when writing the output.

punc = '''!()-[]{};:'"\\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''

removePunctuation = str.maketrans('','',punc)   # translation table

with open(filename,'r',encoding="utf-8") as f:
    data = f.read()

with open(filename,"w+",encoding="utf-8") as f:
    f.write(data.translate(removePunctuation))  # use translate directly

print("Removed punctuations from the file", filename)

You seem to want to exclude more characters than mere punctuation, but you can get most of these characters from the string module:

import string

punc = ' ' + string.punctuation + string.digits + "your extra chars"
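
As a rough sketch of how that combined string could feed the translation table from earlier in this answer, the snippet below uses string.whitespace instead of a single space so that tabs and newlines are removed too. The extra characters and the sample text are placeholders chosen for illustration.

```python
import string

# Extra characters beyond ASCII punctuation; adjust to the book at hand.
extra = "“”£§"

# Characters to delete: whitespace, ASCII punctuation, digits, and extras.
punc = string.whitespace + string.punctuation + string.digits + extra
remove_table = str.maketrans("", "", punc)

sample = "“Lions” cost £3, see §2.\n"
cleaned = sample.translate(remove_table)
print(cleaned)
```

Building the table once and reusing it avoids the repeated str.replace calls from the question.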

Answer 1
0 2021-11-17 19:10:42

Answer 2
0 2021-11-17 19:10:44

Answer 3
0 2021-11-17 19:12:12
