Removing all punctuation, spaces and other non-letter characters including numbers from a text file
I downloaded a book from Project Gutenberg and saved it as a text file. I started with the code below as initial steps.
Book_name = 'Animals.txt'
fd = open(Book_name, encoding='utf8')
Animals = fd.read()
print (type(Animals), len(Animals))
words = Animals.split()
print(type(words), len(words))
fd.close()
After reading the book I chose (the text file), I did the following:
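def remove_punc(string):
    # Characters to delete: punctuation, digits, whitespace and some non-English letters
    punc = '''!()-[]{};:'"\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''
    for ele in string:
        if ele in punc:
            string = string.replace(ele, "")
    return string

filename = Book_name  # assumption: the same file that was read above
try:
    with open(filename, 'r', encoding="utf-8") as f:
        data = f.read()
    with open(filename, "w+", encoding="utf-8") as f:
        f.write(remove_punc(data))
    print("Removed punctuations from the file", filename)
except FileNotFoundError:
    # assumption: the handler is missing from the post; a try block needs one to be valid
    print("File not found:", filename)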
It didn't work, so I couldn't proceed with the rest.
Wouldn't it be easier like this?
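from string import digits
from nltk.tokenize import RegexpTokenizer

# yourfile is the text of the book as a string (here read from the question's file)
with open('Animals.txt', encoding='utf8') as f:
    yourfile = f.read()
tokenizer = RegexpTokenizer(r"\w+")
clean_text = tokenizer.tokenize(yourfile)
my_string = " ".join(clean_text)
# Python 3 needs a deletion table; translate(None, digits) only works in Python 2
newstring = my_string.translate(str.maketrans('', '', digits))
print(newstring)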
That is, instead of removing what you don't want, keep what you want. You get your list of words, join it into a string, then remove the digits from the string with the translate method.
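The same "keep what you want" idea can be sketched with the standard library alone. This is only an illustration, assuming that "words" means runs of ASCII letters; the function name letters_only_words and the sample text are made up for the example. Unlike \w+, the pattern below also drops digits and underscores, so no second pass is needed.

```python
import re

def letters_only_words(text):
    """Return the runs of ASCII letters found in text, in order."""
    return re.findall(r"[A-Za-z]+", text)

# Hypothetical sample text, not taken from the book.
sample = "Chapter 1: The cat_and the dog, 42 times!"
words = letters_only_words(sample)
print(words)            # ['Chapter', 'The', 'cat', 'and', 'the', 'dog', 'times']
print(" ".join(words))  # Chapter The cat and the dog times
```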
So if I understand you correctly, you want to remove literally every character except for A-Z and a-z?
import re

# Match every character that is not an ASCII letter
pattern = re.compile('[^A-Za-z]')
with open(filename, 'r', encoding="utf-8") as f:
    data = pattern.sub('', f.read())
with open(filename, "w+", encoding="utf-8") as f:
    f.write(data)
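Note that [^A-Za-z] also deletes accented and non-Latin letters, which matches the character list in the question. If those letters should be kept instead (an assumption, not something the question asks for), a filter based on str.isalpha() is one possible alternative. The helper names and the sample string below are hypothetical.

```python
import re

NOT_ASCII_LETTER = re.compile(r'[^A-Za-z]')

def keep_ascii_letters(text):
    """Delete everything except A-Z and a-z."""
    return NOT_ASCII_LETTER.sub('', text)

def keep_all_letters(text):
    """Delete everything that Unicode does not classify as a letter."""
    return ''.join(ch for ch in text if ch.isalpha())

# 'Café: naïve, 123!' written with escapes so the accents survive copy and paste.
sample = 'Caf\u00e9: na\u00efve, 123!'
print(keep_ascii_letters(sample))  # Cafnave
print(keep_all_letters(sample))    # Cafénaïve
```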
You can use the translate() method. First prepare a translation table that will remove punctuation. Then use it directly on your input data to write the output.
punc = '''!()-[]{};:'"\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''
removePunctuation = str.maketrans('', '', punc)  # translation table
with open(filename, 'r', encoding="utf-8") as f:
    data = f.read()
with open(filename, "w+", encoding="utf-8") as f:
    f.write(data.translate(removePunctuation))  # use translate directly
print("Removed punctuations from the file", filename)
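To try the translation-table approach without touching the downloaded book, here is a small sketch that writes the cleaned text to a separate file. The function clean_file, the file names and the sample content are assumptions made for the demonstration; only str.maketrans and str.translate come from the answer above.

```python
import os
import tempfile

def clean_file(src_path, dst_path, unwanted):
    """Copy src_path to dst_path with every character in `unwanted` deleted."""
    # The third argument of str.maketrans lists the characters to delete.
    table = str.maketrans('', '', unwanted)
    with open(src_path, 'r', encoding='utf-8') as src:
        data = src.read()
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(data.translate(table))

# Demonstration on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'sample.txt')
    dst = os.path.join(tmp, 'sample_clean.txt')
    with open(src, 'w', encoding='utf-8') as f:
        f.write('Hello, World! 123\n')
    clean_file(src, dst, ' ,!123\n')
    with open(dst, encoding='utf-8') as f:
        result = f.read()

print(result)  # HelloWorld
```

Writing to a second file keeps the original intact, which makes it easier to rerun the cleaning step after adjusting the list of unwanted characters.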
You seem to want to exclude more characters than mere punctuation, but you can get most of them from the string module:
import string
punc = ' ' + string.punctuation + string.digits + "your extra chars"
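A short sketch of how that combined string might be used. string.punctuation only covers ASCII punctuation, so characters such as curly quotes have to be listed as extras; the extras and the sample text below are assumptions for illustration, and string.whitespace is used here in place of the single space so that newlines and tabs are removed as well.

```python
import string

# Hypothetical extras: characters outside ASCII that should also be removed.
extra_chars = '\u201c\u201d'  # left and right curly double quotes
unwanted = string.whitespace + string.punctuation + string.digits + extra_chars
table = str.maketrans('', '', unwanted)

# Made-up sample: He said: “Hi!” 42 times.
sample = 'He said: \u201cHi!\u201d 42 times.\n'
cleaned = sample.translate(table)
print(cleaned)  # HesaidHitimes
```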