
Removing all punctuation, spaces and other non-letter characters including numbers from a text file

I downloaded a book from Project Gutenberg and saved it as a text file. I started with the code below as an initial step.

Book_name = 'Animals.txt'                       
fd = open(Book_name, encoding='utf8')        
Animals = fd.read()                            
print (type(Animals), len(Animals))
words = Animals.split()
print(type(words), len(words))
fd.close()

After reading the book I chose (the text file), I did the following:

def remove_punc(string):
    punc = '''!()-[]{};:'"\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''
    for ele in string:
        if ele in punc:
            string = string.replace(ele, "")
    return string


try:
    with open(filename,'r',encoding="utf-8") as f:
        data = f.read()
    with open(filename,"w+",encoding="utf-8") as f:
        f.write(remove_punc(data))
    print("Removed punctuations from the file", filename)

It didn't work, so I couldn't proceed with the rest.


Wouldn't it be easier like this?

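from string import digits

from nltk.tokenize import RegexpTokenizer

# yourfile is assumed to hold the book's text as a single string
tokenizer = RegexpTokenizer(r"\w+")
clean_text = tokenizer.tokenize(yourfile)
my_string = " ".join(clean_text)
# Python 3: str.translate takes a single table, built here with str.maketrans
newstring = my_string.translate(str.maketrans("", "", digits))
print(newstring)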

That is, instead of removing what you don't want, keep what you want. You get your list of words, join it into a string, then remove the digits from the string with the translate method.
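
If installing NLTK is not an option, the same "keep what you want" idea can be sketched with the standard library alone. This is only an illustration, not the answerer's code: the function name keep_words and the sample string are made up here, and the pattern [A-Za-z]+ assumes that only unaccented English letters should count as letters.

```python
import re


def keep_words(text):
    """Return the runs of ASCII letters found in text, in order."""
    return re.findall(r"[A-Za-z]+", text)


sample = "Chapter 1: The cat's 9 lives!"
words = keep_words(sample)
print(words)            # list of letter-only tokens
print(" ".join(words))  # tokens joined back into one string
```

Because digits never match the pattern, no second pass with translate is needed in this variant.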

So if I understand you correctly, you want to remove literally every character except for A-Z and a-z?

import re
pattern = re.compile('[^A-Za-z]')
data = ''
with open(filename,'r',encoding="utf-8") as f:
    data = pattern.sub('', f.read())
with open(filename,"w+",encoding="utf-8") as f:
    f.write(data)
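
One thing to be aware of with [^A-Za-z] is that it also drops accented and non-Latin letters. The sketch below contrasts it with str.isalpha(), which accepts any Unicode letter. The function name strip_non_letters, the ascii_only flag and the sample text are assumptions made for this illustration only.

```python
import re

NON_ASCII_LETTER = re.compile(r"[^A-Za-z]")


def strip_non_letters(text, ascii_only=True):
    """Remove every character that is not a letter."""
    if ascii_only:
        # Keeps only A-Z and a-z, as in the answer above
        return NON_ASCII_LETTER.sub("", text)
    # str.isalpha() is True for any Unicode letter, accented or not
    return "".join(ch for ch in text if ch.isalpha())


# "\u00e9" is e-acute and "\u03bb" is the Greek letter lambda
sample = "Caf\u00e9, 24 \u03bb!"
print(strip_non_letters(sample))                    # ASCII letters only
print(strip_non_letters(sample, ascii_only=False))  # all Unicode letters
```

Which variant is right depends on whether the Greek and accented characters in the book should survive; the question's punc string suggests they should not.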

You can use the translate() method. First prepare a translation table that will remove punctuation. Then use it directly on your input data to write the output.

punc = '''!()-[]{};:'"\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''

removePunctuation = str.maketrans('','',punc)   # translation table

with open(filename,'r',encoding="utf-8") as f:
    data = f.read()

with open(filename,"w+",encoding="utf-8") as f:
    f.write(data.translate(removePunctuation))  # use translate directly

print("Removed punctuations from the file", filename)
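
Listing every unwanted character by hand is easy to get wrong. As a possible variation on the approach above, the table can be derived from the text itself, so that anything that is not an ASCII letter gets removed. The helper name build_removal_table and the sample string are hypothetical.

```python
import string


def build_removal_table(text):
    """Map every character in text that is not an ASCII letter to None."""
    unwanted = set(text) - set(string.ascii_letters)
    return str.maketrans("", "", "".join(unwanted))


# "\u00a3" is the pound sign
sample = "It cost \u00a35 (approx.) in 1890"
table = build_removal_table(sample)
print(sample.translate(table))
```

The table only covers characters that occur in the text it was built from, so it should be built from the same data it is applied to.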

You seem to want to exclude more characters than just punctuation, but you can get most of them from the string module:

import string

punc = ' ' + string.punctuation + string.digits + "your extra chars"
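
A small sketch of how those string constants could feed the translation table. The contents of extra_chars are an assumption (a few symbols taken from the question's list), and string.whitespace is used here in place of a single space so that newlines and tabs are removed as well.

```python
import string

# Placeholder for symbols not covered by the string module:
# curly quotes, pound sign and section sign
extra_chars = "\u201c\u201d\u00a3\u00a7"

# Characters to delete: whitespace, ASCII punctuation, digits and the extras
punc = string.whitespace + string.punctuation + string.digits + extra_chars
table = str.maketrans("", "", punc)

sample = "\u201cHello, World!\u201d cost \u00a35 in 1890.\n"
print(sample.translate(table))
```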

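Disclaimer: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.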
