Remove all punctuation, spaces, and other non-letter characters, including digits, from a text file

Question

I downloaded a book from Project Gutenberg and saved it as a text file. I started with the code below as an initial step.

Book_name = 'Animals.txt'
fd = open(Book_name, encoding='utf8')
Animals = fd.read()
print(type(Animals), len(Animals))
words = Animals.split()
print(type(words), len(words))
fd.close()

Having read in my chosen book (the text file), I then did the following:

def remove_punc(string):
    punc = '''!()-[]{};:'"\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''
    for ele in string:
        if ele in punc:
            string = string.replace(ele, "")
    return string


try:  # as posted, no except or finally clause follows this block
    with open(filename,'r',encoding="utf-8") as f:
        data = f.read()
    with open(filename,"w+",encoding="utf-8") as f:
        f.write(remove_punc(data))
    print("Removed punctuations from the file", filename)

It does not work, so I cannot continue with the rest.

The second solution is below:

Answer 1

Wouldn't this be easier?

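from string import digits
from nltk.tokenize import RegexpTokenizer

# yourfile is the book's text as a str, e.g. the result of f.read()
tokenizer = RegexpTokenizer(r"\w+")         # keep runs of word characters only
clean_text = tokenizer.tokenize(yourfile)   # list of tokens, punctuation dropped
my_string = " ".join(clean_text)
# Python 3: translate() takes a table; translate(None, digits) is Python 2 only
newstring = my_string.translate(str.maketrans("", "", digits))
print(newstring)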

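In other words, rather than removing what you don't want, keep what you do want. You get your list of words, join it into a single string, and then use the translate method to remove the digits from that string.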
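
The same keep-what-you-want idea can also be sketched with only the standard library, for cases where installing nltk is not an option. The function name `letters_only_words` and the sample sentence are made up for illustration, and restricting the pattern to ASCII letters is an assumption about what counts as a letter here.

```python
import re

def letters_only_words(text):
    """Return the runs of ASCII letters found in text, in order."""
    # [A-Za-z]+ matches one or more ASCII letters; digits, punctuation,
    # whitespace, and accented letters act as separators and are dropped.
    return re.findall(r"[A-Za-z]+", text)

sample = "Chapter 12: The fox's den, 3 miles away!"  # illustrative input
words = letters_only_words(sample)
print(" ".join(words))
```

Unlike `\w+`, this pattern also drops digits and underscores in the same pass, so no separate translate step is needed. An apostrophe splits a word in two, so "fox's" becomes "fox" and "s".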

Answer 2

So if I understand correctly, you want to remove literally every character except A-Z and a-z?

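import re

pattern = re.compile(r'[^A-Za-z]')    # anything that is not an ASCII letter
with open(filename, 'r', encoding="utf-8") as f:
    data = pattern.sub('', f.read())  # drop every non-letter character
with open(filename, "w", encoding="utf-8") as f:
    f.write(data)                     # overwrite the file with the cleaned text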
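
Since the snippet above overwrites the file in place, it may be worth trying the pattern on a short string first. This is a minimal sketch; `strip_non_letters` and the sample text are hypothetical and used only for illustration.

```python
import re

# Same character class as above: everything except ASCII letters.
NON_LETTERS = re.compile(r"[^A-Za-z]")

def strip_non_letters(text):
    """Delete every character that is not in A-Z or a-z."""
    return NON_LETTERS.sub("", text)

sample = "Élan vital, 2 cats & 1 dog.\n"  # illustrative input
result = strip_non_letters(sample)
print(result)
```

Note that the accented "É" is removed along with the punctuation, and that the words run together because spaces are removed too. If word boundaries matter, a space could be added to the character class.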

Answer 3

You can use the translate() method. First prepare a translation table that deletes the punctuation, then apply it directly to your input data when writing the output.

punc = '''!()-[]{};:'"\\, <>./?@#$%^&*_~12345678“90σ\nθμνëη=χéὁλςπε”οκ£ι§ρτυαωæδàγψ'''  # "\\" is a literal backslash, "\n" a newline

removePunctuation = str.maketrans('','',punc)   # translation table

with open(filename,'r',encoding="utf-8") as f:
    data = f.read()

with open(filename,"w+",encoding="utf-8") as f:
    f.write(data.translate(removePunctuation))  # use translate directly

print("Removed punctuations from the file", filename)

It looks like you want to exclude more characters than just punctuation, but you can get most of them from the string module:

import string

punc = ' ' + string.punctuation + string.digits + "your extra chars"
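
Combining the string-module constants with a translation table might look like the sketch below. The names `extra`, `unwanted`, and `REMOVE_TABLE`, the sample text, and the choice to include `string.whitespace` are assumptions for illustration. `string.punctuation` covers only ASCII punctuation, so curly quotes and similar characters still have to be listed by hand.

```python
import string

# Characters outside the ASCII constants that should also be removed
# (a hypothetical selection; extend it to match the book being cleaned).
extra = "“”£§"
unwanted = string.punctuation + string.digits + string.whitespace + extra

# Three-argument form: every character in the third string maps to None.
REMOVE_TABLE = str.maketrans("", "", unwanted)

sample = "“Hello,” she said: 3 cats!\n"  # illustrative input
cleaned = sample.translate(REMOVE_TABLE)
print(cleaned)
```

Building the table once and reusing it avoids the repeated `replace` calls of the original loop, because `translate` handles all listed characters in a single pass over the text.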
