How do I keep only specific characters and letters in a string in Python?

Question

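I am trying to filter a corpus in Python, which I have converted to a string. I only need to keep English letters and any x in T=[',', '.', ':', '\n', '#', '(', ')', '!', '?', "'", '"']

I have tried several approaches, but I could not manage to keep the special character \n along with the other characters.

One thing I tried: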

def cleancorpus(self, corptext):
   newtext=corptext
   newtext=newtext.lower()
   
   for i in range(0, len(newtext), 2):
       op, code = newtext[i:i+2]
       if(op=="\\" and code not in {"n"}):
           newtext=newtext.replace(op,"")
   newtext=''.join(x for x in newtext if x.isalpha() or x in T or x==' ')
   return newtext

However, it raises ValueError: not enough values to unpack (expected 2, got 1). I also tried iterating over the string character by character, but my problem is mainly with [\n, ", '].

Answer 1

Solved:

Treat the corpus as a bytes-like object:

import re
import urllib.request

class CorpusReader:
    def __init__(self, URL):
        with urllib.request.urlopen(URL) as response:
            text = response.read()  # bytes, not str
        # The with block closes the response; no explicit close is needed
        text = self.cleancorpus(text)

    def cleancorpus(self, corptext):
        newtext = corptext.lower()
        # Bytes pattern: delete everything except the allowed characters
        pattern = b'''[^ a-z,.:\n#()!?'"]'''
        newtext = re.sub(pattern, b'', newtext)
        return newtext
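As a quick check of the pattern above, here is a minimal sketch that runs it on a hard-coded bytes sample instead of a downloaded corpus. The sample and the name `clean_bytes` are made up for illustration. Note that the result is still bytes, so it would need `.decode()` before being used as a str.

```python
import re

# Same character class as above; inside a bytes literal, \n is a real newline byte.
PATTERN = re.compile(b'''[^ a-z,.:\n#()!?'"]''')

def clean_bytes(raw: bytes) -> bytes:
    """Lowercase the input, then drop every byte outside the allowed set."""
    return PATTERN.sub(b'', raw.lower())

# Hypothetical sample: contains a tab, a digit and a UTF-8 encoded e-acute.
sample = b'Hello, World!\n\t"Caf\xc3\xa9" #1 (ok)?'
cleaned = clean_bytes(sample)
print(cleaned)                  # still a bytes object
print(cleaned.decode('ascii'))  # safe here: only ASCII bytes survive the filter
```

Because only ASCII bytes survive the filter, non-ASCII letters such as the encoded e-acute are removed entirely rather than mapped to a plain letter.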

Answer 2

There are four separate errors in your attempt.

First, you only loop over pairs of characters. If a backslash occurs as the second character of a pair, you won't notice it.

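Second, when the number of characters is not even, you try to read past the end of the string. This is what causes the error you are actually asking about. (newtext[i:i+2] yields only a single character when i is 1 less than the length of newtext; the assignment to two separate variables then fails, because the expression yields only one value.)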
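A minimal sketch that reproduces the failure on a three-character string (the string "abc" is just an illustrative input):

```python
text = "abc"  # odd length, so the last slice holds a single character

# Stepping by 2 produces the slices 'ab' and 'c'.
pairs = [text[i:i+2] for i in range(0, len(text), 2)]
print(pairs)

try:
    op, code = text[2:4]  # the slice is 'c': only one value to unpack
except ValueError as err:
    message = str(err)
    print(message)
```

Slicing past the end of a string does not raise; it quietly returns a shorter string, and the error only appears at the unpacking step.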

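Third, a newline is a single character. The sequence \n inside a string in Python source code represents this character with a two-character notation, but in the string it denotes there is no backslash and no n, just one character (also written \x0a, also known as line feed).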
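A small sketch of the difference between the escape sequence and a literal backslash followed by n (the variable names are only for illustration):

```python
newline = "\n"    # escape sequence: one character
literal = r"\n"   # raw string: a backslash followed by n, two characters

print(len(newline), len(literal))
print(newline == "\x0a", ord(newline))
print("\\" in newline, "n" in newline)   # neither is present in a real newline

# Membership tests therefore work on the newline directly.
allowed = [",", "\n"]
kept = "".join(ch for ch in "a,\nb" if ch in allowed)
print(repr(kept))
```

So a plain `x in T` test already keeps newlines, provided T contains the one-character string "\n".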

Fourth, isalpha() is actually true for characters like ß and ä and 日.
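To see this in practice, the sketch below compares isalpha() with a membership test against string.ascii_letters. The helper name `is_english_letter` is hypothetical, and this is only one possible way to restrict the check to English letters.

```python
import string

def is_english_letter(ch: str) -> bool:
    """Hypothetical helper: true only for a-z and A-Z."""
    return ch in string.ascii_letters

samples = "aZßä日1"
alpha_flags = [ch.isalpha() for ch in samples]
english_flags = [is_english_letter(ch) for ch in samples]

for ch, alpha, english in zip(samples, alpha_flags, english_flags):
    print(repr(ch), alpha, english)
```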

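A common arrangement for backslash handling and the like is to implement a sliding window, so that you examine two characters starting at each character position in the string.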

def cleancorpus(self, corptext):
    # Still broken: looks for a literal \ followed by n
    # Still broken: isalpha() is wrong for the use case
    # T is the list of allowed characters from the question
    newtext = []
    skip = False
    for i in range(len(corptext)):
        if skip:
            skip = False
            continue
        op = corptext[i].lower()
        # Stylistically, use equality for both comparisons
        if op == "\\" and i < len(corptext)-1 and corptext[i+1] != "n":
            # Tell the next iteration to skip the next character, too
            skip = True
            continue
        elif op.isalpha() or op in T or op == ' ':
            newtext.append(op)
    return ''.join(newtext)

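As a minor efficiency hack, we collect the new text into a list and join the pieces back into a string at the end. Appending to a list is much faster than appending to a string, so we avoid doing the latter inside the loop.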
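Stripped of the backslash handling, the collect-then-join pattern looks roughly like this. The function name `keep_allowed` and the sample input are made up for illustration, and no timing claim is attached to this sketch.

```python
def keep_allowed(text: str, allowed: set) -> str:
    """Keep only the characters in `allowed`, building the result via a list."""
    pieces = []
    for ch in text:
        if ch in allowed:
            pieces.append(ch)   # cheap append to a list
    return "".join(pieces)      # one final join instead of repeated str +=

result = keep_allowed("a1b2c3", set("abc"))
print(result)
```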

But for your actual task, a much simpler solution is available:

import re

def cleancorpus(self, corptext):
    return re.sub(r'''[^ a-zA-Z,.:\n#()!?'"]''', '', corptext)
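As a usage sketch, here is the same substitution as a standalone function applied to a made-up sample (the name `clean` and the sample text are assumptions for illustration):

```python
import re

def clean(text: str) -> str:
    """Delete every character that is not a space, an ASCII letter, or listed punctuation."""
    return re.sub(r'''[^ a-zA-Z,.:\n#()!?'"]''', '', text)

# Hypothetical sample with an accented letter, digits, a tab and a semicolon.
sample = 'Héllo, "World" 42!\n(it\'s fine)\t#ok;'
cleaned = clean(sample)
print(repr(cleaned))
```

The newline, both kinds of quotes, and the listed punctuation survive, while the accented letter, the digits, the tab, and the semicolon are removed.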

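self has no meaning outside a class; this seems trivial enough that there should be no particular reason to want to encapsulate it in a class. But if you do, I guess you could compile the regex in the __init__ method and save it. Adapting from the answer you posted yourself,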

import re
import urllib.request

class CorpusReader:
    def __init__(self, URL):
        with urllib.request.urlopen(URL) as response:
            # .decode() produces a string from bytes
            # If you don't know the encoding, probably try UTF-8
            # then if that fails, figure out the _actual_ encoding
            text = response.read().decode()
        # You don't need to close when you use "with open(...)"
        # response.close
        self.regex = re.compile(r'''[^ a-zA-Z,.:\n#()!?'"]''')
        self.text = self.cleancorpus(text)

    def cleancorpus(self, corptext):
        # Use the compiled pattern saved in __init__
        return self.regex.sub('', corptext).lower()

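Your method did nothing useful with text; this saves it as self.text so that you can access it later. I kept the .lower(), which is not in your requirements but is used in your code; obviously, take it out if you don't want it.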

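The argument to decode could be extracted from response.headers['content-type'], but for a beginner, I guess simply hard-coding the expected encoding (if needed) will be acceptable and sufficient.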
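If you do want to read the encoding from the headers, one possible sketch follows. The object returned by urllib.request.urlopen exposes headers as an http.client.HTTPMessage, which is a subclass of email.message.Message and therefore has get_content_charset(). The helper name `decode_body` and the UTF-8 fallback are assumptions, and a plain Message object stands in for a real response so the example runs without network access.

```python
from email.message import Message

def decode_body(body: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode body using the charset from Content-Type, falling back to `default`."""
    charset = headers.get_content_charset() or default
    return body.decode(charset)

# Stand-in for response.headers with an explicit charset.
with_charset = Message()
with_charset["Content-Type"] = "text/plain; charset=ISO-8859-1"
latin = decode_body(b"caf\xe9", with_charset)

# Stand-in for response.headers without a charset: the fallback is used.
without_charset = Message()
without_charset["Content-Type"] = "text/plain"
utf = decode_body(b"caf\xc3\xa9", without_charset)

print(latin, utf)
```

With a real response this would be called as decode_body(response.read(), response.headers) inside the with block.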

Answer 1
0 2021-12-18 18:00:20

Answer 2
0 2021-12-20 09:34:17
