Removing control characters from a string in Python

Question

I currently have the following code:

def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
            i += 1
    return line

This does not work if there is more than one character to remove.

Answer 1

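There are hundreds of control characters in Unicode. If you are sanitizing data from the web or from any other source that may contain non-ASCII characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the Unicode category code of any character (e.g., control character, whitespace, letter). For control characters, the category always starts with "C".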

This snippet removes all control characters from a string.

import unicodedata
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")

Examples of Unicode categories:

>>> from unicodedata import category
>>> category('\r')      # carriage return --> Cc : control character
'Cc'
>>> category('\0')      # null character ---> Cc : control character
'Cc'
>>> category('\t')      # tab --------------> Cc : control character
'Cc'
>>> category(' ')       # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A')       # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',')       # comma  -----------> Po : punctuation
'Po'
>>>
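As the category listing shows, tab and carriage return are themselves in category Cc, so a filter on the "C" prefix strips them too. The sketch below illustrates this; the `keep` parameter is a hypothetical addition, not part of the original answer, for callers who want to preserve some whitespace.

```python
import unicodedata

def remove_control_characters(s, keep=""):
    # Drop every character whose Unicode category starts with "C".
    # `keep` is a hypothetical extension: characters listed in it are
    # preserved even when they are control characters.
    return "".join(
        ch for ch in s
        if ch in keep or unicodedata.category(ch)[0] != "C"
    )

text = "line1\r\nline2\t\u200bend\x00"
print(repr(remove_control_characters(text)))             # 'line1line2end'
print(repr(remove_control_characters(text, keep="\n")))  # 'line1\nline2end'
```

Without `keep`, line breaks disappear along with everything else, which may or may not be what multi-line input needs.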

Answer 2

You can use str.translate with an appropriate map, for example:

>>> mpa = dict.fromkeys(range(32))
>>> 'abc\02de'.translate(mpa)
'abcde'
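The map above covers only code points 0 to 31. A minimal sketch of the same idea extended to DEL (127) and the C1 range (128 to 159), which together make up category Cc; the names `CONTROL_MAP` and `strip_cc` are illustrative, not from the answer.

```python
# Map every Cc code point (C0: 0-31, DEL: 127, C1: 128-159) to None,
# which tells str.translate to delete that character.
CONTROL_MAP = dict.fromkeys(list(range(32)) + list(range(127, 160)))

def strip_cc(s):
    return s.translate(CONTROL_MAP)

print(repr(strip_cc("abc\x02de\x7f\x85!")))  # 'abcde!'
```

Building the map once and reusing it keeps each call to a single pass over the string.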

Answer 3

Anyone interested in a regex character class that matches any Unicode control character (general category Cc) can use [\x00-\x1f\x7f-\x9f].

You can test it like this:

>>> import unicodedata, re, sys
>>> all_chars = [chr(i) for i in range(sys.maxunicode)]
>>> control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
>>> expanded_class = ''.join(c for c in all_chars if re.match(r'[\x00-\x1f\x7f-\x9f]', c))
>>> control_chars == expanded_class
True

So, to remove control characters with re, just use the following:

>>> re.sub(r'[\x00-\x1f\x7f-\x9f]', '', 'abc\02de')
'abcde'
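For repeated use, the same character class can be compiled once and wrapped in a function. This is only a sketch; the `replacement` parameter is a hypothetical convenience for substituting something (such as a space) instead of deleting.

```python
import re

# Compile once; the class covers C0 controls, DEL, and C1 controls.
_CONTROL_RE = re.compile(r"[\x00-\x1f\x7f-\x9f]")

def remove_control_characters(s, replacement=""):
    return _CONTROL_RE.sub(replacement, s)

print(repr(remove_control_characters("abc\x02de")))     # 'abcde'
print(repr(remove_control_characters("a\tb\nc", " ")))  # 'a b c'
```

Note that this class matches category Cc only; format characters such as the zero width space (Cf) pass through unchanged.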

Answer 4

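This is the simplest, most complete, and most robust approach I know of. It does require an external dependency, but I think that is worth it for most projects.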

pip install regex

import regex as rx
def remove_control_characters(s):
    # \p{C} matches any character in the Unicode "C" categories
    return rx.sub(r'\p{C}', '', s)

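\p{C} is the Unicode character property for control characters, so you can leave it to the Unicode Consortium to decide which of the more than one million possible Unicode code points count as control. I also frequently use other very useful character properties, such as \p{Z} for any kind of whitespace.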
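To get a feel for how broad that property is, the standard library can count the code points in each general category that starts with "C". This sketch assumes, as is commonly understood, that \p{C} corresponds to that whole group; it does not use the regex module itself.

```python
import sys
import unicodedata
from collections import Counter

# Count code points per general category in the "C" group
# (Cc control, Cf format, Cs surrogate, Co private use, Cn unassigned).
counts = Counter(
    cat
    for cat in (unicodedata.category(chr(i)) for i in range(sys.maxunicode + 1))
    if cat.startswith("C")
)

for cat in sorted(counts):
    print(cat, counts[cat])
```

The Cf and Cn totals depend on the Unicode version bundled with the interpreter, so they will vary between Python releases.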

Answer 5

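Your implementation is wrong because the value of i is incorrect. That is not the only problem, however: it also repeatedly uses slow string operations, which means it runs in O(n²) rather than O(n). Try this: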

return ''.join(c for c in line if ord(c) >= 32)
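The one-liner is a function body, so it needs a surrounding def to run. A minimal complete version, reusing the `line` parameter name from the question:

```python
def remove_control_characters(line):
    # Build the result in a single pass instead of slicing inside the loop,
    # so consecutive control characters are handled correctly.
    return ''.join(c for c in line if ord(c) >= 32)

print(repr(remove_control_characters("a\x01\x02b\x03c")))  # 'abc'
```

Like the original code, this only targets code points below 32; DEL (127) and the C1 controls are left in place.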

Answer 6

For Python 2, use the built-in translate:

import string
all_bytes = string.maketrans('', '')  # String of 256 characters with (byte) value 0 to 255

line.translate(all_bytes, all_bytes[:32])  # All bytes < 32 are deleted (the second argument lists the bytes to delete)
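In Python 3, string.maketrans no longer exists, but bytes objects still offer a two-argument translate. The following is a rough equivalent for byte strings, offered as a sketch; `remove_control_bytes` is an illustrative name.

```python
# bytes.translate(table, delete): passing None as the table means
# "no remapping"; `delete` lists the byte values to drop.
CONTROL_BYTES = bytes(range(32))

def remove_control_bytes(data: bytes) -> bytes:
    return data.translate(None, CONTROL_BYTES)

print(remove_control_bytes(b"abc\x02de\n"))  # b'abcde'
```

This works on bytes, not str, so text has to be encoded first or handled with str.translate and a dictionary instead.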

Answer 7

You are modifying the line while iterating over it. Use something like ''.join([x for x in line if ord(x) >= 32]) instead.

Answer 8

filter(string.printable[:-5].__contains__,line)
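In Python 3, filter returns an iterator rather than a string, so the result has to be joined. A sketch of the same whitelist idea, with `ALLOWED` and `keep_printable_ascii` as illustrative names:

```python
import string

# string.printable ends with "\t\n\r\x0b\x0c"; slicing off the last five
# characters leaves digits, ASCII letters, punctuation, and the space.
ALLOWED = frozenset(string.printable[:-5])

def keep_printable_ascii(line):
    return ''.join(filter(ALLOWED.__contains__, line))

print(repr(keep_printable_ascii("a\tb c\x00é")))  # 'ab c'
```

Because this is a whitelist of printable ASCII, it also discards every non-ASCII character, not just control characters.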

Answer 9

I tried all of the above, but none of it helped. In my case, I had to remove the Unicode 'LRM' character:

In the end I found this solution:

df["AMOUNT"] = df["AMOUNT"].str.encode("ascii", "ignore")
df["AMOUNT"] = df["AMOUNT"].str.decode('UTF-8')

See the linked reference.
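The same round trip can be illustrated on a plain string, without pandas. The sketch below assumes the character in question is U+200E (LEFT-TO-RIGHT MARK) and uses a made-up `amount` value; it also shows a narrower alternative that removes only that character.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK

amount = "1,234.56" + LRM  # made-up sample value
print(unicodedata.category(LRM))  # Cf

# Targeted removal: only the LRM goes, other characters are untouched.
print(repr(amount.replace(LRM, "")))  # '1,234.56'

# The ASCII round trip also removes it, but drops every non-ASCII character.
print("€5\u200e".encode("ascii", "ignore").decode("ascii"))  # 5
```

Since the LRM is in category Cf, a filter on categories starting with "C" would be expected to remove it as well; the ASCII round trip is the bluntest of these options because it silently discards all non-ASCII text.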

9 answers

Answer 1: score 153, accepted, 2013-09-25 22:17:35
Answer 2: score 30, 2010-12-01 13:30:31
Answer 3: score 16, 2016-09-09 16:37:26
Answer 4: score 12, 2019-01-16 20:57:26
Answer 5: score 8, 2010-12-01 13:31:50
Answer 6: score 7, 2010-12-01 16:02:29
Answer 7: score 2, 2010-12-01 13:33:31
Answer 8: score 2, 2010-12-01 15:02:45
Answer 9: score 0, 2021-10-05 10:50:57
