Replace non-ASCII characters with a single space

Question

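I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead simple in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i) < 128)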

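And this one replaces each non-ASCII character with as many spaces as there are bytes in the character's encoding (e.g. the – character is replaced with 3 spaces):

import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)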

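How can I replace all non-ASCII characters with a single space?

Of the myriad similar SO questions, none addresses character replacement as opposed to stripping, and none addresses all non-ASCII characters rather than one specific character.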

Answer 1

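Your ''.join() expression is filtering, removing anything that is non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and still uses one space per replaced character.

Your regular expression should just replace each run of consecutive non-ASCII characters with a single space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.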
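To make the difference between the two variants concrete, here is a minimal sketch. The function names replace_each and replace_runs are hypothetical, chosen only for illustration.

```python
import re

def replace_each(text):
    """Replace every non-ASCII character with its own space."""
    return ''.join(c if ord(c) < 128 else ' ' for c in text)

def replace_runs(text):
    """Replace each run of consecutive non-ASCII characters with one space."""
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ABC馬克def'
print(repr(replace_each(sample)))  # 'ABC  def' (two spaces)
print(repr(replace_runs(sample)))  # 'ABC def' (one space)
```

Which one is right depends on whether "a single space" means one per character or one per run.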

Answer 2

To get the most similar ASCII representation of your original string, I recommend the unidecode module:

# python 2.x:
from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(unicode(text, encoding = "utf-8"))

Then you can use it on a string:

remove_non_ascii("Ceñía")
Cenia
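On Python 3, str is already Unicode, so the unicode(...) call is not needed and unidecode(text) can be called directly. If a third-party module is not an option, a rough standard-library approximation might look like the sketch below. This is not how unidecode works: it only handles characters that decompose into an ASCII base plus combining marks, and the name strip_accents is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Approximate ASCII form: decompose, then drop whatever is not ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('Ceñía'))  # Cenia
```

Characters with no ASCII decomposition (CJK text, for instance) are silently dropped by this sketch, whereas unidecode would transliterate them.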

Answer 3

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC馬克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

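But note that you will still have a problem if your string contains decomposed Unicode characters (for example, a base character followed by a separate combining accent mark):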

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
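One possible way around this, sketched under the assumption that composing the text first is acceptable for your data, is to normalize to NFC before replacing. The function name replace_non_ascii is hypothetical.

```python
import re
import unicodedata

def replace_non_ascii(text):
    """Compose to NFC first, then replace each non-ASCII code point."""
    composed = unicodedata.normalize('NFC', text)
    return re.sub(r'[^\x00-\x7f]', ' ', composed)

decomposed = unicodedata.normalize('NFD', 'ma\u00f1ana')  # 'n' + U+0303
print(len(decomposed))                       # 7
print(repr(replace_non_ascii(decomposed)))   # 'ma ana'
```

Sequences that have no precomposed form still span several code points after NFC, so this reduces the problem rather than eliminating it.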

Answer 4

If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results:

0.7208260721400134
0.009975979187503592
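If a space is required rather than '?', the same encode/decode round trip can probably be kept by registering a custom codec error handler. The sketch below uses codecs.register_error from the standard library; the handler name 'space_replace' and the function names are made up for this example, and its speed relative to the two timings above has not been measured.

```python
import codecs

def _space_handler(error):
    """Codec error handler: emit one space per unencodable character."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # error.start and error.end delimit the unencodable slice of the input.
    return (' ' * (error.end - error.start), error.end)

# 'space_replace' is a made-up handler name, not a built-in one.
codecs.register_error('space_replace', _space_handler)

def replace_non_ascii(text):
    return text.encode('ascii', 'space_replace').decode('ascii')

print(repr(replace_non_ascii('ABC馬克def?')))  # 'ABC  def?'
```

Unlike calling .replace('?', ' ') on the result of the 'replace' handler, this leaves question marks that were already in the text untouched.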

Answer 5

How about this one?

def replace_trash(unicode_string):
    chars = list(unicode_string)
    for i, char in enumerate(chars):
        try:
            char.encode("ascii")
        except UnicodeEncodeError:
            # The character is non-ASCII: replace it with a single space.
            chars[i] = " "
    return "".join(chars)

Answer 6

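As a native and efficient approach, you don't need ord or any loop over the characters. Just encode to ASCII and ignore the errors.

The following will just remove the non-ASCII characters:

new_string = old_string.encode('ascii', errors='ignore')

Now, if you want to make up for the removed characters, pad the result with one space per removed character. Note that new_string is a bytes object and that the spaces are appended at the end, not inserted where the characters were:

final_string = new_string + b' ' * (len(old_string) - len(new_string))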

Answer 7

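When we use ascii(), it escapes non-ASCII characters and leaves ASCII characters unchanged. So my main idea is this: since it doesn't change ASCII characters, I iterate through the string and check whether each character has changed. If it has, I replace it with whatever replacer you pass in.
For example: ' ' (a single space) or '?' (a question mark).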

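def remove(x, replacer):
    for i in x:
        # ascii() escapes non-ASCII characters, so a mismatch means i changed.
        # It also escapes single quotes, backslashes and control characters,
        # so those are replaced as well.
        if f"'{i}'" != ascii(i):
            x = x.replace(i, replacer)
    return x

remove('hái', ' ')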

Result: "h i" (with a single space in the middle).

Syntax: remove(str, non_ascii_replacer)
str = the string you want to process.
non_ascii_replacer = the replacement to substitute for every non-ASCII character.
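On Python 3.7 and later, str.isascii() offers a more direct test than comparing against ascii(). A minimal sketch with the same calling convention follows; the name replace_non_ascii and the default replacer are assumptions made for this example.

```python
def replace_non_ascii(text, replacer=' '):
    """Replace each non-ASCII character in text with replacer."""
    return ''.join(c if c.isascii() else replacer for c in text)

print(repr(replace_non_ascii('hái')))        # 'h i'
print(repr(replace_non_ascii("it's", '?')))  # "it's"
```

Because the check is made per character, quotes, backslashes and control characters are left alone.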

Answer 8

Preprocessing with Raku (formerly known as Perl_6)

~$ raku -pe 's:g/ <:!ASCII>+ / /;' file

Sample input:

Peace be upon you
السلام عليكم
שלום עליכם
Paz sobre vosotros

Sample output:

Peace be upon you


Paz sobre vosotros

Note that you can get extensive information about the matches with the following code:

~$ raku -ne 'say s:g/ <:!ASCII>+ / /.raku;' file
$( )
$(Match.new(:orig("السلام عليكم"), :from(0), :pos(6)), Match.new(:orig("السلام عليكم"), :from(7), :pos(12)))
$(Match.new(:orig("שלום עליכם"), :from(0), :pos(4)), Match.new(:orig("שלום עליכם"), :from(5), :pos(10)))
$( )
$( )

Or, more simply, you can visualize the replacement spaces:

~$ raku -ne 'say S:g/ <:!ASCII>+ / /.raku;' file
"Peace be upon you"
"   "
"   "
"Paz sobre vosotros"
""

https://docs.raku.org/language/regexes#Unicode_properties
https://www.codesections.com/blog/raku-unicode/
https://raku.org
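For comparison, roughly the same line-by-line substitution can be sketched in Python. The names NON_ASCII_RUN and clean_lines are hypothetical, and the sample list stands in for the file used above.

```python
import re

# One or more consecutive characters outside the ASCII range.
NON_ASCII_RUN = re.compile(r'[^\x00-\x7F]+')

def clean_lines(lines):
    """Yield each line with runs of non-ASCII characters collapsed to a space."""
    for line in lines:
        yield NON_ASCII_RUN.sub(' ', line)

sample = ['Peace be upon you', 'السلام عليكم', 'שלום עליכם', 'Paz sobre vosotros']
for cleaned in clean_lines(sample):
    print(repr(cleaned))
```

As in the Raku output, each of the two middle lines becomes three spaces: one for each word and one for the ASCII space between them.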

Answer 9

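You can also do:

from string import ascii_letters

And then just (note that this keeps only ASCII letters, so digits, punctuation and whitespace are replaced as well):

new_string = ''.join([s if s in ascii_letters else ' ' for s in string])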

Answer 10

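My problem was that my string contained things like BelgiÃ« for België and &#x20AC for the € sign. I didn't want to replace them with spaces, but with the correct symbol itself.

My solution was string.encode('Latin1').decode('utf-8')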
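A small sketch of that round trip follows, under the assumption that the text really is UTF-8 that was decoded as Latin-1. The name fix_mojibake is hypothetical, and html.unescape is an addition not mentioned in the answer, shown only as one standard-library way to handle the HTML character reference.

```python
import html

def fix_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1."""
    return text.encode('latin-1').decode('utf-8')

print(fix_mojibake('Belgi\u00c3\u00ab'))  # België
print(html.unescape('&#x20AC;'))          # €
```

If the input is not this kind of mojibake, the encode or decode step raises a UnicodeError, so it is worth catching that and falling back to the original text.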

Answer 11

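Potentially for a different question, but I'm providing my version of @Alvero's answer (using unidecode). I want to do a "regular" strip on my strings, i.e. remove whitespace characters from the beginning and end of the string, and then replace any other whitespace-like characters with a "regular" space, i.e. go from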

"Ceñíaㅤmañanaㅤㅤㅤㅤ"

to

"Ceñía mañana"

from unidecode import unidecode

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)

We first replace every character that unidecode maps to an empty string with a regular space (and join everything back together),

''.join((c if unidecode(c) else ' ') for c in s)

then we split that again, using Python's normal split, and strip each "bit",

(bit.strip() for bit in s.split())

and lastly join the bits back together, keeping only those that pass the if test,

' '.join(stripped for stripped in s if stripped)

With this, safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ') correctly returns 'Ceñía mañana'.

Answer 12

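To replace all non-ASCII characters (anything outside \x00-\x7F) with a space:

''.join(map(lambda x: x if ord(x) in range(0, 128) else ' ', text))

To keep only the visible ASCII characters and replace everything else (including whitespace and control characters) with a space, try this:

import string

''.join(map(lambda x: x if x in string.printable and x not in string.whitespace else ' ', text))

This will give the same result:

''.join(map(lambda x: x if ord(x) in range(33, 127) else ' ', text))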

12 answers

Answer 1: score 291, accepted, 2013-11-19 18:11:35
Answer 2: score 68, 2016-02-18 20:50:55
Answer 3: score 26, 2013-11-19 18:29:14
Answer 4: score 15, 2017-01-03 06:31:18
Answer 5: score 9, 2016-08-20 22:35:18
Answer 6: score 7, 2018-01-23 14:39:32
Answer 7: score 2, 2020-12-22 08:48:46
Answer 8: score 1, 2022-06-19 02:41:00
Answer 9: score 1, 2022-07-17 18:49:37
Answer 10: score 0, 2021-06-10 10:21:24
Answer 11: score -1, 2019-04-08 15:03:03
Answer 12: score -1, 2021-12-06 21:01:54
