Replace special characters with ASCII equivalents

Question

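Is there a library that replaces special characters with their ASCII equivalents, for example:

"Cześć"

to:

"Czesc"

I could of course create a map:

{'ś':'s', 'ć': 'c'}

and use some replace function. But I'd rather not hard-code all the equivalents in my program if some function already does this.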

Answer 1

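#!/usr/bin/env python3

import unicodedata
text = 'Cześć'
# NFD splits each accented letter into a base letter plus combining marks;
# encoding to ASCII with 'ignore' then drops everything that is not ASCII.
print(unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('ascii'))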

Answer 2

You can get most of the way there with:

import unicodedata

def strip_accents(text):
    return ''.join(c for c in unicodedata.normalize('NFKD', text) if unicodedata.category(c) != 'Mn')

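Unfortunately, there are Latin letters that cannot be decomposed into an ASCII letter plus combining marks. You have to handle those manually. They include: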

Æ → AE
Ð → D
Ø → O
Þ → TH
ß → ss
æ → ae
ð → d
ø → o
þ → th
Œ → OE
œ → oe
ƒ → f
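
One way to combine strip_accents with the manual cases is a small fallback table applied before normalization. This is only a minimal sketch: the names FALLBACKS, FALLBACK_TABLE and to_ascii are made up for illustration, and the table covers just the letters in the list above.

```python
import unicodedata

# Hypothetical fallback table for letters that have no Unicode decomposition.
FALLBACKS = {
    'Æ': 'AE', 'Ð': 'D', 'Ø': 'O', 'Þ': 'TH', 'ß': 'ss',
    'æ': 'ae', 'ð': 'd', 'ø': 'o', 'þ': 'th',
    'Œ': 'OE', 'œ': 'oe', 'ƒ': 'f',
}
FALLBACK_TABLE = str.maketrans(FALLBACKS)


def to_ascii(text):
    """Apply the manual replacements, then strip combining marks."""
    replaced = text.translate(FALLBACK_TABLE)
    decomposed = unicodedata.normalize('NFKD', replaced)
    # Category 'Mn' is "Mark, nonspacing", i.e. the combining accents.
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


print(to_ascii('Cześć'))   # Czesc
print(to_ascii('Straße'))  # Strasse
```

Letters that are neither in the table nor decomposable, such as Ł, pass through unchanged, so the table has to be extended for them.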

Answer 3

The unidecode package worked best for me:

from unidecode import unidecode
text = "Björn, Łukasz and Σωκράτης."
print(unidecode(text))
# ==> Bjorn, Lukasz and Sokrates.

You may need to install the package:

pip install unidecode

The solution above is easier and more robust than encoding (and decoding) the output of unicodedata.normalize(), as other answers suggest.

# This doesn't work as expected:
import unicodedata
ret = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(ret)
# ==> b'Bjorn, ukasz and .'
# Besides not supporting all characters, the returned value is a
# bytes object in python3. To yield a str type:
ret = ret.decode("utf8") # (not required in python2)

Answer 4

This is how I did it:

import unicodedata

# Keys are the two UTF-8 bytes of each Polish letter, read as one big-endian integer.
POLISH_CHARACTERS = {
    50309:'a',50311:'c',50329:'e',50562:'l',50564:'n',50099:'o',50587:'s',50618:'z',50620:'z',
    50308:'A',50310:'C',50328:'E',50561:'L',50563:'N',50067:'O',50586:'S',50617:'Z',50619:'Z',}

def encodePL(text):
    # NFC keeps each accented letter as a single code point before encoding to bytes.
    data = unicodedata.normalize('NFC', text).encode('utf-8')
    i = 0
    ret_str = []
    while i < len(data):
        if data[i] > 127: # non-ASCII; assumes a two-byte UTF-8 sequence
            lkey = int.from_bytes(data[i:i+2], 'big')
            # Pairs that are not in the map are dropped.
            ret_str.append(POLISH_CHARACTERS.get(lkey, ''))
            i = i+1
        else: # pure ASCII character
            ret_str.append(chr(data[i]))
        i = i+1
    return ''.join(ret_str)

When called as:

encodePL('ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ')

it produces the following output:

'acelnoszz ACELNOSZZ'

This works fine for me ;D

Answer 5

Try the trans package. It looks promising and supports Polish.

Answer 6

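The unicodedata.normalize gimmick can best be described as half-assci. Here is a robust approach that includes a map for letters with no decomposition. Note the additional map entries in the comments.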
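
The code for this answer is not included here. As a rough illustration of the idea it describes (normalization plus a map for letters with no decomposition), here is a minimal sketch. It is not the answerer's own code: the class AsciiMap, the helper asciify and the entries in NO_DECOMPOSITION are assumptions chosen for the example, and the map is deliberately incomplete.

```python
import unicodedata


class AsciiMap(dict):
    """Translation table for str.translate that fills itself in lazily."""

    # Hypothetical, incomplete map of letters without a Unicode decomposition.
    NO_DECOMPOSITION = {
        'Ł': 'L', 'ł': 'l', 'Ø': 'O', 'ø': 'o', 'Đ': 'D', 'đ': 'd',
        'Æ': 'AE', 'æ': 'ae', 'Œ': 'OE', 'œ': 'oe', 'ß': 'ss',
    }

    def __missing__(self, codepoint):
        ch = chr(codepoint)
        if codepoint < 128:
            replacement = ch  # ASCII stays as it is
        elif ch in self.NO_DECOMPOSITION:
            replacement = self.NO_DECOMPOSITION[ch]
        else:
            # Decompose and keep only the ASCII parts (drops combining marks
            # and any character with no ASCII base, e.g. Greek letters).
            decomposed = unicodedata.normalize('NFKD', ch)
            replacement = ''.join(c for c in decomposed if ord(c) < 128)
        self[codepoint] = replacement  # cache for the next lookup
        return replacement


_TABLE = AsciiMap()


def asciify(text):
    """Return text with non-ASCII characters replaced or removed."""
    return text.translate(_TABLE)


print(asciify('Björn, Łukasz'))  # Bjorn, Lukasz
```

Because str.translate looks characters up by code point, the table only computes a replacement the first time a character is seen. Unlike a transliteration library, this sketch silently drops characters it has no mapping for.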

6 answers

Answer 1: score 37, accepted, 2010-07-07 12:19:56
Answer 2: score 21, 2010-07-12 06:10:54
Answer 3: score 8, 2019-11-01 19:31:52
Answer 4: score 4, 2012-04-06 13:43:49
Answer 5: score 4, 2012-03-13 11:40:48
Answer 6: score 1, 2010-07-12 07:13:14