Python generic function to replace special characters

I have been looking for a while now, but I have not been able to find a proper solution.

I have a database of Dutch, French and German words, all of which have their own special characters, e.g. é, è, ß, ç, etc.
For some cases, such as in a URL, I would like to replace these with alphanumeric characters: e, e, ss, c, etc., respectively.

Is there a generic function or Python package that does this?

I could do this with a regex, of course, but something generic would be great here.

Thanks.

Try this package: https://pypi.python.org/pypi/Unidecode

>>> import unidecode
>>> unidecode.unidecode(u'çß')
'css'

As you say, this could be done using a regex sub. You would of course need to include both uppercase and lowercase variants.

import re

data = "é, è, ß, ç, äÄ"
lookup = {'é':'e', 'è':'e', 'ß':'ss', 'ç':'c', 'ä':'a', 'Ä':'A'}
# Replace each matched character with its entry in the lookup table.
print(re.sub(r'[éèßçäÄ]', lambda m: lookup[m.group(0)], data))

This would display the following:

e, e, ss, c, aA
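
To avoid keeping the character class and the lookup table in sync by hand, the pattern can be built from the table itself. The following is only a sketch of that idea: the names `BASE`, `with_uppercase` and `replace_special` are made up for illustration, and the table is a tiny sample rather than a complete mapping.

```python
import re

# Hypothetical sample table; extend it with the characters your data needs.
BASE = {'é': 'e', 'è': 'e', 'ß': 'ss', 'ç': 'c', 'ä': 'a'}

def with_uppercase(base):
    """Return a copy of base that also covers single-character uppercase variants."""
    table = dict(base)
    for char, repl in base.items():
        upper = char.upper()
        # 'ß'.upper() is 'SS' (two characters), so only add variants
        # that are a single, different character.
        if len(upper) == 1 and upper != char:
            table.setdefault(upper, repl.capitalize())
    return table

LOOKUP = with_uppercase(BASE)

# Build the character class from the table keys so the two cannot drift apart.
PATTERN = re.compile('[' + re.escape(''.join(LOOKUP)) + ']')

def replace_special(text):
    """Replace every character found in LOOKUP; leave everything else alone."""
    return PATTERN.sub(lambda m: LOOKUP[m.group(0)], text)

print(replace_special("é, è, ß, ç, äÄ"))  # e, e, ss, c, aA
```

Characters that are not in the table pass through unchanged, so this only helps if the table covers everything that appears in your data.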

You can almost get away with the built-in unicodedata module. Unfortunately, a few of your characters break it: ß has no decomposition, so it is dropped rather than converted.

>>> import unicodedata
>>> s = "é, è, ß, ç"
>>> unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
'e, e, , c'
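
One way to work around the dropped characters is to translate the ones that have no decomposition before normalizing. This is a minimal sketch, assuming a small hand-written fallback table; `FALLBACK` and `to_ascii` are illustrative names, and the table only lists a few characters that NFKD leaves untouched.

```python
import unicodedata

# Assumed fallback table for characters that NFKD does not decompose.
# It is deliberately short; add entries as your data requires.
FALLBACK = str.maketrans({'ß': 'ss', 'æ': 'ae', 'Æ': 'AE', 'ø': 'o', 'Ø': 'O'})

def to_ascii(text):
    """Best-effort ASCII version of text using the fallback table plus NFKD."""
    # Handle characters with no decomposition first.
    text = text.translate(FALLBACK)
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks and anything else outside ASCII.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii("é, è, ß, ç"))  # e, e, ss, c
```

Anything that is neither decomposable nor in the fallback table is still silently removed, so it is worth checking the output against a sample of the real data.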

Here is a solution with the code points hardcoded, taken from http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/

def latin1_to_ascii (unicrap):
    """This takes a UNICODE string and replaces Latin-1 characters with
        something equivalent in 7-bit ASCII. It returns a plain ASCII string. 
        This function makes a best effort to convert Latin-1 characters into 
        ASCII equivalents. It does not just strip out the Latin-1 characters.
        All characters in the standard 7-bit ASCII range are preserved. 
        In the 8th bit range all the Latin-1 accented letters are converted 
        to unaccented equivalents. Most symbol characters are converted to 
        something meaningful. Anything not converted is deleted.
    """
    xlate={0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>', 
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
        }

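    r = ''
    for i in unicrap:
        if ord(i) in xlate:
            # Known Latin-1 character: use its ASCII replacement.
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            # Unmapped non-ASCII character: drop it.
            pass
        else:
            # Plain 7-bit ASCII: keep as-is.
            r += i
    return r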
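
On Python 3, a table keyed by code point like the one above can also be handed to `str.translate` directly, which replaces the explicit loop. The sketch below assumes that approach and uses an abbreviated, hypothetical subset of the table (`XLATE`) to stay short; the function name is made up for illustration.

```python
# Abbreviated subset of the xlate table, for illustration only.
XLATE = {0xc4: 'A', 0xdf: 'ss', 0xe4: 'a', 0xe7: 'c',
         0xe8: 'e', 0xe9: 'e', 0xa9: '{C}'}

def latin1_to_ascii_translate(text):
    """Map known code points via XLATE, then delete any remaining non-ASCII."""
    # str.translate accepts a dict keyed by integer code points.
    translated = text.translate(XLATE)
    # Mirror the original behaviour: anything not converted is deleted.
    return translated.encode('ascii', 'ignore').decode('ascii')

print(latin1_to_ascii_translate("é, è, ß, ç"))  # e, e, ss, c
```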

Of course, you could just as easily use a regex, as indicated in the other answers.
