
python - conversion of a string to a unicode string

I'm using the unidecode library to convert accented strings to their ASCII representations.

>>> accented_string = u'Málaga'
>>> # accented_string is of type 'unicode'
>>> import unidecode
>>> unidecode.unidecode(accented_string)
'Malaga'

But the problem is that I'm reading the strings from a file. How do I pass them to the unidecode library?

for name in strings:
   print unidecode.unidecode(u+name) #?????

I can't get my head around it. If I encode the string, that just gives me the wrong encoding.

We still don't know the type of your pandas column, so here are two versions for Python 2:

  • If strings is already a sequence of Unicode strings (type(name) is unicode):

     for name in strings:
         print unidecode.unidecode(name)

  • If the elements of strings are regular Python 2 str (type(name) is str):

     for name in strings:
         print unidecode.unidecode(name.decode("utf-8"))

This will work if your strings are stored in the UTF-8 encoding. Otherwise you'll have to supply the appropriate encoding, e.g. "latin-1".
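The effect of the codec choice can be seen on the raw bytes of the example name. The following is a minimal Python 3 sketch, not part of the original answer; the helper name to_text is hypothetical and does not come from unidecode.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it only if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


utf8_bytes = b"M\xc3\xa1laga"   # 'Málaga' encoded as UTF-8
latin1_bytes = b"M\xe1laga"     # 'Málaga' encoded as Latin-1

# Matching codec: both byte strings decode to the same text.
from_utf8 = to_text(utf8_bytes)
from_latin1 = to_text(latin1_bytes, "latin-1")

# Wrong codec: no exception is raised, but the text is garbled.
garbled = to_text(utf8_bytes, "latin-1")
```

Decoding UTF-8 bytes as Latin-1 never fails, which is why a wrong guess tends to show up as garbled output rather than as an error.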

In Python 3, the first version should work (with print called as a function); you'll have to sort out any encoding issues before you get to this point, i.e. when you first read your data from disk.
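Sorting out the encoding at read time could look like the sketch below. It assumes a text file with one name per line; the function name read_names and the default of UTF-8 are illustrative assumptions, not something the question specifies.

```python
import io


def read_names(path, encoding="utf-8"):
    """Read one name per line and return them as text strings."""
    # io.open decodes while reading, so every line is already Unicode text.
    with io.open(path, encoding=encoding) as handle:
        # Skip blank lines and drop the trailing newline of each name.
        return [line.strip() for line in handle if line.strip()]
```

Each element of the returned list is a text string, so it can be passed to unidecode.unidecode without any further decoding.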

I found a workaround that turned out to be simple: decode the string that was read back to a unicode string and then pass it to the unidecode library.

>>> accented_string = 'Málaga'
>>> accented_string_u = accented_string.decode('utf-8')
>>> import unidecode
>>> unidecode.unidecode(accented_string_u)
'Malaga'

Use unicodedata.normalize:

import unicodedata

accented_string = u"Málaga"
unicodedata.normalize("NFKD", accented_string).encode("ascii", "ignore")
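In Python 3 the same call returns bytes, so a final decode is needed to get text back. A minimal sketch, assuming Python 3; the function name strip_accents is hypothetical:

```python
import unicodedata


def strip_accents(text):
    """Return text with accents removed, using NFKD plus an ASCII filter."""
    # NFKD splits 'á' into 'a' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # "ignore" silently drops every character that is not ASCII,
    # including characters that have no decomposition at all.
    return decomposed.encode("ascii", "ignore").decode("ascii")


result = strip_accents("Málaga")
```

This approach only removes what decomposes into an ASCII base plus combining marks. A character such as "ß", which has no decomposition, is dropped rather than transliterated.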

There are four normalization forms that you can use: "NFC", "NFKC", "NFD", and "NFKD".

Here are the details, as given in the unicodedata documentation:

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.
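The C-with-cedilla example can be checked directly. This is a small Python 3 sketch; the variable names are illustrative:

```python
import unicodedata

composed = "\u00c7"      # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"   # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# NFD splits the precomposed character into base letter plus combining mark.
nfd = unicodedata.normalize("NFD", composed)

# NFC recombines the two code points into the single precomposed character.
nfc = unicodedata.normalize("NFC", decomposed)

# Both forms render the same, but they differ in length: 2 versus 1.
lengths = (len(nfd), len(nfc))
```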

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.
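The difference between canonical and compatibility decomposition shows up on the Roman numeral from the passage above. A short Python 3 sketch with illustrative variable names:

```python
import unicodedata

roman_one = "\u2160"   # ROMAN NUMERAL ONE

# Canonical decomposition leaves the compatibility character untouched.
canonical = unicodedata.normalize("NFD", roman_one)

# Compatibility decomposition replaces it with LATIN CAPITAL LETTER I.
compat_kd = unicodedata.normalize("NFKD", roman_one)
compat_kc = unicodedata.normalize("NFKC", roman_one)
```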

Even if two unicode strings are normalized and look the same to a human reader, if one has combining characters and the other doesn't, they may not compare equal.
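One common way to avoid this pitfall is to normalize both operands to the same form before comparing. A minimal Python 3 sketch; the helper name same_text and the default form are assumptions for illustration:

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00c7"      # one precomposed code point
decomposed = "C\u0327"   # base letter plus combining cedilla

plain_equal = composed == decomposed            # compares raw code points
normalized_equal = same_text(composed, decomposed)
```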
