
What is the best way to remove accents (normalize) in a Python unicode string?

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
  2. remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterparts.

Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.

Example:

accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga' and is of type 'str'
How about this:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')
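A quick sanity check (sample strings of my own; the function body is repeated so the snippet runs standalone). 'Mn' is the Unicode category "Mark, nonspacing", i.e. combining marks:

```python
import unicodedata

def strip_accents(s):
    # NFD splits each accented letter into a base letter plus combining
    # marks; the 'Mn' (Mark, nonspacing) filter then drops the marks.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

print(strip_accents('Málagueña'))  # Malaguena
print(strip_accents('kožušček'))   # kozuscek
```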

I just found this answer on the Web:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than by dropping the non-ASCII characters, because that will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the Unicode characters that are tagged as being diacritics.
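The Greek failure mode is easy to reproduce (example string of my own): every Greek letter is non-ASCII, so the encode/ignore step throws away the whole word rather than just its accents:

```python
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii

# 'Μάιος' (Greek for "May") contains no ASCII letters at all,
# so the entire word is discarded, not just its accent:
print(remove_accents('Μάιος'))  # b''
```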

Edit: this does the trick:

import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) returns a nonzero value (the character's canonical combining class) if the character c can be combined with the preceding character, which is mainly the case when it is a diacritic.
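This can be checked interactively; for 'é', NFD yields the base letter 'e' (combining class 0) followed by U+0301 COMBINING ACUTE ACCENT (combining class 230):

```python
import unicodedata

# NFD splits 'é' into a base letter plus a combining mark:
decomposed = unicodedata.normalize('NFD', 'é')
print([hex(ord(c)) for c in decomposed])               # ['0x65', '0x301']
print([unicodedata.combining(c) for c in decomposed])  # [0, 230]
```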

Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

encoding = "utf-8"  # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = "café".encode(encoding)  # in Python 2 simply "café"; Python 3 bytes literals cannot contain non-ASCII characters
unicode_string = byte_string.decode(encoding)

Actually, I work on a project that must be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user entries.

Thanks to you, I have created this function, which works wonders.

import re
import unicodedata

def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    try:
        text = unicode(text, 'utf-8')
    except (TypeError, NameError): # unicode is a default on python 3 
        pass
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    text = text.decode("utf-8")
    return str(text)

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    text = strip_accents(text.lower())
    text = re.sub('[ ]+', '_', text)
    text = re.sub('[^0-9a-zA-Z_-]', '', text)
    return text

Result:

text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
>>> 'montreal_uber_1289_mere_francoise_noel_889'

This handles not only accents, but also "strokes" (as in ø etc.):

import unicodedata as ud

def rmdiacritics(char):
    '''
    Return the base character of char, by "removing" any
    diacritics like accents or curls and strokes and the like.
    '''
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char
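For example (my own test strings), applying it character by character. Note that this, too, folds ñ to n, since the Unicode name of 'ñ' is LATIN SMALL LETTER N WITH TILDE:

```python
import unicodedata as ud

def rmdiacritics(char):
    # Strip everything after ' WITH ' from the character's Unicode name
    # and look the base character up under the shortened name.
    desc = ud.name(char)
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        desc = desc[:cutoff]
        try:
            char = ud.lookup(desc)
        except KeyError:
            pass  # removing "WITH ..." produced an invalid name
    return char

print(''.join(rmdiacritics(c) for c in 'Málagueña'))  # Malaguena
print(rmdiacritics('ø'))  # o  ('LATIN SMALL LETTER O WITH STROKE')
```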

This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant indeed. In fact, it's more of a hack, as pointed out in comments, since Unicode names are really just names; they give no guarantee of being consistent or anything.

There are still special letters that are not handled by this, such as turned and inverted letters, since their Unicode name does not contain 'WITH'. It depends on what you want to do anyway. I sometimes needed accent stripping to achieve dictionary sort order.

EDIT NOTE:

Incorporated suggestions from the comments (handling lookup errors, Python-3 code).

In my view, the proposed solutions should NOT be accepted answers. The original question is asking for the removal of accents, so the correct answer should only do that, not that plus other, unspecified changes.

Simply observe the result of this code, which is the accepted answer, where I have replaced "Málaga" with "Málagueña":

accented_string = u'Málagueña'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaguena' and is of type 'str'

There is an additional change (ñ -> n), which is not requested in the original question.

A simple function that does the requested task, in lowercase form:

import re

def f_remove_accents(old):
    """
    Removes common accent characters, lower form.
    Uses: regex.
    """
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    return new
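A quick check (my own input) confirms the point made above: unlike unidecode, this version leaves 'ñ' untouched, because ñ is not listed in any of the substitution classes:

```python
import re

def f_remove_accents(old):
    """Removes common accent characters, lower form. Uses: regex."""
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    return new

print(f_remove_accents('Málagueña'))  # malagueña -- the ñ is untouched
```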

In response to @MiniQuark's answer:

I was trying to read in a CSV file that was half-French (containing accents), along with some strings that would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

Montréal, über, 12.89, Mère, Françoise, noël, 889

I had to include lines 2 and 3 to get it to work (which I found in a Python ticket), as well as incorporate @Jabba's comment:

import sys 
reload(sys) 
sys.setdefaultencoding("utf-8")
import csv
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

with open('test.txt') as f:
    read = csv.reader(f)
    for row in read:
        for element in row:
            print remove_accents(element)

The result:

Montreal
uber
12.89
Mere
Francoise
noel
889

(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)

gensim.utils.deaccent(text) from Gensim - topic modelling for humans. For example, deaccenting the Czech sample sentence from Gensim's documentation:

deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
'Sef chomutovskych komunistu dostal postou bily prasek'

Another solution is unidecode.

Note that the suggested solution with unicodedata typically removes accents only from some characters (e.g. it turns 'ł' into '', rather than into 'l').
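The reason is visible in the data (a small check of my own): 'ł' has no Unicode decomposition, so NFKD leaves it alone, and the encode/ignore step then drops the letter entirely:

```python
import unicodedata

nfkd = unicodedata.normalize('NFKD', 'ł')
print(nfkd)                            # ł   (no decomposition, nothing to strip)
print(nfkd.encode('ascii', 'ignore'))  # b'' (the whole letter is discarded)
```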

Performance plot:

import unicodedata
from random import choice

import perfplot
import regex
import text_unidecode


def remove_accent_chars_regex(x: str):
    return regex.sub(r'\p{Mn}', '', unicodedata.normalize('NFKD', x))


def remove_accent_chars_join(x: str):
    # answer by MiniQuark
    # https://stackoverflow.com/a/517974/7966259
    return u"".join([c for c in unicodedata.normalize('NFKD', x) if not unicodedata.combining(c)])


perfplot.show(
    setup=lambda n: ''.join([choice('Málaga François Phút Hơn 中文') for i in range(n)]),
    kernels=[
        remove_accent_chars_regex,
        remove_accent_chars_join,
        text_unidecode.unidecode,
    ],
    labels=['regex', 'join', 'unidecode'],
    n_range=[2 ** k for k in range(22)],
    equality_check=None, relative_to=0, xlabel='str len'
)

In some languages, combining diacritics form letters of the alphabet, while other diacritics only indicate accent.

I think it is safer to specify explicitly which diacritics you want to strip:

import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))
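For instance (sample words of my own; the function is repeated so the snippet runs standalone), the acute accent and the tilde are stripped because they are listed, while the cedilla of 'ç' survives because COMBINING CEDILLA is not:

```python
import unicodedata

def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))

print(strip_accents('Málagueña'))  # Malaguena -- acute and tilde are listed
print(strip_accents('français'))   # français  -- the cedilla is not, so ç is kept
```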

If you are hoping to get functionality similar to Elasticsearch's asciifolding filter, you might want to consider fold-to-ascii, which is [itself]...

A Python port of the Apache Lucene ASCII Folding Filter that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into ASCII equivalents, if they exist.

Here's an example from the page mentioned above:

from fold_to_ascii import fold
s = u'Astroturf® paté'
fold(s)
> u'Astroturf pate'
fold(s, u'?')
> u'Astroturf? pate'

EDIT: The fold_to_ascii module seems to work well for normalizing Latin-based alphabets; however, unmappable characters are removed, which means that this module will reduce Chinese text, for example, to empty strings. If you want to preserve Chinese, Japanese, and other Unicode alphabets, consider using @mo-han's remove_accent_chars_regex implementation, above.

Here is a short function which strips the diacritics, but keeps the non-Latin characters. Most cases (e.g., "à" -> "a") are handled by unicodedata (standard library), but several (e.g., "æ" -> "ae") rely on the given parallel strings.

Code

from unicodedata import combining, normalize

LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.lower().translate(outliers)) if not combining(c))

NB. The default argument outliers is evaluated once and is not meant to be provided by the caller.

Intended usage

As a key to sort a list of strings in a more "natural" order:

sorted(['cote', 'coteau', "crottez", 'crotté', 'côte', 'côté'], key=remove_diacritics)

Output:

['cote', 'côte', 'côté', 'coteau', 'crotté', 'crottez']

If your strings mix text and numbers, you may be interested in composing remove_diacritics() with the function string_to_pairs() I give elsewhere.

Tests

To make sure the behavior meets your needs, take a look at the pangrams below:

examples = [
    ("hello, world", "hello, world"),
    ("42", "42"),
    ("你好,世界", "你好,世界"),
    (
        "Dès Noël, où un zéphyr haï me vêt de glaçons würmiens, je dîne d’exquis rôtis de bœuf au kir, à l’aÿ d’âge mûr, &cætera.",
        "des noel, ou un zephyr hai me vet de glacons wuermiens, je dine d’exquis rotis de boeuf au kir, a l’ay d’age mur, &caetera.",
    ),
    (
        "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg.",
        "falsches ueben von xylophonmusik quaelt jeden groesseren zwerg.",
    ),
    (
        "Љубазни фењерџија чађавог лица хоће да ми покаже штос.",
        "љубазни фењерџија чађавог лица хоће да ми покаже штос.",
    ),
    (
        "Ljubazni fenjerdžija čađavog lica hoće da mi pokaže štos.",
        "ljubazni fenjerdzija cadavog lica hoce da mi pokaze stos.",
    ),
    (
        "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.",
        "quizdeltagerne spiste jordbaer med flode, mens cirkusklovnen walther spillede pa xylofon.",
    ),
    (
        "Kæmi ný öxi hér ykist þjófum nú bæði víl og ádrepa.",
        "kaemi ny oexi her ykist þjofum nu baedi vil og adrepa.",
    ),
    (
        "Glāžšķūņa rūķīši dzērumā čiepj Baha koncertflīģeļu vākus.",
        "glazskuna rukisi dzeruma ciepj baha koncertfligelu vakus.",
    )
]

for (given, expected) in examples:
    assert remove_diacritics(given) == expected

Case-preserving variant

LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü  Ä  Æ  Ǽ  Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö  Œ  ẞ  Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))
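A quick check of this variant (sample inputs of my own). Note that the capital sharp s must be written as the single character 'ẞ' (U+1E9E): a two-character key such as "SS" would be rejected by str.maketrans, which only accepts length-1 string keys:

```python
from unicodedata import combining, normalize

LATIN = "ä  æ  ǽ  đ ð ƒ ħ ı ł ø ǿ ö  œ  ß  ŧ ü  Ä  Æ  Ǽ  Đ Ð Ƒ Ħ I Ł Ø Ǿ Ö  Œ  ẞ  Ŧ Ü "
ASCII = "ae ae ae d d f h i l o o oe oe ss t ue AE AE AE D D F H I L O O OE OE SS T UE"

def remove_diacritics(s, outliers=str.maketrans(dict(zip(LATIN.split(), ASCII.split())))):
    # Same as the lowercase version, minus the call to s.lower().
    return "".join(c for c in normalize("NFD", s.translate(outliers)) if not combining(c))

print(remove_diacritics('Dès Noël'))  # Des Noel -- case is preserved
print(remove_diacritics('Łódź'))      # Lodz
```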

There are already many answers here, but this was not previously considered: using sklearn

from sklearn.feature_extraction.text import strip_accents_ascii, strip_accents_unicode

accented_string = u'Málagueña®'

print(strip_accents_unicode(accented_string)) # output: Malaguena®
print(strip_accents_ascii(accented_string)) # output: Malaguena

This is particularly useful if you are already using sklearn to process text. These are the functions internally called by classes like CountVectorizer to normalize strings: when using strip_accents='ascii', strip_accents_ascii is called, and when strip_accents='unicode' is used, strip_accents_unicode is called.

More details

Finally, consider these details from its docstring:

Signature: strip_accents_ascii(s)
Transform accentuated unicode symbols into ascii or nothing

Warning: this solution is only suited for languages that have a direct
transliteration to ASCII symbols.

and

Signature: strip_accents_unicode(s)
Transform accentuated unicode symbols into their simple counterpart

Warning: the python-level loop and join operations make this
implementation 20 times slower than the strip_accents_ascii basic
normalization.

You can do it the following way, after installing the unidecode library:

import unidecode
normalize_text = unidecode.unidecode
x = 'ÁâúuùÚ'
normalized_x = normalize_text(x)
print(normalized_x)
# 'AauuuU'

The result will be: 'AauuuU', as shown in the last line. (unidecode always returns ASCII-only text, so every accented vowel is folded.)
