Remove special characters from a string

I have a string, "Mikael Håfström", which contains some special characters. How can I remove them using Python?

You can use the unicodedata module to normalize Unicode strings and then encode them in their ASCII form, like so:

>>> import unicodedata
>>> source = u'Mikael Håfström'
>>> unicodedata.normalize('NFKD', source).encode('ascii', 'ignore')
'Mikael Hafstrom'
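The session above is Python 2, where `encode` returns a plain `str`. In Python 3 the same call returns `bytes`, so a decode step is needed to get text back. A minimal sketch, assuming Python 3; the helper name `strip_accents` is hypothetical and not part of the original answer:

```python
import unicodedata

def strip_accents(text):
    """Return `text` with accented letters reduced to ASCII (hypothetical helper)."""
    # NFKD splits a letter such as 'å' into 'a' plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding with 'ignore' drops everything outside ASCII, including the
    # combining marks; decoding turns the resulting bytes back into str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('Mikael H\u00e5fstr\u00f6m'))  # Mikael Hafstrom
```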

One notable exception is that the letters 'đ' and 'Đ' have no Unicode decomposition, so normalization leaves them unchanged and the ASCII encoding step then drops them; they are omitted from the result rather than converted to 'd'. The letter represents a voiced alveolo-palatal affricate in the Latin alphabet of some South-East European (SEE) languages, so it may or may not concern you, depending on your audience and on whether you're providing full support for extended Latin character sets such as Latin-2. I saw this on Python 2.6.5 (Mar 19 2010) running locally; since it follows from the Unicode data rather than from a Python bug, newer releases should behave the same way.
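The omission of 'đ' and 'Đ' can be checked directly with `unicodedata.decomposition`, which returns an empty string for characters that have no decomposition mapping. This is a small sketch assuming Python 3; the helper name `ascii_fold` and the sample name are illustrative only:

```python
import unicodedata

def ascii_fold(text):
    """Normalize to NFKD, then drop every non-ASCII code point (hypothetical helper)."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

# '\u00e5' is 'å', '\u0111' is 'đ', '\u0110' is 'Đ'.
for ch in ('\u00e5', '\u0111', '\u0110'):
    # An empty decomposition string means NFKD leaves the character as it is.
    print(repr(ch), repr(unicodedata.decomposition(ch)))

# Both 'Đ' and 'đ' are dropped instead of becoming 'D' and 'd'.
print(ascii_fold('\u0110or\u0111e'))  # ore
```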

For example, using the encode method: u"Mikael Håfström".encode("ascii", "ignore")
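Note that encoding alone, without normalizing first, removes precomposed accented letters entirely instead of keeping their base letter. A short comparison, assuming Python 3 and a precomposed (NFC) input string; the variable names are illustrative:

```python
import unicodedata

source = 'Mikael H\u00e5fstr\u00f6m'  # precomposed 'å' and 'ö'

# Encoding alone: the accented letters are not ASCII, so they vanish.
encode_only = source.encode('ascii', 'ignore').decode('ascii')

# Normalizing first: the base letters survive, only the combining marks vanish.
normalized = unicodedata.normalize('NFKD', source).encode('ascii', 'ignore').decode('ascii')

print(encode_only)  # Mikael Hfstrm
print(normalized)   # Mikael Hafstrom
```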

See this effbot article (includes code). It makes reasonable transliterations into ASCII characters where possible. It is possible to extend the built-in conversion table to handle many other characters (e.g. those used in Eastern European languages) that don't have a canonical decomposition.
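The idea of extending a conversion table can be sketched as follows. This is not the effbot code; it is a minimal illustration assuming Python 3, and the table `EXTRA_MAP`, its entries, and the helper `to_ascii` are all hypothetical choices:

```python
import unicodedata

# Hypothetical hand-written mappings for letters that have no
# canonical decomposition and would otherwise be dropped.
EXTRA_MAP = str.maketrans({
    '\u0111': 'd', '\u0110': 'D',   # đ, Đ
    '\u0142': 'l', '\u0141': 'L',   # ł, Ł
    '\u00f8': 'o', '\u00d8': 'O',   # ø, Ø
    '\u00df': 'ss',                 # ß
})

def to_ascii(text):
    """Transliterate to ASCII: explicit table first, then NFKD (hypothetical helper)."""
    # Apply the explicit mappings before normalization so that these
    # letters are replaced instead of being discarded by the encode step.
    text = text.translate(EXTRA_MAP)
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('\u0110or\u0111e H\u00e5fstr\u00f6m'))  # Dorde Hafstrom
```

Characters that are in neither the table nor the decomposition data are still dropped, so the table has to be grown to match the languages you expect.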
