
How to strip unicode “punctuation” from Python string

Here's the problem: I have a unicode string as input to a Python sqlite query. The query failed ('like'). It turns out the string 'FRANCE' doesn't have six characters, it has seven. And the seventh is ... unicode U+FEFF, a zero-width no-break space.

How on earth do I trap a class of such things before the query?

You can use the general categories from the Unicode character database, which Python exposes through the unicodedata module:

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'

The categories for punctuation characters start with 'P', as you can see. So you need to filter the string character by character (using a list comprehension).
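
A minimal sketch of that character-by-character filter, written for Python 3 (an assumption; the transcripts here are Python 2). The helper name `strip_punctuation` is made up for illustration. Note that it only removes characters in the 'P' categories, so it would leave U+FEFF untouched:

```python
import unicodedata

def strip_punctuation(text):
    """Drop every character whose general category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(strip_punctuation("Hello, world!"))       # Hello world
print(strip_punctuation("\u00abFRANCE\u00bb"))  # FRANCE (guillemets are Pi/Pf)
```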

In your case:

>>> unicodedata.category(u'\ufeff')
'Cf'

'Cf' is the "format" category, not punctuation. So you can whitelist characters based on their categories instead.
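
One way such a category whitelist might look in Python 3. The set of allowed categories below (letters, decimal digits, space separators) is only an assumption for illustration; pick whatever fits your data. `keep_allowed` is a hypothetical helper name:

```python
import unicodedata

# Hypothetical whitelist: letters (L*), decimal digits (Nd), spaces (Zs).
ALLOWED_PREFIXES = ("L", "Nd", "Zs")

def keep_allowed(text):
    """Keep only characters whose category matches the whitelist."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch).startswith(ALLOWED_PREFIXES)
    )

cleaned = keep_allowed("FRANCE\ufeff")
print(repr(cleaned), len(cleaned))  # 'FRANCE' 6
```

Because U+FEFF is in category 'Cf', it falls outside the whitelist and is dropped along with anything else you did not explicitly allow.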

In general, input validation should be done with a whitelist of allowable characters, if you can define such a thing for your use case. Then you simply throw out anything that isn't on the whitelist (or reject the input altogether).

If you can define a set of allowed characters, then you can use a regular expression to strip out everything else.

For example, let's say you know "country" will only contain upper-case English letters and spaces. You could strip out everything else, including your nasty unicode character, like this:

>>> import re
>>> country = u'FRANCE\ufeff'
>>> clean_pattern = re.compile(u'[^A-Z ]+')
>>> clean_pattern.sub('', country)
u'FRANCE'
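
If you would rather reject bad input than silently clean it, a sketch along these lines could work in Python 3. The function name `validate_country` and the choice of `ValueError` are assumptions, not part of the original answer:

```python
import re

# Same whitelist as above: upper-case English letters and spaces only.
ALLOWED = re.compile(r"[A-Z ]+")

def validate_country(value):
    """Return value unchanged if it matches the whitelist, else raise."""
    if ALLOWED.fullmatch(value) is None:
        raise ValueError("unexpected characters in %r" % value)
    return value

print(validate_country("FRANCE"))  # FRANCE

try:
    validate_country("FRANCE\ufeff")
except ValueError as exc:
    print("rejected:", exc)
```

Rejecting makes the problem visible at the point of input, instead of letting a cleaned-up string hide where the stray character came from.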

If you can't define a set of allowed characters, you're in deep trouble, because it becomes your task to anticipate all the tens of thousands of unexpected unicode characters that could be thrown at you, and more are added to the standard every year.

U+FEFF is also the byte-order mark (BOM). Just clean up your strings first to eliminate it, using something like:


>>> f = u'France\ufeff'
>>> f
u'France\ufeff'
>>> print f
France
>>> f.replace(u'\ufeff', '')
u'France'
>>> f.strip(u'\ufeff')
u'France'
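
The two calls above are not interchangeable: strip() only removes the character from the ends of the string, while replace() removes it everywhere. A short Python 3 sketch of the difference, plus the 'utf-8-sig' codec for the case where the input arrives as UTF-8 bytes with a leading BOM (that it arrives as bytes is an assumption):

```python
BOM = "\ufeff"

trailing = "France" + BOM
embedded = "Fra" + BOM + "nce"

print(len(trailing.strip(BOM)))        # 6
print(len(embedded.strip(BOM)))        # 7, the interior character survives
print(len(embedded.replace(BOM, "")))  # 6

# 'utf-8-sig' drops a single leading BOM while decoding bytes.
raw = b"\xef\xbb\xbfFrance"
decoded = raw.decode("utf-8-sig")
print(decoded, len(decoded))           # France 6
```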
