How to remove unicode "punctuation" from a Python string

Question

Here's the problem: I have a unicode string as input to a Python sqlite query. The query (a 'like') failed. It turns out the string 'FRANCE' doesn't have six characters, it has seven, and the seventh is unicode U+FEFF, a zero-width no-break space.

How on earth do I trap a class of such things before the query?

Answer 1

You can use the character categories from Python's unicodedata module, which exposes the Unicode data table:

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'

The categories for punctuation characters start with 'P', as you can see. So you need to filter them out character by character (for example, with a list comprehension).
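The filtering described above could be sketched as follows. This is a minimal illustration assuming Python 3 (where every `str` is unicode); the function name `strip_punctuation` is made up for this example, and it uses a generator expression inside `join`, which does the same job as a list comprehension.

```python
import unicodedata


def strip_punctuation(text):
    """Return text without characters whose general category starts with 'P'.

    Hypothetical helper for illustration; only unicodedata.category is a
    real library call here.
    """
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )


print(strip_punctuation("Hello, world... «ok»!"))  # Hello world ok
```

Note that this only removes punctuation. U+FEFF is in category 'Cf' (format), so a filter on 'P' alone leaves it in place.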

In your case:

>>> unicodedata.category(u'\ufeff')
'Cf'

So you can whitelist characters based on their categories.
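One way such a category whitelist might look, again assuming Python 3. The choice of allowed categories (letters, numbers, and the plain space separator) and the names `ALLOWED_PREFIXES`, `ALLOWED_EXACT`, and `keep_allowed` are assumptions for this sketch, not part of the original answer; adjust them to whatever your data legitimately contains.

```python
import unicodedata

# Assumed whitelist: all letter (L*) and number (N*) categories,
# plus the space separator category 'Zs'.
ALLOWED_PREFIXES = ("L", "N")
ALLOWED_EXACT = {"Zs"}


def keep_allowed(text):
    """Keep only characters whose general category is whitelisted."""
    kept = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith(ALLOWED_PREFIXES) or cat in ALLOWED_EXACT:
            kept.append(ch)
    return "".join(kept)


print(repr(keep_allowed("FRANCE\ufeff")))  # 'FRANCE'
```

With this particular whitelist, combining marks (categories starting with 'M') are dropped as well, so text containing decomposed accents may need `unicodedata.normalize("NFC", text)` first, or 'M' added to the allowed prefixes.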

Answer 2

In general, input validation should be done with a whitelist of allowable characters, if you can define such a thing for your use case. Then you simply throw out anything that isn't on the whitelist (or reject the input altogether).

If you can define a set of allowed characters, then you can use a regular expression to strip out everything else.

For example, let's say you know "country" will only contain upper-case English letters and spaces. You could strip out everything else, including your nasty unicode character, like this:

>>> import re
>>> country = u'FRANCE\ufeff'
>>> clean_pattern = re.compile(u'[^A-Z ]+')
>>> clean_pattern.sub('', country)
u'FRANCE'

If you can't define a set of allowed characters, you're in deep trouble, because it becomes your task to anticipate all of the tens of thousands of possible unexpected unicode characters that could be thrown at you, and more are added to the spec as languages evolve over the years.

Answer 3

That's also the byte-order mark (BOM). Just clean up your strings first to eliminate those, using something like:


>>> f = u'France\ufeff'
>>> f
u'France\ufeff'
>>> print f
France
>>> f.replace(u'\ufeff', '')
u'France'
>>> f.strip(u'\ufeff')
u'France'
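If the stray U+FEFF comes from decoding bytes that start with a UTF-8 BOM, it may be possible to avoid it at the source. The sketch below assumes Python 3 and uses the standard `utf-8-sig` codec; the byte string and variable names are made up for illustration. It also shows why `replace` is the safer of the two calls above when the character can appear anywhere in the string.

```python
# Assumed input: UTF-8 bytes that begin with a BOM (EF BB BF).
raw = b"\xef\xbb\xbfFRANCE"

with_bom = raw.decode("utf-8")         # BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # codec drops one leading BOM

print(repr(with_bom))     # '\ufeffFRANCE'
print(repr(without_bom))  # 'FRANCE'

# 'utf-8-sig' only handles a BOM at the very start of the data, and
# str.strip only trims the ends, so an embedded or trailing U+FEFF
# still needs an explicit replace.
embedded = "FR\ufeffANCE\ufeff"
stripped = embedded.strip("\ufeff")        # leaves the inner one
replaced = embedded.replace("\ufeff", "")  # removes every occurrence

print(repr(stripped))  # 'FR\ufeffANCE'
print(repr(replaced))  # 'FRANCE'
```

In the question the extra character sits at the end of 'FRANCE', so `utf-8-sig` alone would not have caught it; the `replace` call would.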

3 answers

Answer 1: score 11, accepted, 2011-03-24 04:45:33

Answer 2: score 1, 2011-03-24 04:56:01

Answer 3: score 0, 2011-03-24 04:42:36
