Remove special characters and punctuation from a Unicode string

Question

I'm trying to remove the punctuation from a Unicode string, which may contain non-ASCII letters. I tried using the regex module:

import regex
text = u"<Üäik>"
regex.sub(ur"\p{P}+", "", text)

However, I've noticed that the characters < and > don't get removed. Does anyone know why, and is there another way to strip punctuation from Unicode strings?

EDIT: Another approach I've tried is:

import string
text = text.encode("utf8").translate(None, string.punctuation).decode("utf8")

but I would like to avoid converting the text from unicode to a byte string and back.

Answer 1

< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match both categories:

regex.sub('[\p{P}\p{Sm}]+', '', text)
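If you want to check which category a character falls in, the standard-library unicodedata module can tell you. Here is a small sketch, written for Python 3 (unlike the Python 2 snippets in this answer); the helper name categories is made up for illustration.

```python
import unicodedata

def categories(text):
    """Map each distinct character in text to its Unicode general category."""
    return {ch: unicodedata.category(ch) for ch in text}

# '<' and '>' report Sm (math symbol), while '!' (Po) and '-' (Pd)
# fall under the P categories that \p{P} matches.
for ch, cat in categories("<>!-").items():
    print(repr(ch), cat)
```

Any category starting with P is covered by \p{P}; Sm has to be added separately.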

The unicode.translate() method exists too. It takes a dictionary mapping integers (codepoints) to other integer codepoints, unicode characters, or None; None removes that codepoint. Map string.punctuation to codepoints with ord():

text.translate(dict.fromkeys(ord(c) for c in string.punctuation))

That removes only the limited set of ASCII punctuation characters.

Demo:

>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub('[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik

If string.punctuation is not enough, you can generate a complete unicode.translate() mapping for all P and Sm codepoints by iterating from 0 to sys.maxunicode and testing each value with unicodedata.category():

>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik

(For Python 3, replace unicode with str, unichr() with chr(), and print ... with print(...).)
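For reference, a Python 3 version of the full-table approach might look like the sketch below. It uses only the standard library; the names build_removal_table, REMOVE_PUNCT and strip_punctuation are assumptions made for this example, and the table is built once at import time because scanning every codepoint is comparatively slow.

```python
import sys
import unicodedata

def build_removal_table(prefixes=("P", "Sm")):
    """Map every codepoint whose category starts with a prefix to None."""
    return dict.fromkeys(
        i for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith(prefixes)
    )

# Build once and reuse; str.translate() deletes codepoints mapped to None.
REMOVE_PUNCT = build_removal_table()

def strip_punctuation(text):
    return text.translate(REMOVE_PUNCT)

print(strip_punctuation("<Üäik>"))
print(strip_punctuation("«Hello», world… 1+1=2"))
```

Note that currency symbols (Sc) and modifier symbols (Sk), such as $ and ^, are not covered by these two prefixes; pass a different prefixes tuple if you want them removed as well.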

Answer 2

Try the string module together with re:

import string,re
text = u"<Üäik>"
out = re.sub('[%s]' % re.escape(string.punctuation), '', text)
print out
print type(out)

This prints:

Üäik
<type 'unicode'>
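A rough Python 3 equivalent of the snippet above, with the pattern compiled once, could look like this. The names ASCII_PUNCT_RE and remove_ascii_punctuation are hypothetical; like string.punctuation itself, this only handles ASCII punctuation.

```python
import re
import string

# Character class matching any single ASCII punctuation character;
# re.escape() protects characters such as ']' and '\' inside the class.
ASCII_PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))

def remove_ascii_punctuation(text):
    """Delete every ASCII punctuation character from text."""
    return ASCII_PUNCT_RE.sub("", text)

out = remove_ascii_punctuation("<Üäik>")
print(out)
print(type(out))  # the result is a str, no encode/decode round trip
```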

Answer 3

\p{P} matches punctuation characters.

Among the ASCII characters, those are:

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }

< and > are not punctuation characters, so they won't be removed.

Try this instead, using the regex module (the standard re module does not support \p{...}):

regex.sub(ur'[\p{P}<>]+', "", text)

3 answers

Answer 1
3 accepted 2015-11-18 18:41:21

Answer 2
1 2015-11-18 18:26:04

Answer 3
0 2015-11-18 18:57:06
