Strip special characters and punctuation from a unicode string
I'm trying to remove the punctuation from a unicode string, which may contain non-ASCII letters. I tried using the regex module:
import regex
text = u"<Üäik>"
regex.sub(ur"\p{P}+", "", text)
However, I've noticed that the characters < and > don't get removed. Does anyone know why, and is there any other way to strip punctuation from unicode strings?
EDIT: Another approach I've tried out is doing:
import string
text = text.encode("utf8").translate(None, string.punctuation).decode("utf8")
but I would like to avoid converting the text from unicode to a byte string and back.
< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match either:
regex.sub('[\p{P}\p{Sm}]+', '', text)
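To check how a given character is classified, the standard library's unicodedata.category() can be queried directly. This is a minimal sketch assuming Python 3; the helper name categories is made up for illustration.

```python
import unicodedata

def categories(chars):
    """Map each character to its Unicode general category (hypothetical helper)."""
    return {ch: unicodedata.category(ch) for ch in chars}

# '<' and '>' report 'Sm' (math symbol), while '!' and '-' report
# punctuation categories ('Po' and 'Pd') and 'Ü' reports 'Lu' (letter).
print(categories("<>!-Ü"))
```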
The unicode.translate() method exists too and takes a dictionary mapping integer numbers (codepoints) to either other integer codepoints, a unicode character, or None; None removes that codepoint. Map string.punctuation to codepoints with ord():
text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
That removes only the limited set of ASCII punctuation characters.
Demo:
>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub('[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik
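To illustrate the ASCII-only limitation of string.punctuation, here is a small sketch assuming Python 3, where str.translate() accepts the same kind of dictionary. The function name strip_ascii_punct is hypothetical.

```python
import string

# Python 3: str.translate() takes a dict of code point -> replacement;
# a value of None deletes the character.
ASCII_PUNCT = dict.fromkeys(map(ord, string.punctuation))

def strip_ascii_punct(text):
    """Remove only the characters listed in string.punctuation."""
    return text.translate(ASCII_PUNCT)

print(strip_ascii_punct("<Üäik>"))   # '<' and '>' are in string.punctuation
print(strip_ascii_punct("«Üäik»…"))  # guillemets and the ellipsis are left alone
```

The second call returns its input unchanged, because none of those characters appear in string.punctuation.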
If string.punctuation is not enough, then you can generate a complete unicode.translate() mapping for all P and Sm codepoints by iterating from 0 to sys.maxunicode, then testing those values against unicodedata.category():
>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik
(For Python 3, replace unicode with str, unichr() with chr(), and print ... with print(...).)
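For reference, the same idea can be sketched for Python 3 as a table that is built once and then reused. The names build_removal_table and strip_punct are made up for this example, and the default prefixes ("P", "Sm") are just the categories discussed above.

```python
import sys
import unicodedata

def build_removal_table(prefixes=("P", "Sm")):
    """Map every code point whose category starts with a given prefix to None."""
    return dict.fromkeys(
        i for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith(prefixes)
    )

# Scanning all code points takes a moment, so build the table once and reuse it.
REMOVE = build_removal_table()

def strip_punct(text):
    """Delete all punctuation (P*) and math symbols (Sm) from text."""
    return text.translate(REMOVE)

print(strip_punct("<Üäik>"))
print(strip_punct("«Üäik»…"))
```

Characters in other symbol categories, such as the currency sign $ (Sc), are kept unless their prefix is added to the tuple.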
Try the string module:
import string,re
text = u"<Üäik>"
out = re.sub('[%s]' % re.escape(string.punctuation), '', text)
print out
print type(out)
Prints:
Üäik
<type 'unicode'>
\p{P} matches punctuation characters.
Among the ASCII characters, those are
! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }
< and > are not punctuation characters, so they won't be removed.
Try this instead (with the regex module, since re does not support \p{...}):
regex.sub('[\p{P}<>]+', "", text)