
Strip special characters and punctuation from a unicode string

I'm trying to remove the punctuation from a unicode string, which may contain non-ASCII letters. I tried using the regex module:

import regex
text = u"<Üäik>"
regex.sub(ur"\p{P}+", "", text)

However, I've noticed that the characters < and > don't get removed. Does anyone know why, and is there any other way to strip punctuation from unicode strings?

EDIT: Another approach I've tried is:

import string
text = text.encode("utf8").translate(None, string.punctuation).decode("utf8")

but I would like to avoid converting the text from unicode to a byte string and back.

< and > are classified as Math Symbols (Sm), not Punctuation (P). You can match both categories:

regex.sub(r'[\p{P}\p{Sm}]+', '', text)
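
To check which category a given character falls into, the standard-library unicodedata module can be queried directly. This is a minimal sketch written for Python 3 (an assumption, since the snippets in this thread are Python 2); the sample characters are chosen purely for illustration:

```python
import unicodedata

# Look up the Unicode general category of a few sample characters.
# Categories starting with "P" are punctuation; "Sm" is Math Symbol.
samples = "<>!-Ü"
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

Here < and > report Sm, while ! and - report punctuation categories (Po and Pd), which is why \p{P} alone leaves the angle brackets in place.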

The unicode.translate() method is another option. It takes a dictionary mapping integer codepoints to other integer codepoints, to unicode strings, or to None; a value of None removes that codepoint. Map string.punctuation to codepoints with ord():

text.translate(dict.fromkeys(ord(c) for c in string.punctuation))

That removes only the limited set of ASCII punctuation characters.

Demo:

>>> import regex
>>> text = u"<Üäik>"
>>> print regex.sub(r'[\p{P}\p{Sm}]+', '', text)
Üäik
>>> import string
>>> print text.translate(dict.fromkeys(ord(c) for c in string.punctuation))
Üäik
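
The ASCII-only limitation of string.punctuation is easy to observe. The following sketch assumes Python 3 (where str is already unicode) and uses a made-up sample string containing guillemets and an ellipsis character:

```python
import string

# Translation table mapping each ASCII punctuation codepoint to None (delete).
ascii_punct = dict.fromkeys(ord(c) for c in string.punctuation)

# Hypothetical sample: ASCII angle brackets plus non-ASCII punctuation.
sample = "<Üäik> «quoted»…"
cleaned = sample.translate(ascii_punct)
print(cleaned)
```

The angle brackets are removed because they are listed in string.punctuation, but the guillemets and the ellipsis remain, since they are not ASCII characters.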

If string.punctuation is not enough, you can generate a complete unicode.translate() mapping for all P and Sm codepoints by iterating from 0 to sys.maxunicode and testing each value with unicodedata.category():

>>> import sys, unicodedata
>>> toremove = dict.fromkeys(i for i in range(0, sys.maxunicode + 1) if unicodedata.category(unichr(i)).startswith(('P', 'Sm')))
>>> print text.translate(toremove)
Üäik

(For Python 3, replace unicode with str, unichr() with chr(), and print ... with print(...).)
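
For reference, here is how the same idea might look as a reusable Python 3 helper. The names build_punct_table, PUNCT_TABLE and strip_punct are hypothetical, introduced only for this sketch, and the sample string is made up:

```python
import sys
import unicodedata

def build_punct_table(prefixes=("P", "Sm")):
    """Map every codepoint whose category starts with a prefix to None."""
    return dict.fromkeys(
        i for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith(prefixes)
    )

# Build the table once and reuse it; scanning every codepoint is not free.
PUNCT_TABLE = build_punct_table()

def strip_punct(text):
    """Remove all punctuation and math-symbol characters from text."""
    return text.translate(PUNCT_TABLE)

result = strip_punct("<Üäik> «quoted»… 1+1=2")
print(result)
```

Note that including Sm also removes characters such as + and =, so digits in an arithmetic expression end up run together. Pass a narrower prefixes tuple if that is not wanted.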

Try the string module:

import string,re
text = u"<Üäik>"
out = re.sub('[%s]' % re.escape(string.punctuation), '', text)
print out
print type(out)

This prints:

Üäik
<type 'unicode'>

\p{P} matches punctuation characters only.

Among the ASCII characters, those are:

! " # % & ' ( ) * , - . / : ; ? @ [ \ ] _ { }

< and > are not punctuation characters (they are math symbols), so they won't be removed.

Try this instead, using the regex module and adding < and > to the character class:

regex.sub(ur'[\p{P}<>]+', u'', text)
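
Keep in mind that \p{...} property classes are a feature of the third-party regex module; the standard-library re module does not implement them. A small check, assuming Python 3.6 or later, where unknown letter escapes in a pattern raise an error:

```python
import re

# The stdlib re module has no \p{...} support; compiling such a pattern
# raises re.error ("bad escape") on Python 3.6+.
try:
    re.compile(r"\p{P}")
    supports_property_classes = True
except re.error:
    supports_property_classes = False

print(supports_property_classes)
```

If installing regex is not an option, a translate() table built from unicodedata.category() covers the same ground using only the standard library.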
