Python regex, remove all punctuation except hyphen for unicode string
Python removing punctuation from unicode string except apostrophe
I found several threads on this topic, and I found this solution:
sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)
This should remove every punctuation mark except ', but the problem is that it also strips everything else from the sentence.
Example:
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'
Of course, what I want is to keep the sentence without punctuation, with "warhol's" left as is.
Desired output:
"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"
Edit: I also tried using
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
    if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)
but this strips every punctuation mark, including the apostrophe.
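For reference, the translation-table approach above can be adapted to skip selected characters. The following is only a sketch in Python 3 syntax (the snippets in this thread are Python 2); the names `KEEP`, `PUNCT_TABLE`, and `remove_punctuation` are hypothetical, and the choice to keep the apostrophe and the hyphen is an assumption based on the desired output.

```python
import sys
import unicodedata

# Characters to preserve (assumption: apostrophe and hyphen, per the desired output).
KEEP = {"'", "-"}

# Map every code point in a Unicode punctuation category (P*) to None,
# which makes str.translate delete it, unless it is listed in KEEP.
PUNCT_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P") and chr(cp) not in KEEP
}

def remove_punctuation(sentence):
    """Delete punctuation from sentence, except the characters in KEEP."""
    return sentence.translate(PUNCT_TABLE)

print(remove_punctuation("warhol's art, film, and music."))
```

Building the table walks the whole Unicode range once, so it is best created a single time and reused.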
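Specify all the elements you do not want removed, i.e. \w, \d, \s, etc. This is what the ^ operator signifies inside square brackets (match anything except).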
>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
>>>
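The same negated character class can be sketched in Python 3 syntax, with the hyphen added to the set of kept characters so that the "austro-hungarian empire" case is covered too. This is a minimal sketch, not the answerer's code: `KEEP_PATTERN` and `strip_punctuation` are hypothetical names. It assumes Python 3, where `\w` is Unicode-aware for `str` patterns and already includes digits, so `\d` is omitted.

```python
import re

# Negated character class: match every run of characters that are NOT
# word characters, whitespace, an apostrophe, or a hyphen.
# The hyphen is placed last so it is read literally, not as a range.
KEEP_PATTERN = re.compile(r"[^\w\s'-]+")

def strip_punctuation(sentence):
    """Delete everything except word characters, whitespace, ' and -."""
    return KEEP_PATTERN.sub("", sentence)

print(strip_punctuation("warhol's art, film, and music."))
print(strip_punctuation("austro-hungarian empire!"))
```

Note that this removes symbols as well as punctuation (for example `$` or `+`), while underscores survive because `\w` matches them.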