Python regex, remove all punctuation except hyphen for unicode string
I have this code for removing all punctuation from a string:
import regex as re
re.sub(ur"\p{P}+", "", txt)
How would I change it to allow hyphens? If you could explain how you did it, that would be great. I understand that here, correct me if I'm wrong, P with anything after it is punctuation.
[^\P{P}-]+
\P is the complement of \p, that is, "not punctuation". So this class matches anything that is not (not punctuation, or a dash), which leaves all punctuation except dashes.
Example: http://www.rubular.com/r/JsdNM3nFJ3
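To make the double negation concrete, here is a minimal standard-library sketch of the same logic. It assumes that \p{P} corresponds to Unicode general categories beginning with "P", which unicodedata exposes; the helper names is_punct, matches_class and strip_punct_keep_hyphen are made up for illustration and are not part of any library.

```python
import unicodedata

def is_punct(ch):
    # Hypothetical stand-in for \p{P}: general category starts with "P".
    return unicodedata.category(ch).startswith("P")

def matches_class(ch):
    # Mirrors [^\P{P}-]: NOT (not-punctuation OR hyphen),
    # which is the same as: punctuation AND not a hyphen.
    return not ((not is_punct(ch)) or ch == "-")

def strip_punct_keep_hyphen(text):
    # Drop every character the class would match, keep everything else.
    return "".join(ch for ch in text if not matches_class(ch))

cleaned = strip_punct_keep_hyphen("¡Hola, mundo! Well-known… «ok»")
print(cleaned)  # Hola mundo Well-known ok
```

Because this goes by Unicode category, characters such as $ or + are kept: they are classified as symbols, not punctuation.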
If you want a less convoluted way, an alternative is \p{P}(?<!-): match any punctuation character, and then check that it wasn't a dash (using a negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk
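The lookbehind idea can also be tried with the standard re module. This is only a sketch: since re does not support \p{P}, it substitutes an ASCII-only character class built from string.punctuation, so it will not catch non-ASCII punctuation.

```python
import re
import string

# ASCII-only stand-in for \p{P} (an assumption of this sketch);
# re.escape keeps characters such as ], ^ and \ literal inside the class.
punct_class = "[{}]".format(re.escape(string.punctuation))

# Match one punctuation character, then look back at it and
# reject the match if that character was a hyphen.
pattern = re.compile(punct_class + r"(?<!-)")

result = pattern.sub("", "well-known: yes! (really)")
print(result)  # well-known yes really
```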
Here's how to do it with the re module, in case you have to stick with the standard library:
# works in python 2 and 3
import re
import string
remove = string.punctuation
remove = remove.replace("-", "") # don't remove hyphens
pattern = r"[{}]".format(remove) # create the pattern
txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
re.sub(pattern, "", txt)
# >>> 'this - is - a - test'
If performance matters, you may want to use str.translate, since it's faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
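As a runnable Python 3 sketch of the translate approach, reusing the remove and txt values from the re example above: the dict form maps each code point to None, and str.maketrans with a third argument builds an equivalent deletion table.

```python
import string

# Same character set as in the re example: ASCII punctuation minus the hyphen.
remove = string.punctuation.replace("-", "")
txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."

# Form 1: map each code point to None, which deletes it.
cleaned = txt.translate({ord(char): None for char in remove})

# Form 2: str.maketrans builds the same kind of deletion table.
table = str.maketrans("", "", remove)
cleaned_alt = txt.translate(table)

print(cleaned)  # this - is - a - test
```

Building the table once and reusing it is likely the better choice when many strings are processed.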
You can specify the punctuation you want removed by hand, as in [._,]
You can also supply a function instead of a replacement string:
re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)
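The line above needs the regex module for \p{P}. A rough standard-library variant of the same callback idea might look like the sketch below; it assumes an ASCII-only punctuation class from string.punctuation, and keep_hyphen is a hypothetical helper name.

```python
import re
import string

# ASCII-only punctuation class (assumption: non-ASCII punctuation is ignored).
punct = re.compile("[{}]".format(re.escape(string.punctuation)))

def keep_hyphen(match):
    # Called once per match: return the hyphen unchanged, drop anything else.
    ch = match.group(0)
    return ch if ch == "-" else ""

result = punct.sub(keep_hyphen, "state-of-the-art, really?!")
print(result)  # state-of-the-art really
```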