Python regex, remove all punctuation except hyphen for unicode string

I have this code for removing all punctuation from a Unicode string:

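import regex as re  # third-party regex module; the stdlib re does not support \p{...}
re.sub(r"\p{P}+", "", txt)  # on Python 2, write the pattern as ur"\p{P}+"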

How would I change it to allow hyphens? If you could explain how you did it, that would be great. As I understand it (correct me if I'm wrong), P with anything after it is punctuation.

[^\P{P}-]+

\P is the complement of \p, that is, not punctuation. So the class matches anything that is not (not punctuation or a dash), which leaves all punctuation except dashes.

Example: http://www.rubular.com/r/JsdNM3nFJ3
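
One way to read the double negation is as a per-character predicate. The sketch below spells it out with the standard-library unicodedata module instead of the regex module. The helper names (is_punctuation, class_matches, strip_punctuation_keep_hyphen) are made up for illustration, and the sketch assumes that "punctuation" means a Unicode general category starting with P, which is what \p{P} denotes.

```python
import unicodedata


def is_punctuation(ch):
    # Stand-in for \p{P}: general category Pc, Pd, Ps, Pe, Pi, Pf or Po.
    return unicodedata.category(ch).startswith("P")


def class_matches(ch):
    # [^\P{P}-] reads as: NOT (non-punctuation OR hyphen).
    return not (not is_punctuation(ch) or ch == "-")


def strip_punctuation_keep_hyphen(text):
    # Drop every character the class would match; keep everything else.
    return "".join(ch for ch in text if not class_matches(ch))


print(strip_punctuation_keep_hyphen("«well-known», she said…"))
# well-known she said
```

Symbols such as $, + and ^ belong to the S (symbol) categories rather than P, so neither this sketch nor \p{P} removes them.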

If you want a less convoluted way, an alternative is \p{P}(?<!-): match any punctuation character, and then check that it wasn't a dash (using a negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk
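
The lookbehind trick is not specific to \p{P}. As a rough illustration that runs on the standard re module, the sketch below swaps \p{P} for an ASCII-only class built from string.punctuation. That substitution is an assumption made here so the example needs no third-party package, and the names ASCII_PUNCT and PUNCT_NOT_HYPHEN are chosen for this example only.

```python
import re
import string

# ASCII-only stand-in for \p{P}; re.escape keeps ], \ and ^ literal inside [...].
ASCII_PUNCT = "[" + re.escape(string.punctuation) + "]"

# Match one punctuation character, then look back and reject it if it was "-".
PUNCT_NOT_HYPHEN = re.compile(ASCII_PUNCT + r"(?<!-)")

print(PUNCT_NOT_HYPHEN.sub("", "self-contained, (mostly) hyphen-safe!"))
# self-contained mostly hyphen-safe
```

Because the lookbehind runs after the character has been consumed, it inspects the character that was just matched, which is why a one-character (?<!-) is enough.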

Here's how to do it with the re module, in case you have to stick with the standard library (note that string.punctuation only covers ASCII punctuation):

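# works in Python 2 and 3
import re
import string

remove = string.punctuation
remove = remove.replace("-", "")  # don't remove hyphens
# re.escape keeps characters such as \ and ] literal inside the class;
# without it, the "\]" in string.punctuation is read as an escaped "]"
# and backslashes are silently left in the text
pattern = r"[{}]".format(re.escape(remove))  # create the pattern

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
re.sub(pattern, "", txt)
# >>> 'this - is - a - test'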

If performance matters, you may want to use str.translate, since it's faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
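
A sketch of the str.translate approach for Python 3 follows. The first table covers only the ASCII characters in string.punctuation, as in the answer. The second goes beyond the answer: it scans every code point with unicodedata to approximate \p{P}, on the assumption that "punctuation" means general category P*. The names ASCII_TABLE and UNICODE_TABLE are illustrative.

```python
import string
import sys
import unicodedata

# ASCII punctuation minus the hyphen; the third argument lists characters to delete.
remove = string.punctuation.replace("-", "")
ASCII_TABLE = str.maketrans("", "", remove)

# Assumption: Unicode punctuation = general category starting with "P".
# Building the table scans all code points once, so build it once and reuse it.
UNICODE_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if cp != ord("-") and unicodedata.category(chr(cp)).startswith("P")
}

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
print(txt.translate(ASCII_TABLE))
# this - is - a - test
print("«well-known», she said…".translate(UNICODE_TABLE))
# well-known she said
```

Whether translate is actually faster for a given input is easy to check with the timeit module before committing to either approach.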

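You can either specify the punctuation you want to remove manually, as in [._,], or supply a function instead of the replacement string (re here is the regex module, imported as in the question, since \p{P} is not supported by the standard re):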

re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)
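
When the replacement is a callable, re.sub calls it once per match and substitutes whatever string it returns. The sketch below shows the same idea on the standard re module, using an ASCII class in place of \p{P} (an assumption made here to avoid the third-party dependency) and a hypothetical KEEP set so that more than one character can be spared.

```python
import re
import string

# ASCII-only stand-in for \p{P}, escaped so it is safe inside [...].
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")
KEEP = {"-", "'"}  # hypothetical: punctuation to leave untouched


def keep_or_drop(match):
    # Return the matched character unchanged if it is in KEEP, else delete it.
    ch = match.group(0)
    return ch if ch in KEEP else ""


print(PUNCT.sub(keep_or_drop, "don't over-think it, (really)!"))
# don't over-think it really
```

A named function is a little longer than the lambda, but it leaves room for rules that a character class cannot express, such as keeping a hyphen only between letters.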
