简体   繁体   English

Python正则表达式:删除下划线和破折号,除非它们在字符串替换指令中

[英]Python Regex: remove underscores and dashes except if they are in a dict of string substitutions

I'm pre-processing a string. 我正在预处理一个字符串。 I have a dictionary of 10k string substitutions (eg "John Lennon": "john_lennon" ). 我有一个10k字符串替换的字典(例如"John Lennon": "john_lennon" )。 I want to replace all other punctuation with a space. 我想用空格替换所有其他标点符号。

The problem is some of these string substitutions contain underscores or hyphens, so I want to replace punctuation (except full stops) with spaces unless the word is contained in the keys of this dict. 问题是这些字符串替换中有些包含下划线或连字符,因此我想用空格替换标点符号(句号除外),除非此字典的键中包含单词。 I also want to do it in one Regex expression since the text corpus is quite large and this could be a bottleneck. 我也想用一个 Regex表达式来实现,因为文本语料库很大,这可能是一个瓶颈。

So far, I have: 到目前为止,我有:

import re
input_str = "John Lennon: a musician, artist and activist."
multi_words = dict((re.escape(k), v) for k, v in multi_words.items())
pattern = re.compile("|".join(multi_words.keys()))
output_str = pattern.sub(lambda m: multi_words[re.escape(m.group(0))], input_str)

This replaces all strings using the keys in a dict. 这将使用字典中的键替换所有字符串。 Now I just need to also remove punctuation in the same pass. 现在,我只需要在同一遍中也删除标点符号。 This should return "john_lennon a musician artist and activist." 这应该返回"john_lennon a musician artist and activist."

You could handle the punctuation you would like to remove like the entries in your dictionary: 您可以像字典中的条目一样处理要删除的标点符号:

pattern = re.compile("|".join(multi_words.keys()) + r'|_|-')

and

multiwords['_'] = ' '
multiwords['-'] = ' '

Then these occurrences are treated like your key words. 然后将这些出现视为您的关键字。

But let me remind you that your code only works for a certain set of regular expressions. 但是让我提醒您,您的代码仅适用于某些正则表达式集。 If you have the pattern foo.*bar in your keys and that matches a string like foo123bar , you will not find the corresponding value to the key by passing foo123bar through re.escape() and then searching in your multiword dictionary for it. 如果您的键中包含模式foo.*bar并且与foo123bar类的字符串匹配,则无法通过将foo123bar通过re.escape()传递,然后在multiword字字典中进行搜索来找到与键对应的值。

I think the whole escaping you do should be removed and the code should be commented to make clear that only fixed strings are allowed as keys, not complex regular expressions matching variable inputs. 我认为应该删除所有转义的代码,并对代码进行注释,以明确只允许使用固定字符串作为键,而不是与变量输入匹配的复杂正则表达式。

You can add punctuations (excluding full stop) in a character set as part of the items to match, and then handle punctuations and dict keys separately in the substitution function: 您可以在字符集中添加标点符号(不包括句号)作为要匹配的项的一部分,然后在替换功能中分别处理标点符号和dict键:

import re
import string

punctuation = string.punctuation.replace('.', '')
pattern = re.compile("|".join(multi_words.keys())+
                     "|[{}]".format(re.escape(punctuation)))


def func(m):
   m = m.group(0)
   print(m, re.escape(m))
   if m in string.punctuation:
      return ''
   return multi_words[re.escape(m)]

output_str = pattern.sub(func , input_str)
print(output_str)
# john_lennon a musician artist and activist. 

You could do it by adding one more alternative to the constructed regex which matches a single punctuation character. 您可以通过在已构造的正则表达式中添加一个与单个标点字符匹配的替代项来实现。 When the match is processed, a match not in the dictionary can be replaced with a space, using dictionary's get method. 处理匹配项后,可以使用字典的get方法将不在字典中的匹配项替换为空格。 Here, I use [,:;_-] but you probably want to replace other characters. 在这里,我使用[,:;_-]但您可能想替换其他字符。

Note: I moved the call to re.escape into the construction of the regex to avoid having to call it on every match. 注意:我将re.escape的调用移到了正则表达式的构造中,以避免在每次比赛时都调用它。

import re
input_str = "John Lennon: a musician, artist and activist."
pattern = re.compile(("|".join(map(re.escape, multi_words.keys())) + "|[,:;_-]+")
output_str = pattern.sub(lambda m: multi_words.get(m.group(0), ' '), input_str)

You may use a regex like (?:alt1|alt2...|altN)|([^\\w\\s.]+) and check if Group 1 (that is, any punctuation other than . ) was matched. 您可以使用(?:alt1|alt2...|altN)|([^\\w\\s.]+)类的正则表达式,并检查第1组(即.以外的其他标点符号)是否匹配。 If yes, replace with an empty string: 如果是,请替换为空字符串:

pattern = re.compile(r"(?:{})|([^\w\s.]+)".format("|".join(multi_words.keys())))
output_str = pattern.sub(lambda m: "" if m.group(1) else multi_words[re.escape(m.group(0))], input_str)

See the Python demo . 参见Python演示

A note about _ : if you need to remove it as well, use r"(?:{})|([^\\w\\s.]+|_+)" because [^\\w\\s.] (matching any char other than word, whitespace and . chars) does not match an underscore (a word char) and you need to add it as a separate alternative. 关于_的注释:如果还需要删除它,请使用r"(?:{})|([^\\w\\s.]+|_+)"因为[^\\w\\s.] (匹配余万字,空格和其他任何字符.字符)不匹配下划线(一个字字符),你需要将其添加为一个独立的替代品。

Note on Unicode: if you deal with Unicode strings, in Python 2.x, pass re.U or re.UNICODE modifier flag to the re.compile() method. 关于Unicode的注意事项:如果处理Unicode字符串,则在Python 2.x中,将re.Ure.UNICODE修饰符标志传递给re.compile()方法。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

相关问题 正则表达式可替换除小写字母,数字字符,下划线和破折号之外的所有内容 - regex to replace everything except lowercase letters, numeric characters, underscores, and dashes 正则表达式python删除冒号和下划线 - Regex python remove colons and underscores 我可以对字母,破折号和下划线使用python正则表达式吗? - Can I use a python regex for letters, dashes and underscores? 使用正则表达式python在字符串中多次替换数字 - Multiple substitutions of numbers in string using regex python Python正则表达式,删除除Unicode字符串的连字符以外的所有标点符号 - Python regex, remove all punctuation except hyphen for unicode string Python3正则表达式:删除/和|以外的所有字符 从字符串 - Python3 Regex: Remove all characters except / and | from string 删除所有数字,除了使用 python regex 组合成字符串的数字 - Remove all numbers except for the ones combined to string using python regex Python 用下划线替换字符串中除 3 个字符外的所有字符 - Python replace all characters in a string except for 3 characters with underscores 如何用除空格以外的破折号替换字符串中的字符python - how to replace characters within string with dashes except spaces python Python中的多个特定正则表达式替换 - Multiple, specific, regex substitutions in Python
 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM