
Can I write a RegEx which matches a pattern, and have part of that pattern be an inverse match?

I want to write a RegEx to remove ellipses from a large text.

I need to find a series of two or more dots, with or without spaces between them. The RegEx I'm using also finds full stops which I don't want to remove, so I want part of the RegEx pattern to negate the match if it's followed by a particular string.

I've been using this pattern: re.compile(r'\.[ \.]*\.')

The problem is that some legitimate abbreviations in the text are also caught by this pattern.

Take this text for example:

1. Here are ... some . . ellipses..
2. This. . .is ellipsis also.
3. Here is an abbreviation. .i.

In the example above, I want my pattern to find only the "...", ". .", "..", and ". . ." sequences in lines 1 and 2. I don't want it to find anything in line 3; however, it will find ". ." in it.

I could update the RegEx to exclude matches that are preceded or followed by the letter i, like this: re.compile(r'[^i]\.[ \.]*\.[^i]'), but then the pattern won't find the ellipsis in line 2.

Ideally I'd be able to negate a whole sub-string within the pattern so that it won't treat ". ." as an ellipsis if it's followed by "i." or preceded by ".i". However, I haven't been able to find any way to do this. Is it possible?

Use a negative lookahead and a negative lookbehind:

import re

text = """
1. Here are ... some . . ellipses..
2. This. . .is ellipsis also.
3. Here is an abbreviation. .i.
"""

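# (?<!\.i) : the match must not be preceded by ".i"
# (?!i\.)  : the match must not be followed by "i."
pattern = re.compile(r'(?<!\.i)\.[ \.]*\.(?!i\.)')
print(pattern.findall(text))   # ['...', '. .', '..', '. . .']
print(pattern.sub('', text))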

Text after removing the "." sequences:

1. Here are  some  ellipses
2. Thisis ellipsis also.
3. Here is an abbreviation. .i.
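
Lookarounds work here where the [^i] character classes from the question did not because they are zero-width: they inspect the neighbouring text without making it part of the match. A minimal sketch of the difference, using a shortened sample string of my own rather than the full text (the names consuming and lookaround are only illustrative):

```python
import re

sample = "Here are ... some"

# Character classes consume one character on each side of the dots,
# so those neighbours become part of the match and get removed too.
consuming = re.compile(r'[^i]\.[ \.]*\.[^i]')
# Lookarounds only check the neighbours; they are not part of the match.
lookaround = re.compile(r'(?<!\.i)\.[ \.]*\.(?!i\.)')

print(consuming.findall(sample))         # [' ... ']
print(lookaround.findall(sample))        # ['...']
print(repr(consuming.sub('', sample)))   # 'Here aresome'
print(repr(lookaround.sub('', sample)))  # 'Here are  some'
```

With the character-class version, the surrounding spaces are swallowed along with the dots, which is why it interferes with neighbouring text.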

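To skip only a sequence of "." that is followed by "i.", the lookahead has to check the character after the i as well, not just the i itself. Otherwise this case, where the ellipsis is followed by an ordinary word starting with i, would not be matched in full: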

     . . .is
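
A small sketch of that point, comparing a lookahead that only checks for i (a hypothetical looser variant, named loose here purely for illustration) with the one used in the answer:

```python
import re

line2 = "2. This. . .is ellipsis also."
line3 = "3. Here is an abbreviation. .i."

loose = re.compile(r'(?<!\.i)\.[ \.]*\.(?!i)')     # rejects any following "i"
strict = re.compile(r'(?<!\.i)\.[ \.]*\.(?!i\.)')  # rejects only a following "i."

print(loose.findall(line2))   # ['. .']
print(strict.findall(line2))  # ['. . .']
print(loose.findall(line3))   # []
print(strict.findall(line3))  # []
```

Both variants leave line 3 alone, but the looser one backtracks to a shorter run of dots in line 2 and would leave a stray "." behind after substitution.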
