Can I write a RegEx which matches a pattern, and have part of that pattern be an inverse match?
I want to write a RegEx to remove ellipses from a large text.
I need to find a series of two or more dots, possibly with spaces between them. The RegEx I'm using also matches full stops which I don't want to remove, so I want part of the pattern to reject a match if it's followed by a particular string.
I've been using this pattern:
re.compile(r'\.[ \.]*\.')
The problem is that some legitimate abbreviations in the text are being caught by it.
Take this text for example:
1. Here are ... some . . ellipses..
2. This. . .is ellipsis also.
3. Here is an abbreviation. .i.
In the example above, I want my pattern to find only the "...", ". .", "..", and ". . ." in lines 1 and 2. I don't want it to find anything in line 3; however, it finds ". ." there.
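To make the problem concrete, here is a minimal sketch that runs the pattern above against each example line separately. The names naive, lines, and matches_per_line are only illustrative and do not come from the original post.

```python
import re

# The pattern from the question: a dot, any run of spaces/dots, then a dot.
naive = re.compile(r'\.[ \.]*\.')

lines = [
    "1. Here are ... some . . ellipses..",
    "2. This. . .is ellipsis also.",
    "3. Here is an abbreviation. .i.",
]

# Collect the matches for each line on its own.
matches_per_line = [naive.findall(line) for line in lines]
for line, found in zip(lines, matches_per_line):
    print(found, "<-", line)
# Line 3 yields ['. .'], which is the unwanted match inside the abbreviation.
```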
I could update the RegEx to exclude matches that are preceded or followed by the letter i, like this:
re.compile(r'[^i]\.[ \.]*\.[^i]')
but then the pattern won't find the ellipsis in line 2.
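A short sketch of why the [^i] variant misbehaves on line 2, assuming the pattern is written with the quote closed after the final [^i]. A character class such as [^i] has to consume a real character on each side, so the match takes in the neighbouring "s" and a space, and it stops short of the last dot because that dot is followed by "i". The names consuming, consumed, and cleaned are only illustrative.

```python
import re

line2 = "2. This. . .is ellipsis also."

# [^i] must match (and therefore consume) one character on each side of the dots.
consuming = re.compile(r'[^i]\.[ \.]*\.[^i]')

consumed = consuming.findall(line2)
cleaned = consuming.sub('', line2)

print(consumed)  # ['s. . ']  (the "s" and a trailing space are part of the match)
print(cleaned)   # 2. Thi.is ellipsis also.
```

Removing that match would delete the "s" of "This" and leave one dot behind, so excluding characters with a class is not a workable fix here.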
Ideally I'd be able to negate a whole sub-string within the pattern, so that it won't treat ". ." as an ellipsis if it's followed by "i." or preceded by ".i". However, I haven't been able to find any way to do this. Is it possible?
Use a negative lookahead and a negative lookbehind:
import re
text = """
1. Here are ... some . . ellipses..
2. This. . .is ellipsis also.
3. Here is an abbreviation. .i.
"""
pattern = re.compile(r'(?<!\.i)\.[ \.]*\.(?!i\.)')
print(pattern.findall(text)) # ['...', '. .', '..', '. . .']
print(pattern.sub('', text))
Text after removing the "." sequences:
1. Here are  some  ellipses
2. Thisis ellipsis also.
3. Here is an abbreviation. .i.
To avoid matching a sequence of "." that is followed by "i.", the lookahead must include the character after the i, that is (?!i\.) rather than just (?!i). Otherwise the ellipsis in this case would not be matched in full:
. . .is
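As a rough illustration of that last point, the sketch below compares a lookahead that rejects any following "i" with the one from the answer, which rejects only a following "i.". The names too_broad and precise are hypothetical and used only for this comparison.

```python
import re

line2 = "2. This. . .is ellipsis also."
line3 = "3. Here is an abbreviation. .i."

# Rejects a match whenever the next character is "i".
too_broad = re.compile(r'(?<!\.i)\.[ \.]*\.(?!i)')
# Rejects a match only when it is followed by "i." (the abbreviation).
precise = re.compile(r'(?<!\.i)\.[ \.]*\.(?!i\.)')

broad_l2 = too_broad.findall(line2)
precise_l2 = precise.findall(line2)
broad_l3 = too_broad.findall(line3)
precise_l3 = precise.findall(line3)

print(broad_l2)    # ['. .']    (backtracks and leaves the last dot behind)
print(precise_l2)  # ['. . .']  (the whole ellipsis)
print(broad_l3, precise_l3)  # [] []  (the abbreviation is left alone by both)
```

With the shorter lookahead the engine backtracks to a shorter match on line 2, so a stray dot would remain after substitution. Both versions skip the abbreviation in line 3.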