Multiple negative lookbehind assertions in Python regex?
I'm new to programming, so sorry if this seems trivial. I have a text that I'm trying to split into individual sentences using regular expressions. With the .split method I search for a dot followed by a space and a capital letter, like
"\. A-Z"
However, I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that letter starts a month name, like January | February | March.
I tried implementing the first half, but even this did not work. My code was:
"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "
First, I think you may want to replace the space with \s+, or \s if it really is exactly one space (you often find double spaces in English text).
Second, to match an uppercase letter you have to use the character class [A-Z]; a bare A-Z will not work, because it matches the literal characters A, - and Z (but remember there may be uppercase letters other than A-Z...).
Additionally, I think I know why this does not work. The regular expression engine will try to match \. [A-Z] if it is not preceded by Abs, or if it is not preceded by S. The thing is that, if it is preceded by an S, it is not preceded by Abs, so the first alternative matches. If it is preceded by Abs, it is not preceded by S, so the second alternative matches. Either way one of the alternatives will match, since Abs and S are mutually exclusive.
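To make the point about alternation concrete, here is a small sketch. The sample text and the variable names are my own, and the alternation pattern is a cleaned-up version of the one in the question (with [A-Z] fixed), not the asker's exact string.

```python
import re

# Sample text (an assumption for illustration): one dot after "Abs",
# one after "S", and one ordinary sentence boundary.
text = "See Abs. One and Johann S. Bach. Next sentence."

# Alternation: a dot is accepted if EITHER lookbehind succeeds.
# At least one of them always succeeds, so nothing is excluded.
either = re.compile(r"(?<!Abs)\. [A-Z]|(?<!S)\. [A-Z]")

# Chaining: a dot is accepted only if BOTH lookbehinds succeed.
both = re.compile(r"(?<!Abs)(?<!S)\. [A-Z]")

either_hits = [m.group() for m in either.finditer(text)]
both_hits = [m.group() for m in both.finditer(text)]

print(either_hits)  # ['. O', '. B', '. N']
print(both_hits)    # ['. N']
```

The alternation version still matches the dots after Abs and S, while the chained version keeps only the real boundary before Next.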
The pattern for the first part of your question could be
(?<!Abs)(?<!S)(\. [A-Z])
or
(?<!Abs)(?<!S)(\.\s+[A-Z])
(with my suggestion)
That is because you have to avoid |. Without it, the expression says "not preceded by Abs and not preceded by S". Only if both are true will the pattern matcher continue scanning the string and find your match.
To exclude the month names I came up with this regular expression:
(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]
The same arguments hold for the negative lookahead patterns.
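As a rough check of that pattern, the sketch below walks over its matches and cuts the text at each one. The sample text and the manual slicing loop are assumptions of mine; the pattern consumes the capital letter, so the loop hands that letter back to the next sentence.

```python
import re

# Pattern from the answer; only the three month names from the question are listed.
boundary = re.compile(r"(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]")

# Sample text (an assumption for illustration), including a double space.
text = "It rained. January was cold.  Then spring came. See Abs. Two for details. The end."

sentences = []
start = 0
for m in boundary.finditer(text):
    # m.start() is the dot itself, because lookbehinds consume nothing.
    sentences.append(text[start:m.start() + 1])
    # The last matched character is the capital letter of the next sentence.
    start = m.end() - 1
sentences.append(text[start:])

for s in sentences:
    print(s)
```

This prints four sentences: the dot before January and the dot after Abs are not treated as boundaries.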
I'm adding a short answer to the question in the title, since this is at the top of Google's search results:
The way to have multiple negative lookbehinds of different lengths is to chain them together like this:
"(?<!1)(?<!12)(?<!123)example"
This would match example, 2example and 3example, but not 1example, 12example or 123example.
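The claim is easy to verify with a few lines of Python. The sketch also shows why chaining is needed at all: in the CPython versions I have seen, the built-in re module refuses a single lookbehind whose alternatives have different lengths. The variable names are my own.

```python
import re

chained = re.compile(r"(?<!1)(?<!12)(?<!123)example")

samples = ["example", "2example", "3example",
           "1example", "12example", "123example"]

# True where the pattern finds a match, False where a lookbehind blocks it.
results = {s: bool(chained.search(s)) for s in samples}
for s, ok in results.items():
    print(f"{s:12} {ok}")

# One lookbehind with alternatives of different widths is rejected by the
# standard re module (hedged: behaviour of the versions I am aware of).
try:
    re.compile(r"(?<!1|12|123)example")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("re.error:", exc)
```

The first three samples match and the last three do not.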
Use the nltk punkt tokenizer. It's probably more robust than using regex.
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
Use nltk or similar tools as suggested by @root.
To answer your regex question:
import re
import sys
print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))
Input:
First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth
Output:
['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
'S. Sixth', 'ABs', 'Eighth']
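You can use a character set [].
'(?<![123])example'
This would not match the example in 1example, 2example or 3example. Note that the set goes inside the lookbehind and the literal example after it, that commas inside [] are matched literally rather than acting as separators, and that this only helps when every excluded prefix is a single character.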
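A minimal sketch of the character-set approach, assuming the lookbehind is placed before the literal example and the set is written as [123] without commas. The sample strings are my own.

```python
import re

# One single-character lookbehind using a character set.
single_char = re.compile(r"(?<![123])example")

samples = ["example", "1example", "2example", "3example",
           "4example", "12example"]
hits = {s: bool(single_char.search(s)) for s in samples}
for s, ok in hits.items():
    print(f"{s:12} {ok}")

# Commas inside a set are literal characters, so [1,2,3] also
# blocks a match when "example" is preceded by a comma.
comma_blocked = re.search(r"(?<![1,2,3])example", ",example") is None
print(comma_blocked)  # True
```

Because the set only looks at one character, 12example is rejected here as well; for prefixes of different lengths the chained lookbehinds are still needed.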