Multiple negative lookbehind assertions in Python regex?

I'm new to programming, sorry if this seems trivial: I have a text that I'm trying to split into individual sentences using regular expressions. With the .split method I search for a dot followed by a capital letter, like

"\. A-Z"

However, I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that letter starts a month name, like January | February | March.

I tried implementing the first half, but even this did not work. My code was:

"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "

First, I think you may want to replace the space with \s+, or \s if it really is exactly one space (you often find double spaces in English text).

Second, to match an uppercase letter you have to use [A-Z]; a bare A-Z will not work, because it matches the literal characters "A-Z" (and remember there may be other uppercase letters than A-Z ...).

Additionally, I think I know why this does not work. The regular expression engine will try to match \. [A-Z] if it is not preceded by Abs, or if it is not preceded by S. The thing is that, if it is preceded by an S, it is not preceded by Abs, so the first pattern matches. If it is preceded by Abs, it is not preceded by S, so the second pattern matches. Either way one of those patterns will match, since Abs and S are mutually exclusive.

The pattern for the first part of your question could be

(?<!Abs)(?<!S)(\. [A-Z])

or

(?<!Abs)(?<!S)(\.\s+[A-Z])

(with my suggestion)

That is because you have to avoid |. Without it, the expression says: not preceded by Abs and not preceded by S. Only if both are true does the pattern matcher go on to match the rest of the pattern at that position.
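The difference between the two forms can be checked with a small sketch. The sample sentence and the tidied alternation pattern below are assumptions made for illustration; they are not the asker's exact code.

```python
import re

# Hypothetical sample text, chosen so that every case appears once.
text = "See Abs. Two and S. Three. Next sentence."

# Alternation: it is enough for ONE lookbehind to hold, so a dot after
# "Abs" still passes the "not preceded by S" branch, and vice versa.
alternation = re.compile(r"(?:(?<!Abs)|(?<!S))\.\s+[A-Z]")

# Chaining: BOTH lookbehinds must hold at the same position.
chained = re.compile(r"(?<!Abs)(?<!S)\.\s+[A-Z]")

alternation_hits = [m.group() for m in alternation.finditer(text)]
chained_hits = [m.group() for m in chained.finditer(text)]

print(alternation_hits)  # ['. T', '. T', '. N']
print(chained_hits)      # ['. N']
```

With alternation, every dot followed by a capital letter is reported; with chaining, only the dot after "Three" is.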

To exclude the month names I came up with this regular expression:

(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]

The same arguments hold for the negative lookahead patterns.
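As a rough check of the month exclusion, the pattern can be run over a made-up sentence (the text below is an assumption, not from the question):

```python
import re

# Pattern from the answer: the negative lookahead sits right before the
# capital letter, so a boundary is rejected when a listed month follows.
boundary = re.compile(r"(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]")

# Hypothetical sample text.
text = "It rained. January was cold. March came. Then spring."

# Each match is the separator plus the first letter of the next sentence.
hits = [m.group() for m in boundary.finditer(text)]
print(hits)  # ['. T']
```

Note that this pattern consumes the capital letter, so passing it to re.split as-is would drop the first letter of each following sentence; for splitting, a lookahead such as (?=[A-Z]) in place of [A-Z] is probably the safer choice.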

I'm adding a short answer to the question in the title, since this is at the top of Google's search results:

The way to have multiple negative lookbehinds of different lengths is to chain them together like this:

"(?<!1)(?<!12)(?<!123)example"

This would match example, 2example and 3example, but not 1example, 12example or 123example.
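A quick way to confirm this is to run the chained pattern over the six strings. Chaining is needed because Python's re module only accepts fixed-width lookbehinds, so the alternatives cannot be combined as (?<!1|12|123). The variable names below are illustrative only.

```python
import re

# Three fixed-width lookbehinds, all of which must hold before "example".
pattern = re.compile(r"(?<!1)(?<!12)(?<!123)example")

samples = ["example", "2example", "3example",
           "1example", "12example", "123example"]

# Map each sample to whether the pattern is found anywhere in it.
results = {s: bool(pattern.search(s)) for s in samples}

for sample, matched in results.items():
    print(sample, "->", "match" if matched else "no match")
```

The first three strings print "match" and the last three print "no match".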

Use the nltk punkt tokenizer. It's probably more robust than using regex.

>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

Use nltk or similar tools, as suggested by @root.

To answer your regex question:

import re
import sys

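# Split after a dot and whitespace, unless the dot follows "Abs" or "S"
# or the next word is one of the listed month names.
print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))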

Input

First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth

Output

['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
 'S. Sixth', 'ABs', 'Eighth']

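You can use a character set [] when every excluded prefix is a single character:

'(?<![123])example'

This would not match 1example, 2example or 3example. The set goes inside the lookbehind and the literal example after it; commas inside [] are matched literally, so they are left out.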
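A minimal sketch of the character-set idea, assuming the intent is a one-character class inside the lookbehind followed by the literal, i.e. (?<![123])example. The sample strings are made up for illustration.

```python
import re

# Assumed form: one character class inside the lookbehind, literal after it.
pattern = re.compile(r"(?<![123])example")

samples = ["example", "4example", "1example", "2example", "3example"]

# Map each sample to whether the pattern is found anywhere in it.
results = {s: bool(pattern.search(s)) for s in samples}

for sample, matched in results.items():
    print(sample, "->", "match" if matched else "no match")
```

This only covers prefixes that are a single character long; prefixes of different lengths still need chained lookbehinds.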
