Multiple negative lookbehind assertions in Python regex?

Question

I'm new to programming, sorry if this seems trivial: I have a text that I'm trying to split into individual sentences using regular expressions. With the .split method I search for a dot followed by a capital letter like

"\. A-Z"

However I need to refine this rule in the following way: the . (dot) may not be preceded by either Abs or S. And if it is followed by a capital letter (A-Z), it should still not match if that letter starts a month name, like January | February | March.

I tried implementing the first half, but even this did not work. My code was:

"( (?<!Abs)\. A-Z) | (?<!S)\. A-Z) ) "

Answer 1

First, I think you may want to replace the space with \s+ , or \s if it really is exactly one space (you often find double spaces in English text).

Second, to match an uppercase letter you have to use [A-Z] ; a bare A-Z will not work (and remember there may be uppercase letters other than A-Z ...).

Additionally, I think I know why this does not work. The regular expression engine will try to match \. [A-Z] if it is not preceded by Abs or S. The thing is that, if it is preceded by an S, it is not preceded by Abs, so the first pattern matches. If it is preceded by Abs, it is not preceded by S, so the second pattern matches. Either way one of those patterns will match, since Abs and S are mutually exclusive.

The pattern for the first part of your question could be

(?<!Abs)(?<!S)(\. [A-Z])

or

(?<!Abs)(?<!S)(\.\s+[A-Z])

(with my suggestion)

That is because you have to avoid | . Without it, the expression says "not preceded by Abs and not preceded by S". Only if both are true will the pattern matcher continue to scan the string and find your match.
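The difference between alternating and chaining the lookbehinds can be checked with a small sketch. The sample text and variable names are made up for illustration; the two patterns are the corrected alternation from the question and the chained version above.

```python
import re

# Hypothetical sample text with three candidate boundaries.
text = "See Abs. Two and S. Bach. Next sentence."

# Alternation: a position is accepted if EITHER lookbehind passes,
# so "Abs. T" slips through via (?<!S) and "S. B" via (?<!Abs).
either = re.compile(r"(?<!Abs)\. [A-Z]|(?<!S)\. [A-Z]")

# Chaining: a position is accepted only if BOTH lookbehinds pass.
both = re.compile(r"(?<!Abs)(?<!S)\. [A-Z]")

either_hits = either.findall(text)
both_hits = both.findall(text)
print(either_hits)  # ['. T', '. B', '. N']
print(both_hits)    # ['. N']
```

Only the chained pattern leaves the dots after Abs and S alone.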

To exclude the month names I came up with this regular expression:

(?<!Abs)(?<!S)(\.\s+)(?!January|February|March)[A-Z]

The same arguments hold for the negative lookahead patterns.
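One thing to watch when such a pattern is passed to re.split: a trailing [A-Z] is part of the match, so the first letter of the next sentence is removed along with the separator. A possible variant, shown here only as a sketch with a made-up sample text, moves the letter into a lookahead (?=[A-Z]) so it is checked but not consumed. The capturing group is dropped in both patterns because re.split would otherwise return the captured separators as list items.

```python
import re

# Hypothetical sample text; "January" must not start a new sentence.
text = "It rained. January was cold. Then spring came."

# Trailing [A-Z] is consumed by the match and lost on split.
consuming = re.compile(
    r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)[A-Z]")

# Same conditions, but the capital letter is only looked at.
keeping = re.compile(
    r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])")

consumed = consuming.split(text)
kept = keeping.split(text)
print(consumed)  # ['It rained. January was cold', 'hen spring came.']
print(kept)      # ['It rained. January was cold', 'Then spring came.']
```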

Answer 2

I'm adding a short answer to the question in the title, since this is at the top of Google's search results:

The way to have multiple negative lookbehinds of different lengths is to chain them together like this:

"(?<!1)(?<!12)(?<!123)example"

This would match example, 2example and 3example, but not 1example, 12example or 123example.
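A quick way to confirm this behaviour, and to see why chaining is needed in Python, is the sketch below. The sample list and variable names are illustrative. The second half relies on the fact that Python's re module only accepts fixed-width lookbehinds, so putting alternatives of different lengths inside a single lookbehind raises re.error.

```python
import re

chained = re.compile(r"(?<!1)(?<!12)(?<!123)example")

samples = ["example", "2example", "3example",
           "1example", "12example", "123example"]
matched = [s for s in samples if chained.search(s)]
print(matched)  # ['example', '2example', '3example']

# A single lookbehind with alternatives of different widths
# is rejected when the pattern is compiled.
try:
    re.compile(r"(?<!1|12|123)example")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False
```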

Answer 3

Use nltk punkt tokenizer. It's ~~probably~~ more robust than using regex.

>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... """
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

Answer 4

Use nltk or similar tools as suggested by @root.

To answer your regex question:

import re
import sys

print(re.split(r"(?<!Abs)(?<!S)\.\s+(?!January|February|March)(?=[A-Z])",
               sys.stdin.read()))

Input

First. Second. January. Third. Abs. Forth. S. Fifth.
S. Sixth. ABs. Eighth

Output

['First', 'Second. January', 'Third', 'Abs. Forth', 'S. Fifth',
 'S. Sixth', 'ABs', 'Eighth']

Answer 5

You can use a character set [].

'(?<![123])example'

This would not match 1example, 2example, or 3example. Note that a set only covers prefixes that are a single character long.
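The character-set idea can be checked with a short sketch. It assumes the lookbehind is placed in front of example and the set is written without commas, because inside [] a comma is a literal member of the set rather than a separator. The sample strings are made up for illustration.

```python
import re

# One lookbehind covering three single-character prefixes.
single_char = re.compile(r"(?<![123])example")

samples = ["example", "1example", "2example", "3example",
           "4example", ",example", "12example"]
set_matched = [s for s in samples if single_char.search(s)]
print(set_matched)  # ['example', '4example', ',example']

# With commas in the set, a preceding comma is excluded as well.
comma_set = re.compile(r"(?<![1,2,3])example")
comma_hit = comma_set.search(",example")
print(comma_hit)  # None
```

Multi-character prefixes such as Abs still need their own lookbehind, chained as in the answers above.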

5 answers

Answer 1: score 19, accepted, 2012-10-02 11:16:14
Answer 2: score 4, 2019-07-12 08:13:12
Answer 3: score 1, 2012-10-02 11:13:47
Answer 4: score 1, 2012-10-02 11:41:00
Answer 5: score -2, 2020-06-11 08:43:48
