How do I append a list of negative lookbehinds to a python regular expression?

I'm trying to split a paragraph into sentences using a regex split, based on the second answer posted here: a Regex for extracting sentence from a paragraph in python

But I have a list of abbreviations that should not end a sentence even though they are followed by a period, and I don't know how to append that list to the regular expression properly. I'm reading in the abbreviations from a file that contains terms like Mr. Ms. Dr. St. (one on each line).

Short answer: You can't do it inside a single lookbehind assertion, unless all of its alternatives are of the same, fixed width (which they probably aren't in your case; your example contained only two-letter abbreviations, but Mrs. would break your regex).

This is a limitation of the current Python regex engine.
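To see the limitation concretely, here is a minimal sketch that assumes the standard library re module (third-party engines may behave differently). It tries to compile one lookbehind whose alternatives have different widths:

```python
import re

# One lookbehind with alternatives of width 2 ("Mr") and width 3 ("Mrs").
try:
    re.compile(r"(?<!Mr|Mrs)\.")
    compiled = True
except re.error as exc:
    # The standard re module rejects lookbehinds that are not fixed-width.
    compiled = False
    print("re.error:", exc)
```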

Longer answer:

You could write a regex like (?s)(?<!.Mr|Mrs|.Ms|.St)\. where each alternative of the lookbehind assertion is padded with as many . characters as needed to bring all of them to the same width. However, that would fail in some circumstances, for example when a paragraph begins with Mr.
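A quick sketch of that failure mode, using the padded regex with re.split (the two sample strings are made up for illustration):

```python
import re

# Every alternative is padded to width 3 so the lookbehind compiles.
padded = re.compile(r"(?s)(?<!.Mr|Mrs|.Ms|.St)\.")

# Mid-paragraph, the padding dot has a character to consume,
# so the period after "Mr" is not treated as a split point.
mid_pieces = padded.split("I met Mr. T. He left.")
print(mid_pieces)    # ['I met Mr. T', ' He left', '']

# At the very start there are only two characters before the period,
# so no width-3 alternative can match and "Mr." is split anyway.
start_pieces = padded.split("Mr. T left.")
print(start_pieces)  # ['Mr', ' T left', '']
```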

Anyway, you're not using the right tool here. Better use a tool designed for the job, for example the Natural Language Toolkit.

If you're stuck with regex (too bad!), then you could try a findall() approach instead of split():

(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*

would match a sentence that ends in . (optionally followed by whitespace) and contains no other dots unless they are preceded by one of the allowed abbreviations.

>>> import re
>>> s = "My name is Mr. T. I pity the fool who's not on the A-Team."
>>> re.findall(r"(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*", s)
['My name is Mr. T. ', "I pity the fool who's not on the A-Team."]
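Since the abbreviations in the question come from a file, the alternation in the pattern above could be generated rather than typed by hand. This is only a sketch: it assumes one abbreviation per line, each with its trailing period, and build_sentence_pattern and the in-memory lines list are hypothetical stand-ins for the real file handling.

```python
import re

def build_sentence_pattern(abbreviations):
    """Build the findall() pattern from an iterable of lines such as 'Mr.'."""
    # Drop surrounding whitespace and the trailing period, skip blank lines,
    # and escape each term so it is matched literally.
    stems = [re.escape(a.strip().rstrip(".")) for a in abbreviations if a.strip()]
    # Longest first, so "Mrs" is tried before "Mr".
    stems.sort(key=len, reverse=True)
    alternation = "|".join(stems)
    return re.compile(r"(?:\b(?:%s)\.|[^.])+\.\s*" % alternation)

# Stand-in for: with open(path) as f: lines = f.readlines()
lines = ["Mr.\n", "Ms.\n", "Dr.\n", "Mrs.\n", "St.\n"]
pattern = build_sentence_pattern(lines)
sentences = pattern.findall("Dr. Who met Mrs. Smith. They talked.")
print(sentences)  # ['Dr. Who met Mrs. Smith. ', 'They talked.']
```

An empty abbreviation list is not handled here; in that case the alternation would be empty and the pattern would need a different fallback.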

I don't directly answer your question, but this post should contain enough information for you to write a working regex for your problem.

You can append a list of negative look-behinds. Remember that look-behinds are zero-width, which means that you can put as many look-behinds as you want next to each other, and they all look behind from the same position. As long as you don't need an unbounded quantifier (e.g. *, +, {n,}) in the look-behind, everything should be fine (?).

So the regex can be constructed like this:

(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+

It is a bit too verbose. Anyway, I wrote this post just to demonstrate that it is possible to look behind on a list of fixed strings.

Example run:

>>> s = 'something patterning of patterned crap patternon not patterner, not allowed patternes to patternsses, patternet'
>>> re.findall(r'(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+', s)
['patterning', 'patternon', 'patternet']
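Applied to the abbreviation problem from the question, the same idea might look like the sketch below: one fixed-width negative look-behind per abbreviation, chained in front of the whitespace that separates sentences. The name build_splitter and the sample lines are hypothetical, and the \b inside each look-behind is an added assumption so that words merely ending in an abbreviation are not affected.

```python
import re

def build_splitter(abbreviations):
    """Chain one negative look-behind per abbreviation (lines such as 'Mr.')."""
    stems = [a.strip().rstrip(".") for a in abbreviations if a.strip()]
    # Each look-behind has its own fixed width, so "Mr." and "Mrs." can coexist.
    lookbehinds = "".join(r"(?<!\b%s\.)" % re.escape(s) for s in stems)
    # Split on whitespace that follows a period not belonging to an abbreviation.
    return re.compile(lookbehinds + r"(?<=\.)\s+")

lines = ["Mr.\n", "Ms.\n", "Dr.\n", "Mrs.\n", "St.\n"]
splitter = build_splitter(lines)
sentences = splitter.split("Dr. Who met Mrs. Smith. They talked.")
print(sentences)  # ['Dr. Who met Mrs. Smith.', 'They talked.']
```

Because the look-behinds are separate assertions rather than padded alternatives, a paragraph that begins with Mr. is handled as well.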

There is a catch in using look-behind, though. If the number of spaces between the blacklisted text and the text matching the pattern varies, the regex above will fail. I really doubt there is a way to modify the regex so that it works for that case while keeping the look-behinds. (You can always collapse consecutive spaces into one first, but that won't work for more general cases.)
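A small sketch of that catch and of the space-collapsing workaround, using a single look-behind and a made-up input with two spaces:

```python
import re

pattern = re.compile(r"(?<!not )pattern\w+")

text = "not  patterner"  # two spaces between "not" and the match
# The four characters before "pattern" are "ot  ", not "not ",
# so the look-behind does not block the match.
raw_hits = pattern.findall(text)
print(raw_hits)         # ['patterner']

# Collapsing runs of spaces first restores the intended behaviour.
normalized = re.sub(r" +", " ", text)
normalized_hits = pattern.findall(normalized)
print(normalized_hits)  # []
```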
