Python regex negative lookahead method

I'm extracting firm names from text data (10-K statement data).

I first tried the NLTK StanfordTagger and extracted all the words tagged as organization. However, it quite often failed to recall all the firm names, and since I apply the tagger to every related sentence, it took a very long time.

So now I'm trying to extract all the words that start with a capital letter (or that consist entirely of capital letters).

I found the regex below helpful.

(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+

However, it cannot distinguish the name of a segment from the name of a firm.

For example,

sentence: The Company's customers include, among others, Conner Peripherals Inc.("Conner"), Maxtor Corporation ("Maxtor"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry.

I want to extract Conner Peripherals Inc, Conner, Maxtor Corporation, Maxtor, Applieds, but not 'Silicon Systems', since it is the name of a segment.

So, I tried using

(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)

However, it still extracts 'Silicon Systems'.

Could you help me solve this problem?

(Or do you have any idea how to extract only the firm names from the text data?)

Thanks a lot!

You need to capture all the consecutive words! Mark the pattern for an individual word starting with a capital as non-capturing ( (?:...) ) and wrap the repetition in a capturing group, so that consecutive words are captured as a whole.

>>> re.findall("((?:[A-Z]+[a-zA-Z\-0-9']*\.?\s?)+)+?(?![Ss]egment)",sentence)
["The Company's ", 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation ', 'Maxtor', 'The ', 'Applieds ', '']

The NLTK approach, or any machine learning approach, seems to be a better fit here. I can only explain what the difficulty and the current issue with the regex approach are.

The problem is that the expected matches can contain space-separated phrases, and you want to avoid matching a phrase that is followed by segment. Even if you correct the negative lookahead to (?!\s*[Ss]egment), and make the pattern linear with something like \b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?(?!\s+[sS]egment), you will still match Silicon, a part of the unwanted match: when the lookahead fails after Silicon Systems, the regex engine backtracks to the shorter Silicon, which is followed by Systems rather than segment, so the lookahead succeeds there.
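The backtracking behaviour described above can be reproduced on a shortened sentence. This is only a minimal sketch: the variable names and the shortened text are chosen here for illustration, while the two patterns are the one from the question and the "corrected" linear one.

```python
import re

# Shortened sample; only "Silicon Systems" is capitalized in it.
text = "manufacturing equipment in the Silicon Systems segment to the global industry"

# Pattern from the question: the lookahead is tested right after the optional
# trailing whitespace, so the engine gives the space back and the lookahead
# then sees " segment" instead of "segment".
asker_rx = r"(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)"
asker_result = re.findall(asker_rx, text)
print(asker_result)   # ['Silicon Systems']

# Linear pattern with a whitespace-aware lookahead: the full phrase is now
# rejected, but the engine backtracks to the shorter "Silicon".
entity_rx = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
linear_rx = entity_rx + r"(?!\s+[sS]egment)"
linear_result = re.findall(linear_rx, text)
print(linear_result)  # ['Silicon']
```

In both cases the lookahead only vetoes one particular end position; it does not stop the engine from trying a shorter match that starts at the same place.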

What you might try is to match all these entities, discard the ones that are followed by segment, and keep only the entities that occur in other contexts by capturing them into Group 1.

See the sample regex for this:

\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?\s+[sS]egment\b|(\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?)

Since it is unwieldy, you should think of building it from blocks, dynamically:

import re
entity_rx = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
rx = r"{0}\s+[sS]egment\b|({0})".format(entity_rx)
s = "The Company's customers include, among others, Conner Peripherals Inc.(\"Conner\"), Maxtor Corporation (\"Maxtor\"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry."
# re.findall returns Group 1 for each match; it is '' when the "segment" branch matched
matches = list(filter(None, re.findall(rx, s)))
print(matches)
# => ['The Company', 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation', 'Maxtor', 'The', 'Applieds']

So,

  • \b - matches a word boundary
  • [A-Z][a-zA-Z0-9-]* - an uppercase letter followed by letters/digits/-
  • (?:\s+[A-Z][a-zA-Z0-9-]*)* - zero or more sequences of
    • \s+ - 1+ whitespaces
    • [A-Z][a-zA-Z0-9-]* - an uppercase letter followed by letters/digits/-
  • \b - trailing word boundary
  • \.? - an optional .

Then, this block is used to build

  • {0}\s+[sS]egment\b - the block we defined before, followed by
    • \s+ - 1+ whitespaces
    • [sS]egment\b - either segment or Segment as a whole word
  • | - or
  • ({0}) - Group 1 (what re.findall actually returns): the block we defined before.
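To see what the alternation returns before any filtering, here is a small sketch on a made-up two-entity sample (the `sample` string is an assumption for illustration, not part of the original data). The first branch consumes the unwanted phrase without capturing it, so re.findall reports an empty Group 1 for that match.

```python
import re

entity_rx = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
rx = r"{0}\s+[sS]egment\b|({0})".format(entity_rx)

# Hypothetical short sample: one firm name and one segment name.
sample = "Applieds sells equipment in the Silicon Systems segment"

raw = re.findall(rx, sample)
print(raw)  # ['Applieds', '']
```

The empty string marks the place where "Silicon Systems segment" was matched by the first branch and thrown away.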

filter(None, re.findall(rx, s)) filters out the empty items in the final list. In Python 2.x it returns a list directly; in Python 3.x wrap it as list(filter(None, re.findall(rx, s))).
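If the same extraction is needed in several places, the pieces above could be wrapped in a small helper. This is only a sketch: the function name `extract_entities` and the `suffix_rx` parameter are hypothetical names introduced here, and re.finditer is used in place of re.findall so that discarded matches can be skipped directly.

```python
import re

ENTITY_RX = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"

def extract_entities(text, suffix_rx=r"[sS]egment"):
    """Return capitalized phrases that are not followed by the suffix word."""
    rx = re.compile(r"{0}\s+(?:{1})\b|({0})".format(ENTITY_RX, suffix_rx))
    # Group 1 is None when the discard branch (entity + suffix) matched.
    return [m.group(1) for m in rx.finditer(text) if m.group(1)]

found = extract_entities(
    "sales in the Silicon Systems segment to Maxtor Corporation and Conner"
)
print(found)  # ['Maxtor Corporation', 'Conner']
```

Like the original pattern, this still returns every capitalized phrase (for example 'The Company'), so it narrows the candidates rather than identifying firms.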
