Regex catastrophic backtracking: extracting capitalized words that precede a specific word

I'm relatively new to Python and am having trouble with a regex.

I'm trying to extract a firm's name that appears before the word 'sale(s)' (or 'Sale(s)').

I found that the firm names in my text data all start with a capital letter (the remaining characters can be lowercase or uppercase letters, digits, '-' or ', for example 'Abc Def', 'ABC DEF', or just 'ABC' or 'Abc'), and some of them take forms like 'Abc and Def' or 'Abc & Def'.

For example, from the text

;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived approximately 21% ($4,782,852) of its consolidated revenues from continuing operations from direct transactions with Kmart Corporation. Sales of Computer products was good. However, Computer's Parts and Display Segment sale has been decreasing.

I only want to extract 'Computer's Parts and Display Segment'.

So I tried to create a regex:

((?:(?:[A-Z]+[a-zA-Z\-0-9\']*\.?\s?(?:and |\& )?)+)+?(?:[S|s]ales?\s))

(
  1. [A-Z]+[a-zA-Z\-0-9\']*\.?\s? => this part finds words that start with a capital letter, where the remaining characters are a-z, A-Z, 0-9, -, ' or .

  2. (?:and |\& )? => this part matches the word 'and' or '&' between words
)

However, https://regex101.com/ reports catastrophic backtracking for this pattern. I have read some related articles but still cannot find a way to solve the problem.

Could you help me?

Thanks!

Overview

Pointing out a few things in your pattern:

  • [a-zA-Z\-0-9\'] You don't need to escape ' here. Also, you can place - at the start or end of the set, and then you won't need to escape it either.
  • \& The ampersand character doesn't need to be escaped.
  • [S|s] This says to match either S, |, or s, so you could potentially match |ales. The correct way to write this is [Ss].
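
The three points above can be checked with a short Python sketch. The variable names (broken, fixed, escaped, plain) and the sample string are made up for illustration and do not come from the original pattern:

```python
import re

# Inside a character class, | is a literal character, not alternation,
# so [S|s] accepts three characters: 'S', '|' and 's'.
broken = re.compile(r"[S|s]ales?")
fixed = re.compile(r"[Ss]ales?")

print(broken.fullmatch("|ales") is not None)  # True: the stray '|' is accepted
print(fixed.fullmatch("|ales") is not None)   # False

# Escaping ' is unnecessary, and a hyphen placed last in the set is literal,
# so both classes below accept exactly the same characters.
escaped = re.compile(r"[a-zA-Z\-0-9\']")
plain = re.compile(r"[a-zA-Z0-9'-]")
sample = "Aa0-'&|. "
same = all(
    bool(escaped.fullmatch(ch)) == bool(plain.fullmatch(ch)) for ch in sample
)
print(same)  # True

# An escaped and an unescaped ampersand match the same text.
print(re.fullmatch(r"\& ", "& ") is not None, re.fullmatch(r"& ", "& ") is not None)
```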

Code

See regex in use here

(?:(?:[A-Z][\w'-]*|and) +)+(?=[sS]ales?)

Results

Input

;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived approximately 21% ($4,782,852) of its consolidated revenues from continuing operations from direct transactions with Kmart Corporation. Sales of Computer products was good. However, Computer's Parts and Display Segment sale has been decreasing.

Output

Computer's Parts and Display Segment 
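
Since the question is about Python, here is a minimal sketch of running the pattern with the re module on the sample text. The variable names text, pattern and names are assumptions for illustration. Note that the raw match ends with the spaces consumed before the lookahead, so the sketch strips them:

```python
import re

text = (
    ";;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived approximately "
    "21% ($4,782,852) of its consolidated revenues from continuing operations "
    "from direct transactions with Kmart Corporation. Sales of Computer products "
    "was good. However, Computer's Parts and Display Segment sale has been "
    "decreasing."
)

# Same pattern as above; the lookahead checks for sale(s)/Sale(s) without consuming it.
pattern = re.compile(r"(?:(?:[A-Z][\w'-]*|and) +)+(?=[sS]ales?)")

# Each match includes the trailing space(s) before "sale", so strip them.
names = [m.group().strip() for m in pattern.finditer(text)]
print(names)  # ["Computer's Parts and Display Segment"]
```

"Sales of Computer products" yields no match here because nothing capitalized directly precedes "Sales" in that sentence (the period after "Corporation" breaks the run).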

Explanation

  • (?:(?:[A-Z][\w'-]*|and) +)+ Match this one or more times
    • (?:[A-Z][\w'-]*|and) Match either of the following
      • [A-Z][\w'-]* Match any uppercase ASCII character, followed by any number of word characters, apostrophes ' or hyphens -
      • and Match this literally
    •  + (a space followed by +) Match one or more spaces
  • (?=[sS]ales?) Positive lookahead ensuring that one of the words sale, Sale, sales, or Sales follows
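
The same breakdown can be written directly into the pattern using Python's re.VERBOSE flag. This is only a sketch of an equivalent layout, not part of the original answer. The names verbose_pattern and sentence are assumptions, and the space is written as [ ] because VERBOSE mode ignores bare whitespace:

```python
import re

verbose_pattern = re.compile(
    r"""
    (?:                     # one unit of word + spaces, repeated one or more times
        (?:
            [A-Z][\w'-]*    # an uppercase ASCII letter, then word characters, ' or -
          | and             # or the literal word and
        )
        [ ]+                # one or more spaces (a bare space would be ignored here)
    )+
    (?=[sS]ales?)           # lookahead: sale, Sale, sales or Sales must follow
    """,
    re.VERBOSE,
)

sentence = "However, Computer's Parts and Display Segment sale has been decreasing."
match = verbose_pattern.search(sentence)
print(match.group().strip())  # Computer's Parts and Display Segment
```

One likely reason this version behaves better than the pattern in the question: every repetition must end with at least one space, and the word starts with a single [A-Z] rather than [A-Z]+ followed by an overlapping class, so the engine has far fewer ways to split the same run of text between repetitions.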
