简体   繁体   中英

regex catastrophic backtracking ; extracting words starts with capital before the specific word

I'm relatively new to Python world and having trouble with regex.

I'm trying to extract Firm's name before the word 'sale(s)' (or Sale(s)).

I found that Firm's names in my text data are all start with capital letter(and the other parts can be lowercase or uppercase or numbers or '-' or ', for example 'Abc Def' or 'ABC DEF' or just 'ABC' or 'Abc'),

and some of them are taking forms like ('Abc and Def' or 'Abc & Def').

For example,

from the text,

;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived approximately 21% ($4,782,852) of its consolidated revenues from continuing operations from direct transactions with Kmart Corporation. Sales of Computer products was good. However, Computer's Parts and Display Segment sale has been decreasing.

I only want to extract 'Computer's Parts and Display Segment'.

So I tried to create a regex

((?:(?:[A-Z]+[a-zA-Z\-0-9\']*\.?\s?(?:and |\& )?)+)+?(?:[S|s]ales?\s))

( 1.[AZ]+[a-zA-Z-0-9\\']*.?\\s => this part is to find words start with capital letter and other parts are composed of az or AZ or 0-9 or - or ' or . .

  1. (?:and |\\& )? => this part is to match word with and or & )

However, at https://regex101.com/ it calls out catastrophic backtracking and I read some related articles, but still cannot find way to solve this problem.

Could you help me?

Thanks!

Overview

Pointing out a few things in your pattern:

  • [a-zA-Z\\-0-9\\'] You don't need to escape ' here. Also, you can just place - at the start or end of the set and you won't need to escape it.
  • \\& The ampersand character doesn't need to be escaped.
  • [S|s] Says to match either S , | , or s , thus you could potentially match |ales . The correct way to write this is [Ss] .

Code

See regex in use here

(?:(?:[A-Z][\w'-]*|and) +)+(?=[sS]ales?)

Results

Input

;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived approximately 21% ($4,782,852) of its consolidated revenues from continuing operations from direct transactions with Kmart Corporation. Sales of Computer products was good. However, Computer's Parts and Display Segment sale has been decreasing.

Output

Computer's Parts and Display Segment 

Explanation

  • (?:(?:[AZ][\\w'-]*|and) +)+ Match this one or more times
    • (?:[AZ][\\w'-]*|and) Match either of the following
      • [AZ][\\w'-]* Match any uppercase ASCII character, followed by any number of word characters, apostrophes ' or hyphens -
      • and Match this literally
    • + Match one or more spaces
  • (?=[sS]ales?) Positive lookahead ensuring any of the words sale , Sale , sales , or Sales follows

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM