
regex - negative expression matching

Problem Introduction

So I've fried my brain trying to get negative lookaheads/lookbehinds to work. For the last example input, my current solution returns no match (see the expected output table). I'm struggling with how to match the title part of the string when it includes a year that is not at the end of the string. To be clear, I'm only interested in matching the year if it is at the end of the string. The current regex fails on the last example because the title group matches NOT("Q" OR "\d*"). However, I only want it to match NOT("Q" AND "\d{1}"). Any tips/suggestions greatly appreciated. Note: I'm using Python 3.8.

Example Input

AXP - Earnings call Q2 2021
AXP - Conference call 2021
BAC,BAC.PE,BAC.PL,BACRP,BML.PL,BML.PJ,BML.PH,BML.PG,BAC.PB,BAC.PK,BAC.PM,BAC.PN - Earnings call Q2 2021
GM - General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference
AXP - American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference

The period will always be of the form Q[1-4]. The period and year are optional. If they do occur, they will be at the end of the string. The symbol and title are always separated by - and always occur.

Expected Output

| symbol | title | period | year |
| --- | --- | --- | --- |
| AXP | Earnings call | Q2 | 2021 |
| AXP | Conference call | | 2021 |
| BAC | Earnings call | Q2 | 2021 |
| GM | General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference | | |
| AXP | American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference | | |

What I've Tried

r"^(?P<symbol>[^\,]{1,8})(\,[A-Z\.]+)*\s\-\s(?P<title>[^Q\d]*)\s?(?P<period>Q\d)?\s?(?P<year>19|20\d{2})$"

You can use

^(?P<symbol>[^,]{1,8})(?:,[A-Z.]*)*\s+-\s+(?P<title>(?:(?!Q\d).)*?)\s*(?P<period>Q\d)?\s?(?P<year>(?:19|20)\d{2})?$

See the regex demo.
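As a quick check in Python, here is a minimal sketch that runs the pattern above against the five example inputs from the question. The names `PATTERN`, `EXAMPLES` and `parse` are made up for this illustration; only the regex itself comes from the answer.

```python
import re

# The pattern from the answer, split over several string literals for readability.
PATTERN = re.compile(
    r"^(?P<symbol>[^,]{1,8})(?:,[A-Z.]*)*\s+-\s+"
    r"(?P<title>(?:(?!Q\d).)*?)\s*"
    r"(?P<period>Q\d)?\s?"
    r"(?P<year>(?:19|20)\d{2})?$"
)

# Example inputs copied from the question.
EXAMPLES = [
    "AXP - Earnings call Q2 2021",
    "AXP - Conference call 2021",
    "BAC,BAC.PE,BAC.PL,BACRP,BML.PL,BML.PJ,BML.PH,BML.PG,BAC.PB,BAC.PK,BAC.PM,BAC.PN - Earnings call Q2 2021",
    "GM - General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference",
    "AXP - American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference",
]


def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None


for line in EXAMPLES:
    print(parse(line))
```

Groups that did not take part in the match (a missing period or year) come back as `None` in the dict.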

Note:

  • [^Q\d]* is wrong because it matches zero or more chars other than Q and digits, so it stops at any Q or any digit. You need to match any text up to a Q + digit, that is, a (?:(?!Q\d).)*? tempered greedy token
  • (?P<year>19|20\d{2}) makes the year obligatory, but it must be optional. Also, 19|20 are not grouped, so \d{2} is only applied to 20 and the group matches either 19 or 20\d{2}. The fix: (?P<year>19|20\d{2}) => (?P<year>(?:19|20)\d{2})?.
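Both points can be checked in isolation. The sketch below is only an illustration (the variable names are invented here) and uses `re.fullmatch` on small fragments rather than the full pattern:

```python
import re

# Point 1: a negated character class vs. a tempered greedy token.
char_class = re.compile(r"[^Q\d]*")     # rejects ANY "Q" and ANY digit
tempered = re.compile(r"(?:(?!Q\d).)*")  # rejects only "Q" followed by a digit

title_with_year = "Barclays 2020 Global"
class_ok = char_class.fullmatch(title_with_year) is not None    # False: digits present
tempered_ok = tempered.fullmatch(title_with_year) is not None   # True: no "Q" + digit
tempered_stops = tempered.fullmatch("Earnings call Q2") is None  # True: "Q2" is blocked

# Point 2: alternation binds more loosely than concatenation.
ungrouped = re.compile(r"19|20\d{2}")     # means "19" OR "20" + two digits
grouped = re.compile(r"(?:19|20)\d{2}")   # means ("19" OR "20") + two digits

ungrouped_1999 = ungrouped.fullmatch("1999") is not None  # False
ungrouped_19 = ungrouped.fullmatch("19") is not None      # True, which is not a year
grouped_1999 = grouped.fullmatch("1999") is not None      # True

print(class_ok, tempered_ok, tempered_stops)
print(ungrouped_1999, ungrouped_19, grouped_1999)
```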

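There are other small enhancements here: the unnecessary escapes (\, and \. and \-) are removed, the group for the extra comma-separated symbols is non-capturing, and \s+ is used on both sides of the hyphen.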

Details:

  • ^ - start of string
  • (?P<symbol>[^,]{1,8}) - Group "symbol": one to eight chars other than a comma
  • (?:,[A-Z.]*)* - zero or more repetitions of a comma and then zero or more uppercase letters/dots
  • \s+-\s+ - a hyphen enclosed with one or more whitespaces
  • (?P<title>(?:(?!Q\d).)*?) - Group "title": any char other than a line break char, zero or more but as few as possible occurrences, that does not start a Q+digit char sequence
  • \s* - zero or more whitespaces
  • (?P<period>Q\d)? - an optional Group "period": a Q and a digit
  • \s? - an optional whitespace
  • (?P<year>(?:19|20)\d{2})? - an optional Group "year": 19 or 20 and then two digits
  • $ - end of string.
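If you want to keep this breakdown next to the pattern in your code, one option is the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments. This is just a sketch of the same regex laid out that way; the name `VERBOSE_PATTERN` is an assumption for the example.

```python
import re

# Same regex as above, written with re.VERBOSE so each part carries its own comment.
VERBOSE_PATTERN = re.compile(
    r"""
    ^                          # start of string
    (?P<symbol>[^,]{1,8})      # one to eight chars other than a comma
    (?:,[A-Z.]*)*              # extra comma-separated symbols, not captured
    \s+-\s+                    # a hyphen enclosed with whitespace
    (?P<title>(?:(?!Q\d).)*?)  # lazy title that never starts a Q+digit sequence
    \s*                        # zero or more whitespaces
    (?P<period>Q\d)?           # optional period, e.g. Q2
    \s?                        # an optional whitespace
    (?P<year>(?:19|20)\d{2})?  # optional year: 19 or 20 and then two digits
    $                          # end of string
    """,
    re.VERBOSE,
)

m = VERBOSE_PATTERN.match("AXP - Earnings call Q2 2021")
print(m.group("symbol"), m.group("title"), m.group("period"), m.group("year"))
```

This works here because the pattern contains no literal spaces or `#` characters; if it did, they would have to be escaped or placed in a character class under `re.VERBOSE`.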


 