
Python: split a string on the all-caps words

I have a series of text files formatted as follows:

text = 'COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20'

I eventually need to get these into a pandas dataframe where COMPANY NAME, TYPE OF EVENT, and NOTIFIED DATE are the column headers and the text in between fills up the rows. A first step is just to figure out how to split the text wherever there is a ":" preceded by one or more all-caps words. So, some output like:

res = ['COMPANY NAME', 'Ruff name of company', 'TYPE OF EVENT', 'Party', ...]

I am very new to regex and cannot figure out how to get this match to work. I tried the following:

re.findall('[A-Z]+[A-Z]+[A-Z]', text)

I recognize I'm not even close. I have also looked at lots of other similar questions and failed to adapt them to my use case.

Other posts:

Capture all consecutive all-caps words with regex in python?

Python Regex catch multi caps words and adjacent words

Find the line with all caps in Regex Python

Any help would be appreciated, thanks!

The values that follow the all-uppercase key and the colon : can themselves start with another uppercase char or a digit.

One option is to use re.findall and get the values using 2 capturing groups. This will return tuples of the 2 group values.

You might use:

\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))

The pattern will match

  • \b Word boundary
  • ( Capture group 1
    • [A-Z]+ Match 1+ uppercase chars
    • (?:[^\S\r\n]+[A-Z]+)* Optionally repeat 1+ whitespace chars without a newline and 1+ uppercase chars
  • ): Close group 1 and match the colon
  • [^\S\r\n]+ Match 1+ whitespace chars without a newline
  • ( Capture group 2
    • [A-Z0-9] Match an uppercase char A-Z or a digit
    • .*? Match any char except a newline, as few as possible
    • (?= [A-Z]|$) Positive lookahead, assert that what is to the right is either a space followed by an uppercase char A-Z, or the end of the string. (use \Z instead of $ if a trailing newline must not be allowed)
  • ) Close group 2
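
As a reading aid, the breakdown above can also be written with re.VERBOSE, which lets each part of the pattern carry its own comment. This is only a sketch of the same pattern: the names pattern, sample and verbose_matches are made up for the illustration, and the literal space in the lookahead is written as [ ] because verbose mode ignores unescaped whitespace.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so every part is commented.
pattern = re.compile(r"""
    \b                          # word boundary
    (                           # group 1: the all-caps key
        [A-Z]+                  #   1+ uppercase chars
        (?:[^\S\r\n]+[A-Z]+)*   #   optionally more all-caps words
    )
    :                           # the colon after the key
    [^\S\r\n]+                  # 1+ whitespace chars, no newline
    (                           # group 2: the value
        [A-Z0-9]                #   starts with an uppercase char or a digit
        .*?                     #   any chars, as few as possible
        (?=[ ][A-Z]|$)          #   stop before space + uppercase, or at the end
    )
""", re.VERBOSE)

# Hypothetical short sample, taken from the end of the string in the question.
sample = "TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20"
verbose_matches = pattern.findall(sample)
print(verbose_matches)
# [('TYPE OF EVENT', 'Fire'), ('NOTIFIED DATE', '1/31/20')]
```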


For example

import re

regex = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))"
test_str = "COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20"
print(re.findall(regex, test_str))

Output

[('COMPANY NAME', 'Ruff name of company'), ('TYPE OF EVENT', 'Party'), ('NOTIFIED DATE', '1/27/20  '), ('COMPANY NAME', 'Company2/CPT'), ('TYPE OF EVENT', 'Fire'), ('NOTIFIED DATE', '1/31/20')]
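
Note that '1/27/20  ' keeps two trailing spaces, because the lookahead only stops at the last space before the next uppercase char. Since the end goal in the question is a dataframe, here is a rough sketch of how the tuples could be grouped into one dict per row. It assumes that a key showing up a second time marks the start of a new record; the names records and current are just for illustration.

```python
import re

regex = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))"
test_str = "COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20"

records = []   # one dict per row
current = {}   # the row being filled
for key, value in re.findall(regex, test_str):
    # Assumption: seeing a key that is already in the row starts a new row.
    if key in current:
        records.append(current)
        current = {}
    current[key] = value.strip()  # drop the trailing spaces noted above
if current:
    records.append(current)

print(records)
# [{'COMPANY NAME': 'Ruff name of company', 'TYPE OF EVENT': 'Party', 'NOTIFIED DATE': '1/27/20'},
#  {'COMPANY NAME': 'Company2/CPT', 'TYPE OF EVENT': 'Fire', 'NOTIFIED DATE': '1/31/20'}]
```

Passing records to pandas.DataFrame(records) should then give one column per key, though that step is not tested here.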

To get all items in a single flat list as in your question, you might also use re.finditer and append the group values to a list.
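
A minimal sketch of that re.finditer variant, reusing the regex and test string from the example. The .strip() call is an addition of mine to remove the trailing spaces after the first date; leave it out if the raw group values are wanted.

```python
import re

regex = r"\b([A-Z]+(?:[^\S\r\n]+[A-Z]+)*):[^\S\r\n]+([A-Z0-9].*?(?= [A-Z]|$))"
test_str = "COMPANY NAME:   Ruff name of company TYPE OF EVENT: Party NOTIFIED DATE: 1/27/20   COMPANY NAME: Company2/CPT TYPE OF EVENT: Fire NOTIFIED DATE: 1/31/20"

res = []
for match in re.finditer(regex, test_str):
    res.append(match.group(1))          # the all-caps key
    res.append(match.group(2).strip())  # the value, without trailing spaces

print(res)
# ['COMPANY NAME', 'Ruff name of company', 'TYPE OF EVENT', 'Party', 'NOTIFIED DATE', '1/27/20',
#  'COMPANY NAME', 'Company2/CPT', 'TYPE OF EVENT', 'Fire', 'NOTIFIED DATE', '1/31/20']
```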
