
Regex capture all before substring

I have a string:

s = 'Abc - 33 SR 11 Kill(s) P G - (Type-1P-G) 2 Kill(s) M 1 Kill(s) S - M9A CWS 1 Kill(s) 11 Kill(s)'

I'm trying to split this up to capture the number of kills, and the information before each "XY Kill(s)" to get this output:

['Abc - 33 SR', 
 'P G - (Type-1P-G)', 
 'M', 
 'S - M9A CWS']

Getting the number of kills was simple:

re.findall(r"(\d+) Kill", s)
['11', '2', '1', '1', '11']

Getting the text has been harder. After some research, I tried the following lookahead, which only returned a series of empty matches (one for each position where the lookahead succeeds):

re.findall(r"(?=[0-9]+ Kill)", s)
['', '', '', '', '', '', '']

I then changed this to add in "any number of characters before each group".

re.findall(r"(.+)(?=[0-9]+ Kill)", s)
['Abc - 33 SR 11 Kill(s) P G - (Type-1P-G) 2 Kill(s) M 1 Kill(s) S - M9A CWS 1 Kill(s) 1']

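This returns almost the entire string as a single match, because the greedy .+ runs all the way to the last place the lookahead can succeed. How can I adjust this to capture everything before each "any number of digits-space-Kill"?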

Let's get the dupes out of the way. I've consulted the following. The second in particular looked useful but I've been unable to make it suit this purpose.

Extract Number before a Character in a String Using Python

How would I get everything before a : in a string Python

how to get the last part of a string before a certain character?

You may use

re.findall(r'(.*?)\s*(\d+) Kill\(s\)\s*', s)

See the regex demo

Details

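  • (.*?) - Capturing group 1: zero or more characters other than line break characters, as few as possible
  • \s* - zero or more whitespace characters
  • (\d+) - Capturing group 2: one or more digits
  •  Kill\(s\) - a space followed by the literal substring Kill(s) (the parentheses are escaped so they are not treated as a group)
  • \s* - zero or more whitespace characters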
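
As a minimal sketch of the same pattern, the two groups can be given names and read back with re.finditer. The group names label and kills and the variable records are hypothetical, chosen here only for readability; the pattern itself is unchanged.

```python
import re

s = "Abc - 33 SR 11 Kill(s) P G - (Type-1P-G) 2 Kill(s) M 1 Kill(s) S - M9A CWS 1 Kill(s) 11 Kill(s)"

# Same pattern as above; "label" and "kills" are illustrative group names.
pattern = re.compile(r"(?P<label>.*?)\s*(?P<kills>\d+) Kill\(s\)\s*")

# Build (text, kill count) pairs, converting the count to an int.
records = [(m.group("label"), int(m.group("kills"))) for m in pattern.finditer(s)]
print(records)
```

The final pair has an empty label, because the last "11 Kill(s)" in the sample string has no text in front of it.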

Python demo:

import re
rx = r"(.*?)\s*(\d+) Kill\(s\)\s*"
s = "Abc - 33 SR 11 Kill(s) P G - (Type-1P-G) 2 Kill(s) M 1 Kill(s) S - M9A CWS 1 Kill(s) 11 Kill(s)"
print(re.findall(rx, s))

Output:

[('Abc - 33 SR', '11'), ('P G - (Type-1P-G)', '2'), ('M', '1'), ('S - M9A CWS', '1'), ('', '11')]
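
The last tuple has an empty first element because the final "11 Kill(s)" has no text before it. To get the four-item list from the question, one option is to drop the empty labels afterwards. This is a small sketch; the names pairs, labels and kills are illustrative only.

```python
import re

s = "Abc - 33 SR 11 Kill(s) P G - (Type-1P-G) 2 Kill(s) M 1 Kill(s) S - M9A CWS 1 Kill(s) 11 Kill(s)"

pairs = re.findall(r"(.*?)\s*(\d+) Kill\(s\)\s*", s)

# Keep only the non-empty text preceding each "N Kill(s)".
labels = [label for label, _ in pairs if label]
# Keep every kill count, including the one with no preceding text.
kills = [int(n) for _, n in pairs]

print(labels)
print(kills)
```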

You can use re.split() to get a list of all content between matches.

>>> re.split(r"\d+ Kill\(s\)", s)
    ['Abc - 33 SR ', ' P G - (Type-1P-G) ', ' M ', ' S - M9A CWS ', ' ', '']

You can clean it up to remove whitespace and empty strings.

>>> [part.strip() for part in re.split(r"\d+ Kill\(s\)", s) if part.strip()]
    ['Abc - 33 SR', 'P G - (Type-1P-G)', 'M', 'S - M9A CWS']
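
If the kill counts are needed alongside the text, re.split() can also keep them: when the pattern contains a capturing group, the captured text is included in the result list. The sketch below relies on that documented behaviour; the names parts and pairs are illustrative, and the surrounding whitespace is absorbed by \s* in the pattern instead of being stripped afterwards.

```python
import re

s = "Abc - 33 SR 11 Kill(s) P G - (Type-1P-G) 2 Kill(s) M 1 Kill(s) S - M9A CWS 1 Kill(s) 11 Kill(s)"

# The capturing group makes re.split() return the digits between the text pieces:
# [text, count, text, count, ..., trailing text]
parts = re.split(r"\s*(\d+) Kill\(s\)\s*", s)

# Even indexes hold the text, odd indexes hold the counts; zip stops at the shorter list.
pairs = list(zip(parts[0::2], parts[1::2]))
print(pairs)
```

As with the findall approach, the last pair has an empty text element, which can be filtered out if it is not wanted.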
