
How do I match a variable pattern with a regular expression in Python?

I have lines like the ones below, containing numbers and strings. Some have only numbers in the parentheses, while others also have a word before them:

'abc'            (17245...64590)
'cde'            (12244...67730)
'dsa'            complement (12345...67890)

I would like to extract both formats, with and without the leading word. So, for the first two lines the result should contain only the numbers, while for the third line it should also contain the word before the numbers.

I am using this command to try to achieve this:

result = re.findall("\bcomplement\b|\d+", line)

Any idea how to do this? The expected output would look like this:

17245, 64590
12244, 67730
complement, 12345, 67890

If the number of digit chunks inside the parentheses is always 2 and they are separated by one or more dots, use:

re.findall(r'\s{2,}(?:(\w+)\s*)?\((\d+)\.+(\d+)\)', s)

See the regex demo. Here is a sample Python demo:

import re

s = ''''abc'            (17245...64590)
'cde'            (12244...67730)
'dsa'            complement (12345...67890)'''
rx = r"\s{2,}(?:(\w+)\s*)?\((\d+)\.+(\d+)\)"
for x in re.findall(rx, s):
    # x is a tuple of the three groups; skip the optional group when it is empty
    print(", ".join([y for y in x if y]))

Details

  • \s{2,} - 2 or more whitespaces
  • (?:(\w+)\s*)? - an optional sequence of:
    • (\w+) - Group 1: one or more word chars
    • \s* - 0+ whitespaces
  • \( - a ( char
  • (\d+) - Group 2: one or more digits
  • \.+ - 1 or more dots
  • (\d+) - Group 3: one or more digits
  • \) - a ) char.
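
To make the group numbering above concrete, here is a minimal sketch that prints the raw tuples returned by re.findall for two of the sample lines. The variable names (sample, matches) are only illustrative. It shows that when the optional Group 1 does not take part in a match, it comes back as an empty string, which is why the demo filters out empty values before joining.

```python
import re

# Pattern from the answer: optional word, then two digit chunks in parentheses.
rx = r"\s{2,}(?:(\w+)\s*)?\((\d+)\.+(\d+)\)"

# Two sample lines taken from the question.
sample = "'abc'            (17245...64590)\n'dsa'            complement (12345...67890)"

# With several capturing groups, re.findall returns one tuple per match;
# a group that did not participate in the match is an empty string.
matches = re.findall(rx, sample)
for word, first, second in matches:
    print(repr(word), first, second)
```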

If the number of digit chunks inside the parentheses can vary, you may use:

import re

s = ''''abc'            (17245...64590)
'cde'            (12244...67730)
'dsa'            complement (12345...67890)'''
for m in re.finditer(r'\s{2,}(?:(\w+)\s*)?\(([\d.]+)\)', s):
    res = []
    # Group 1 is the optional word before the parentheses
    if m.group(1):
        res.append(m.group(1))
    # Group 2 holds the digits and dots; pull out each digit chunk
    res.extend(re.findall(r'\d+', m.group(2)))
    print(", ".join(res))

Both Python snippets output:

17245, 64590
12244, 67730
complement, 12345, 67890

See the online Python demo. Note that the second snippet can match any number of digit chunks inside the parentheses, and that both patterns assume there are at least 2 whitespace chars between Column 1 and Column 2.
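
As a quick check of the "any number of digit chunks" claim, the sketch below wraps the second approach in a helper and feeds it a line with three chunks. The helper name extract_chunks and the three-chunk input line are made up for illustration; they are not part of the original question.

```python
import re

def extract_chunks(text):
    """Hypothetical helper: one list per match, optional word first, then digit chunks."""
    rows = []
    for m in re.finditer(r"\s{2,}(?:(\w+)\s*)?\(([\d.]+)\)", text):
        row = [m.group(1)] if m.group(1) else []
        # Split the captured "digits and dots" run into its digit chunks.
        row.extend(re.findall(r"\d+", m.group(2)))
        rows.append(row)
    return rows

# The first line is an invented three-chunk example; the second is from the question.
rows = extract_chunks(
    "'xyz'            complement (10..20..30)\n"
    "'abc'            (17245...64590)"
)
for row in rows:
    print(", ".join(row))
```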

See the regex demo, too. The difference from the first pattern is that there is no third group: the second and third groups are replaced with a single second group, ([\d.]+), that captures 1 or more dots or digits (the digits are later extracted with re.findall(r'\d+', m.group(2))).
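
As a side note, one likely reason the attempt in the question did not return "complement" is that its pattern is not a raw string: in a regular Python string literal, "\b" is a backspace character rather than a regex word boundary. The sketch below compares the two forms on the third sample line. The non-raw pattern is written with "\\d" so that it has the same string value as in the question without triggering an invalid-escape warning on newer Python versions.

```python
import re

line = "'dsa'            complement (12345...67890)"

# Same string value as the question's pattern: "\b" here is a backspace (\x08).
non_raw = "\bcomplement\b|\\d+"
# Raw string: "\b" reaches the regex engine as a word boundary.
raw = r"\bcomplement\b|\d+"

without_word = re.findall(non_raw, line)
with_word = re.findall(raw, line)
print(without_word)
print(with_word)
```

This simpler pattern picks up every digit run on the line, so it only fits when the lines contain no other numbers.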
