
Help on Regular Expression problem

I wonder if it's possible to write a regex for the following data pattern:

'152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

string = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

I am using this Regular Expression (Using Python's re module) to extract these names:

re.findall(r'(\d+): (.+), (.+), (.+), (.+).', string, re.M | re.S)

Result:

[('152', 'Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD')]

With a different number of names (fewer or more than four), this no longer works, because the regex expects to find exactly four of them:

(.+), (.+), (.+), (.+).

I can't find a way to generalize this pattern.

A regular expression probably isn't the best way to solve this. You could use split():

>>> s = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
>>> s.split(": ")
['152', 'Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.']
>>> s.split(": ")[1].split(", ")
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD.']
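The split() idea above can be wrapped into a small helper that also drops the trailing period. This is only a sketch: the name parse_entry is made up for illustration, and it assumes every entry has the form "NUMBER: name, name, ... name." with a single ": " after the number.

```python
def parse_entry(line):
    """Split 'NUMBER: name, name, ... name.' into (number, [names])."""
    # partition() splits on the first ": " only, so a later ": " would stay in the names part.
    number, _, rest = line.partition(": ")
    # Assumption: the entry ends with a period that is not part of the last name.
    names = rest.rstrip(".").split(", ")
    return number, names


string1 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
string2 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'
for s in (string1, string2):
    print(parse_entry(s))
```

Because nothing here depends on the number of names, the same helper handles four names, five names, or one.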

This should do the trick if you only want the stuff after the numbers:

re.findall(r'\d+: (.+)(?:, .+)*\.', string, re.M | re.S)

And if you want everything:

re.findall(r'(\d+): (.+)(?:, .+)*\.', string, re.M | re.S)

And if you want to get them separated out into a list of matches, a nested regex will do it:

re.findall(r'[^,]+,|[^,]+$', re.findall(r'\d+: (.+)(?:, .+)*\.', string, re.M | re.S)[0], re.M | re.S)
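The nested findall above appears to keep the separating commas and leading spaces in each piece. A two-step variant that may be easier to read is sketched below: one regex captures the number and the whole name list, and re.split() then breaks the list apart. The names ENTRY and extract are made up for illustration, and the sketch assumes one entry per line, each ending in a period.

```python
import re

# One entry per line (assumption): "NUMBER: name, name, ... name."
# re.M lets ^ and $ match at line boundaries; re.S is left out so .+ stays on one line.
ENTRY = re.compile(r'^(\d+): (.+)\.$', re.M)


def extract(text):
    """Return a list of (number, [names]) tuples, one per matching line."""
    results = []
    for number, names in ENTRY.findall(text):
        # Split on a comma followed by optional whitespace.
        results.append((number, re.split(r',\s*', names)))
    return results


text = ('152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.\n'
        '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.')
for entry in extract(text):
    print(entry)
```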

If you mean that there may be more (or fewer) names, maybe you could try something like this: (\\d+): (.+)* ? The asterisk (*) means zero or more occurrences of (.+).
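One caveat with repeating a capturing group: in Python's re module a group that matches several times keeps only its last match, so a single pattern of this kind cannot return every name as its own group. The sketch below illustrates this with a simplified pattern; the sub-pattern \w+ \w+ is an assumption that each name is two runs of word characters.

```python
import re

s = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

# The inner group (\w+ \w+) is repeated by the outer (?: ... )+,
# but only its final repetition survives in the match object.
m = re.match(r'(\d+): (?:(\w+ \w+)(?:, |\.))+', s)
print(m.groups())   # ('152', 'Silberstein SD')

# A common workaround: let findall() do the repetition instead of the pattern.
names = re.findall(r'(\w+ \w+)(?=[,.])', s)
print(names)        # ['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD']
```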

I can get close, but further processing may be necessary. It is probably better to do manual string splitting, especially if the data is reliably well-formatted.

Code

import re
string1 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
string2 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'
for i in [string1, string2]:
    print(re.findall(r'(\d+):|(?:[.,\s?])?(.*?)(?:[.,])', i))

Output

[('152', ''), ('', 'Ashkenazi A'), ('', 'Benlifer A'), ('', 'Korenblit J'), ('', 'Silberstein SD')]
[('152', ''), ('', 'Ashkenazi A'), ('', 'Benlifer A'), ('', 'Korenblit J'), ('', 'Silberstein SD'), ('', 'Hattingh CJR')]

Edit: using 2 expressions

If you are willing to use two regular expressions, it can be done fairly painlessly:

import re
string1 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
string2 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'
for i in [string1, string2]:
    print(re.findall(r'^(\d+):', i))
    print(re.findall(r'(?:[:,] )(\S+ [A-Z]+)(?=[\.,])', i))

produces

['152']
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD']
['152']
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD', 'Hattingh CJR']
