Help on Regular Expression problem

Question

I wonder if it's possible to write a regex for the following data pattern:

'152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

string = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

I am using this regular expression (with Python's re module) to extract the names:

re.findall(r'(\d+): (.+), (.+), (.+), (.+).', string, re.M | re.S)

Result:

[('152', 'Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD')]

With a different number of names (fewer or more than 4), this no longer works, because the regex expects to find exactly 4 of them:

(.+), (.+), (.+), (.+).

I can't find a way to generalize this pattern.

Answer 1

A regular expression probably isn't the best way to solve this. You could use split():

>>> s = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
>>> s.split(": ")
['152', 'Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.']
>>> s.split(": ")[1].split(", ")
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD.']

Answer 2

This should do the trick if you only want the stuff after the numbers:

re.findall(r'\d+: (.+)(?:, .+)*\.', input, re.M | re.S)

And if you want everything:

re.findall(r'(\d+): (.+)(?:, .+)*\.', input, re.M | re.S)

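And if you want to get them separated out into a list of matches, a nested regex will do it (each piece except the last keeps its trailing comma, and each piece except the first keeps its leading space, so strip those afterwards):

re.findall(r'[^,]+,|[^,]+$', re.findall(r'\d+: (.+)(?:, .+)*\.', input, re.M | re.S)[0], re.M | re.S)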
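A possible variation on the nested approach, sketched here under the assumption that names never contain a period: one pattern captures the number and the whole name list, and re.split() then breaks the list on commas. The function name extract_records and the patterns are illustrative, not taken from the answer above.

```python
import re


def extract_records(text):
    """Return [(number, [names]), ...] for every 'NNN: a, b, c.' record.

    extract_records is a hypothetical name; the pattern assumes that
    names contain no '.' so the first period ends the record.
    """
    records = []
    for number, names in re.findall(r'(\d+): ([^.]+)\.', text):
        # split on a comma followed by optional whitespace
        records.append((number, re.split(r',\s*', names)))
    return records


text = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
records = extract_records(text)
print(records)
```

Since the name list is split after matching, the result has clean names with no separators attached, however many names a record holds.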

Answer 3

If you mean that there may be more or fewer names, maybe try something like this: (\\d+): (.+)* The asterisk (*) means zero or more occurrences of (.+).
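One caveat with repeating a capturing group: Python's re module keeps only the last repetition of a group, so a single pattern of this kind cannot return a variable number of names. A minimal sketch illustrating this, where the pattern and variable names are illustrative assumptions rather than part of the answer above:

```python
import re

s = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

# The inner group ([^,.]+) is repeated once per name, but after the match
# only the text from its final repetition is still available.
m = re.match(r'(\d+): (?:([^,.]+)(?:, |\.))+', s)
groups = m.groups()
print(groups)  # ('152', 'Silberstein SD')
```

The earlier names were matched but not retained, which is why the other answers fall back on findall() or split() to collect them.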

Answer 4

I can get close, but further processing may be necessary. It is probably better to do manual string splitting, especially if the data is reliably well-formatted.

Code

import re
string1 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
string2 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'
for i in [string1, string2]:
    print(re.findall(r'(\d+):|(?:[.,\s?])?(.*?)(?:[.,])', i))

Output

[('152', ''), ('', 'Ashkenazi A'), ('', 'Benlifer A'), ('', 'Korenblit J'), ('', 'Silberstein SD')]
[('152', ''), ('', 'Ashkenazi A'), ('', 'Benlifer A'), ('', 'Korenblit J'), ('', 'Silberstein SD'), ('', 'Hattingh CJR')]

Edit: using 2 expressions

If you are willing to use two regular expressions, it can be done fairly painlessly:

import re
string1 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
string2 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'
for i in [string1, string2]:
    print(re.findall(r'^(\d+):', i))
    print(re.findall(r'(?:[:,] )(\S+ [A-Z]+)(?=[\.,])', i))

produces

['152']
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD']
['152']
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD', 'Hattingh CJR']

4 answers

solution1
6 2010-07-23 22:08:47

solution2
1 ACCEPTED 2010-07-23 22:08:43

solution3
0 2010-07-23 22:09:25

solution4
0 2010-07-23 22:10:34
