I wonder if it's possible to write a regex for the following data pattern:
'152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
string = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
I am using this Regular Expression (Using Python's re module) to extract these names:
re.findall(r'(\d+): (.+), (.+), (.+), (.+).', string, re.M | re.S)
Result:
[('152', 'Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD')]
With a different number of names (fewer or more than 4) this no longer works, because the regex expects to find exactly 4 of them:
(.+), (.+), (.+), (.+).
I can't find a way to generalize this pattern.
A regular expression probably isn't the best way to solve this. You could use split():
>>> s = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
>>> s.split(": ")
['152', 'Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.']
>>> s.split(": ")[1].split(", ")
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD.']
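The two split() calls above can be wrapped in a small helper that also drops the trailing period and copes with any number of names. This is only a sketch: parse_record is a made-up name for illustration, and it assumes every line looks like 'number: name, name, ... .' with the first ': ' separating the number from the names.

```python
def parse_record(line):
    """Split 'ID: name, name, ... .' into (id, [names]).

    parse_record is a hypothetical helper, not part of any library.
    Assumes the first ': ' separates the number from the name list.
    """
    number, _, rest = line.partition(": ")
    # Drop surrounding whitespace and a single trailing period, if present.
    rest = rest.strip()
    if rest.endswith("."):
        rest = rest[:-1]
    # Split on commas and ignore empty pieces.
    names = [name.strip() for name in rest.split(",") if name.strip()]
    return number, names


s = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
print(parse_record(s))
```

Because the names are split on commas rather than matched by position, the same function handles one name or ten.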
This should do the trick if you only want the stuff after the numbers:
re.findall(r'\d+: (.+)(?:, .+)*\.', input, re.M | re.S)
And if you want everything:
re.findall(r'(\d+): (.+)(?:, .+)*\.', input, re.M | re.S)
And if you want to get them separated out into a list of matches, a nested regex will do it:
re.findall(r'[^,]+,|[^,]+$', re.findall(r'\d+: (.+)(?:, .+)*\.', input, re.M | re.S)[0],re.M|re.S)
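Note that the nested call keeps the separators in its pieces (the items come back with trailing commas and leading spaces). One way around that, sketched here under the assumption that each record is a single line ending in a period, is to capture the whole name list once and then split it with re.split. The variable name text is my own choice, used instead of input so the built-in is not shadowed.

```python
import re

text = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'

# Capture the number and the whole comma-separated name list in one match.
# Assumption: the record ends with a single period.
match = re.match(r'(\d+): (.+)\.\s*$', text)
if match:
    number, name_list = match.groups()
    # Split on a comma plus optional whitespace so no separators remain.
    names = re.split(r',\s*', name_list)
    print(number, names)
```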
If you mean that there may be more (or fewer) names, could you try something like this: (\\d+): (.+)* ? The asterisk (*) means zero or more occurrences of (.+).
I can get close, but further processing may be necessary. It is probably better to do manual string splitting, especially if the data is reliably well-formatted.
import re
string1 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
string2 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'
for i in [string1, string2]:
    # Each match fills either the number group or the name group, never both.
    print(re.findall(r'(\d+):|(?:[.,\s?])?(.*?)(?:[.,])', i))
[('152', ''), ('', 'Ashkenazi A'), ('', 'Benlifer A'), ('', 'Korenblit J'), ('', 'Silberstein SD')]
[('152', ''), ('', 'Ashkenazi A'), ('', 'Benlifer A'), ('', 'Korenblit J'), ('', 'Silberstein SD'), ('', 'Hattingh CJR')]
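The "further processing" for the tuples above could look like the following sketch. It relies on the observation that each tuple has exactly one non-empty slot; the variable names pairs, number and names are just illustrative.

```python
import re

string2 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'

pairs = re.findall(r'(\d+):|(?:[.,\s?])?(.*?)(?:[.,])', string2)

# Each tuple is (number, '') or ('', name); pick out the non-empty slot.
number = next(num for num, _ in pairs if num)
names = [name for _, name in pairs if name]

print(number)
print(names)
```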
If you are willing to use two regular expressions, it can be done fairly painlessly:
import re
string1 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD.'
string2 = '152: Ashkenazi A, Benlifer A, Korenblit J, Silberstein SD, Hattingh CJR.'
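for i in [string1, string2]:
    # First pattern: the leading number before the colon.
    print(re.findall(r'^(\d+):', i))
    # Second pattern: each "Surname INITIALS" that follows ': ' or ', '.
    print(re.findall(r'(?:[:,] )(\S+ [A-Z]+)(?=[\.,])', i))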
produces
['152']
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD']
['152']
['Ashkenazi A', 'Benlifer A', 'Korenblit J', 'Silberstein SD', 'Hattingh CJR']