This is a follow-up to a previous question of mine. I have identified the problem more clearly and would appreciate some further suggestions :)
I have a string, produced by a machine learning algorithm, which generally looks something like this:
Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 [
I need to extract the two names and the numbers, and check whether one of the lines starts with the special character (>> in the example), so my output should be:
name_01 = 'Connery'
digits_01 = [3, 5, 7, 4]
name_02 = 'R. Moore'
digits_02 = [4, 5, 67, 5]
selected_line = 2  # or anything else indicating that it's the second line
In the linked original question, it was suggested that I use:
import re

inp = '''Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    # collect every run of word characters (letters, digits, underscore)
    matches = re.findall(r'\w+', line)
    print(matches)
which produces a result pretty close to what I want:
['Connery', '3', '5', '7', '4']
['R', 'Moore', '4', '5', '67', '5']
But I need the first two strings in the second line ('R', 'Moore') to be grouped together (basically, everything before the digits begin should form one group). It also fails to detect the special character. Should I somehow post-process this output, or should I tackle the problem in a different way altogether?
I am not sure which characters you expect, or which ones you want to keep or remove, but something like the following should work for the example:
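import re

inp = '''Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    # first alternative: a name made of letters/dots with inner whitespace;
    # second alternative: any other run of word characters (the numbers)
    matches = re.findall(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+', line)
    print(matches)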
output:
['Connery', '3', '5', '7', '4']
['R. Moore', '4', '5', '67', '5']
NB. I included a-z (lower and upper case) and the dot, with optional whitespace in the middle: [a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.], but you should adapt this to your real needs.
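As an illustration of adapting the name part, here is a minimal sketch that also allows apostrophes and hyphens inside names. The extra characters and the sample line are assumptions made for the example, not something taken from the question's data:

```python
import re

# Hypothetical set of characters allowed in a name: letters, dot,
# apostrophe and hyphen (the last two are assumptions for this example).
NAME_CHARS = r"a-zA-Z.'\-"

# Same shape as the pattern above, with the character class swapped in.
pattern = re.compile(rf"(?:[{NAME_CHARS}][{NAME_CHARS}\s]+[{NAME_CHARS}])|\w+")

tokens = pattern.findall("O'Brien-Smith 1 22 3")
print(tokens)
```

Keeping the allowed characters in one variable means the three character classes cannot drift apart when the set changes.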
This version would also include the special characters (keep in mind that they are hardcoded, so you have to add any missing ones to the [>@]+ part of the regex):
for line in lines:
    matches = re.findall(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+|[>@]+', line)
    print(matches)
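The flat token list still has to be sorted into a name, the numbers, and a selection flag. A rough sketch of one way to do that follows; the helper name split_tokens is hypothetical, and it assumes that a selected line is one whose first token starts with '>' and that the name is the first token beginning with a letter:

```python
import re

# Same tokenizing pattern as above, including the hardcoded special characters.
TOKEN_RE = re.compile(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+|[>@]+')

def split_tokens(line):
    """Hypothetical helper: return (selected, name, digits) for one line."""
    tokens = TOKEN_RE.findall(line)
    # Assumption: the selection marker is a leading token made of '>' characters.
    selected = bool(tokens) and tokens[0].startswith('>')
    # Assumption: the name is the first token that begins with a letter.
    name = next((t for t in tokens if t[0].isalpha()), None)
    # Keep only purely numeric tokens and convert them to integers.
    digits = [int(t) for t in tokens if t.isdecimal()]
    return selected, name, digits

for line in ['Connery 3 5 7 @ 4', '>> R. Moore 4 5 67| 5 [']:
    print(split_tokens(line))
```

Other special characters such as '@' are tokenized but then ignored, so they only matter if you decide to inspect them.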
This is better done in several steps.
import re

# get the whitespace at start and end out
lines = inp.strip().split('\n')
for line in lines:
    # for each line, identify the selection mark, the name, and the mess at the end
    # assuming names can't have numbers in them
    match = re.match(r'^(\W+)?([^\d]+?)\s*([^a-zA-Z]+)$', line.strip())
    if match:
        selected_raw, name, numbers_raw = match.groups()
        # now parse the unprocessed bits
        selected = selected_raw is not None
        numbers = re.findall(r'\d+', numbers_raw)
        print(selected, name, numbers)
# output
False Connery ['3', '5', '7', '4']
True R. Moore ['4', '5', '67', '5']
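To get from there to the shape asked for in the question (names, integer digits, and the number of the selected line), the same steps could be wrapped roughly like this. The function name parse_block and the 1-based line numbering are assumptions for illustration, and any leading non-word characters are treated as the selection mark:

```python
import re

# Same three-part pattern as above: optional mark, name, trailing mess.
LINE_RE = re.compile(r'^(\W+)?([^\d]+?)\s*([^a-zA-Z]+)$')

def parse_block(text):
    """Hypothetical helper: return ([(name, digits), ...], selected_line)."""
    records = []
    selected_line = None  # stays None if no line carries a mark
    for number, line in enumerate(text.strip().split('\n'), start=1):
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the expected layout
        marker, name, numbers_raw = match.groups()
        if marker is not None and selected_line is None:
            selected_line = number  # remember the first marked line only
        digits = [int(n) for n in re.findall(r'\d+', numbers_raw)]
        records.append((name, digits))
    return records, selected_line

inp = '''Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 ['''
records, selected_line = parse_block(inp)
print(records)
print(selected_line)
```

Returning a list of records rather than name_01, name_02 and so on keeps the code the same if the input ever has more than two lines.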