
Python, parse string by extracting characters and digits substring

I have a string, produced by a machine learning algorithm, that generally spans multiple lines. At the beginning and at the end there may be some lines containing nothing but whitespace, and in between there should be 2 lines, each containing a word followed by some numbers and (sometimes) other characters.

Something like this:


first_word  3 5 7 @  4
second_word 4 5 67| 5 [


I need to extract the 2 words and the numeric characters.

I can eliminate the empty lines by doing something like:

lines_list = initial_string.split("\n")
for line in lines_list:
    if len(line) > 0 and not line.isspace():
        print(line)

but now I was wondering:

  1. whether there is a more robust, general way to do this
  2. how to parse each of the 2 remaining central lines, extracting the words and digits and discarding any other characters mixed in between the digits

I imagine regular expressions could be useful, but I have never really used them, so I'm struggling a bit at the moment.

I would use re.findall here:

import re

inp = '''first_word  3 5 7 @  4
second_word 4 5 67| 5 ['''
# \w+ matches each run of letters, digits and underscores
matches = re.findall(r'\w+', inp)
print(matches)  # ['first_word', '3', '5', '7', '4', 'second_word', '4', '5', '67', '5']

If you want to process each line separately, simply split the input on line breaks (CR?LF) and use the same approach:

import re

inp = '''first_word  3 5 7 @  4
second_word 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    # Extract the word and the digit groups from this line only
    matches = re.findall(r'\w+', line)
    print(matches)

This prints:

['first_word', '3', '5', '7', '4']
['second_word', '4', '5', '67', '5']
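Note that \w+ also matches letters and underscores, so a stray alphabetic character among the digits would be kept. If that matters, one possible variation is sketched below. It is only a sketch under a few assumptions: the word is taken to be the first whitespace-delimited token of each line, the numbers are non-negative integers, and the function name parse_lines is made up for illustration. It also skips empty and whitespace-only lines, which covers the blank-line filtering from the question.

```python
import re

def parse_lines(text):
    """Return one (word, numbers) pair per non-blank line of text."""
    parsed = []
    for line in text.splitlines():  # splits on \n, \r\n and \r
        tokens = line.split(maxsplit=1)
        if not tokens:  # empty or whitespace-only line
            continue
        word = tokens[0]  # assumption: the word is the first token
        rest = tokens[1] if len(tokens) > 1 else ""
        # Keep only runs of digits; everything else in the line is discarded
        numbers = [int(n) for n in re.findall(r"\d+", rest)]
        parsed.append((word, numbers))
    return parsed

sample = "\n   \nfirst_word  3 5 7 @  4\nsecond_word 4 5 67| 5 [\n\n"
print(parse_lines(sample))
# [('first_word', [3, 5, 7, 4]), ('second_word', [4, 5, 67, 5])]
```

Searching for \d+ only in the part of the line after the word means digits inside the word itself are not picked up as separate numbers.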

