
Regex for any number of words before new line

I have parsed some paragraph text that I want to split up and insert into a table.

The string looks like:

["Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string \\n 123 some more text (50% and some more text) \\n"]

What I want to do is split out the first string of text before the new line, as it is, whatever that might be. I started by trying [A-Za-z]*\\s*[A-Za-z]*\\s* but soon realised that was not going to cut it, as the text in this string is variable.

I then want to take the numbers in the second string, which the following seems to do:

\d+

Then finally I want to get the percentage in the second string, which the following seems to work for:

\d+(%)+

I'm planning to use these in a function, but I'm struggling to write the regex for the first part. I'm also wondering whether the regexes I have for the other two parts are the most efficient.

Update: Hopefully this makes it a bit clearer.

Input:

[' The first chunk of text \\n 123 the stats I want (25% the percentage I want) \\n The Second chunk of text \\n 456 the second stats I want (50% the second percentage I want) \\n The third chunk of text \\n 789 the third stats I want (75% the third percentage) \\n The fourth chunk of text \\n 101 The fourth stats (100% the fourth percentage) \\n']

Desired output: [image not available]

First two lines

You can use split to get the first two lines:

import re

data = ["Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string \n 123 some more text (50% and some more text) \n"]

# Split on newlines and keep only the first two lines.
first_line, second_line = data[0].split("\n")[:2]
print(first_line)
# Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string

# First run of digits that is not followed by another digit or a percent sign.
digit_match = re.search(r'\d+(?![\d%])', second_line)
if digit_match:
    print(digit_match.group())
    # 123

# Digits immediately followed by a percent sign.
percent_match = re.search(r'\d+%', second_line)
if percent_match:
    print(percent_match.group())
    # 50%

Note that if the percentage is written before the other number, a plain \d+ will match the digits of the percentage (without the %). I added a negative lookahead to make sure that no digit or % follows the matched number.
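To make the effect of the lookahead concrete, here is a minimal sketch. The sample line is hypothetical (it is not from the question) and simply puts the percentage ahead of the count.

```python
import re

# Hypothetical sample line in which the percentage comes before the count.
line = " (50% of the total) 123 some more text "

# Plain \d+ stops at the first run of digits, which here belongs to the percentage.
naive = re.search(r"\d+", line).group()

# The negative lookahead rejects any digit run followed by another digit or "%".
guarded = re.search(r"\d+(?![\d%])", line).group()

print(naive)    # 50
print(guarded)  # 123
```

The [\d%] in the lookahead matters: with only % there, the regex could backtrack and return the "5" of "50%".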

Every pair

If you want to keep parsing pairs of lines:

import re

data = [" The first chunk of text \n 123 the stats I want (25% the percentage I want) \n The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n The third chunk of text \n 789 the third stats I want (75% the third percentage) \n The fourth chunk of text \n 101 The fourth stats (100% the fourth percentage) \n"]

# strip() removes the leading space and the trailing newline before splitting.
lines = data[0].strip().split("\n")

# TODO: Make sure there's an even number of lines
# (with an odd count, the unpacking below raises ValueError on the last line).
for i in range(0, len(lines), 2):
    first_line, second_line = lines[i:i + 2]

    print(first_line)

    digit_match = re.search(r'\d+(?![\d%])', second_line)
    if digit_match:
        print(digit_match.group())

    percent_match = re.search(r'\d+%', second_line)
    if percent_match:
        print(percent_match.group())

It outputs:

The first chunk of text 
123
25%
 The Second chunk of text 
456
50%
 The third chunk of text 
789
75%
 The fourth chunk of text 
101
100%
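
Since the question mentions wrapping this in a function and inserting the results into a table, the same approach could be sketched as below. The names parse_rows, NUMBER_RE and PERCENT_RE are made up for illustration, and the sketch assumes the text contains real newline characters; if the data holds a literal backslash followed by n, split on "\\n" instead.

```python
import re

# Compiled once so the function can be called repeatedly with the same patterns.
NUMBER_RE = re.compile(r"\d+(?![\d%])")
PERCENT_RE = re.compile(r"\d+%")


def parse_rows(text):
    """Return (label, number, percent) tuples, one per pair of lines in text."""
    lines = [line.strip() for line in text.strip().split("\n")]
    rows = []
    # zip() over even- and odd-indexed lines silently drops a trailing
    # unpaired line instead of raising an error.
    for label, stats in zip(lines[0::2], lines[1::2]):
        number = NUMBER_RE.search(stats)
        percent = PERCENT_RE.search(stats)
        rows.append((
            label,
            number.group() if number else None,
            percent.group() if percent else None,
        ))
    return rows


sample = (" The first chunk of text \n 123 the stats I want (25% the percentage I want) \n"
          " The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n")
for row in parse_rows(sample):
    print(row)
# ('The first chunk of text', '123', '25%')
# ('The Second chunk of text', '456', '50%')
```

Each tuple can then be passed to whatever inserts a row into the table. Missing values come back as None rather than being skipped, so every row keeps three columns.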
