
Regex for any number of words before new line

I have some text parsed out of a paragraph that I want to split into a table.

The string looks like this:

["Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string \\n 123 some more text (50% and some more text) \\n"]

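What I want to do is split off the first string of text before the new line exactly as it is, whatever it contains. I first tried [A-Za-z]*\\s*[A-Za-z]*\\s* but soon realised that, because the text in this string is variable, it wasn't going to cut it.

I then want to take the number in the second string, which the following seems to do: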

\d+

Finally, I want to get the percentage in the second string. The following seems to work for that:

\d+(%)+

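I'm planning to use these in a function, but I'm struggling to put together the regex for the first part. I'm also wondering whether the regexes I have for the last two parts are the most efficient.

Update: hopefully this makes it clearer.

Input: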

[' The first chunk of text \\n 123 the stats I want (25% the percentage I want) \\n The Second chunk of text \\n 456 the second stats I want (50% the second percentage I want) \\n The third chunk of text \\n 789 the third stats I want (75% the third percentage) \\n The fourth chunk of text \\n 101 The fourth stats (100% the fourth percentage) \\n']

Desired output: [image not included]

First lines

You can get the first two lines with split:

import re

data = ["Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string \n 123 some more text (50% and some more text) \n"]

first_line, second_line = data[0].split("\n")[:2]
print(first_line)
# Some text unsure how many numbers or if any special charectors etc. But I don't really care I just want all the text in this string

# A run of digits that is not followed by another digit or by %
digit_match = re.search(r'\d+(?![\d%])', second_line)
if digit_match:
    print(digit_match.group())
    # 123

# A run of digits immediately followed by %
percent_match = re.search(r'\d+%', second_line)
if percent_match:
    print(percent_match.group())
    # 50%

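Note that if the percentage is written before the other number, \d+ on its own would match the percentage (without the %). I added a negative lookahead to make sure that the matched number is not followed by a digit or a %.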
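
To see why the lookahead matters, it may help to run both patterns on a line where the percentage comes first. The sample line below is made up for illustration and does not come from the question.

```python
import re

# Hypothetical line where the percentage precedes the plain number.
line = "(50% of the total) 123 some more text"

# Plain \d+ stops at the first run of digits, which is the percentage.
plain = re.search(r'\d+', line)

# With the lookahead, "50" is rejected (followed by %), and so is "5"
# (followed by a digit), so the search moves on to 123.
guarded = re.search(r'\d+(?![\d%])', line)

print(plain.group())    # 50
print(guarded.group())  # 123
```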

Every pair

If you want to keep parsing pairs of lines:

data = [" The first chunk of text \n 123 the stats I want (25% the percentage I want) \n The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n The third chunk of text \n 789 the third stats I want (75% the third percentage) \n The fourth chunk of text \n 101 The fourth stats (100% the fourth percentage) \n"]

import re

lines = data[0].strip().split("\n")

# TODO: Make sure there's an even number of lines
for i in range(0, len(lines), 2):
    first_line, second_line = lines[i:i + 2]

    print(first_line)

    digit_match = re.search(r'\d+(?![\d%])', second_line)
    if digit_match:
        print(digit_match.group())

    percent_match = re.search(r'\d+%', second_line)
    if percent_match:
        print(percent_match.group())

Output:

The first chunk of text 
123
25%
 The Second chunk of text 
456
50%
 The third chunk of text 
789
75%
 The fourth chunk of text 
101
100%
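
Since the question mentions using this in a function and building a table, the loop above could be wrapped up roughly as follows. This is only a sketch: the name `parse_pairs`, the tuple layout, and the use of `None` for a missing value are my own assumptions, and it silently skips a trailing unpaired line rather than raising an error.

```python
import re

NUMBER_RE = re.compile(r'\d+(?![\d%])')  # digits not followed by a digit or %
PERCENT_RE = re.compile(r'\d+%')         # digits immediately followed by %


def parse_pairs(text):
    """Return (label, number, percent) tuples, one per pair of lines."""
    lines = text.strip().split("\n")
    rows = []
    # Stepping to len(lines) - 1 skips a trailing line that has no partner.
    for i in range(0, len(lines) - 1, 2):
        label, stats = lines[i].strip(), lines[i + 1]
        number = NUMBER_RE.search(stats)
        percent = PERCENT_RE.search(stats)
        rows.append((
            label,
            number.group() if number else None,
            percent.group() if percent else None,
        ))
    return rows


sample = (" The first chunk of text \n 123 the stats I want (25% the percentage I want) \n"
          " The Second chunk of text \n 456 the second stats I want (50% the second percentage I want) \n")

for row in parse_pairs(sample):
    print(row)
```

Each tuple is one table row, so the result can be handed to something like `csv.writer` from the standard library.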

