
Extract string before and string after based on a regex match in python

I want to extract the strings before and after a relational operator (>, <, >=, <=, !=, =) with a regex in Python.

Input:

Find me products where sales >= 200000 and profit > 20% by country

Output:

[[sales,>=,200000],[profit,>,20%]]

I am able to get the string before the operator, and the operator itself, using:

\w+(?=\s+([<>]=?|[!=]=))

How do I get the string after the operator as well, in the same list? Any help is much appreciated.

While pyOliv's answer already gives the wanted output, your use of the positive lookahead made me wonder whether the positive lookbehind might also be worth looking into. It might make identifying the pattern after the relational operator more flexible, e.g. if you do not know how many occurrences of relational operators to expect. The matching pattern would be:

(?<=\s[<>!]=\s)[0-9,%]+|(?<=\s[<>=]\s)[0-9,%]+
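As a minimal sketch of how this pattern could be applied, the snippet below runs it through re.findall on the sentence from the question. The variable names are illustrative only and do not come from the original answer.

```python
import re

# Pattern from the answer: a value preceded by a two-character operator,
# or by a one-character operator, each surrounded by one whitespace character.
VALUE_AFTER_OPERATOR = re.compile(
    r"(?<=\s[<>!]=\s)[0-9,%]+|(?<=\s[<>=]\s)[0-9,%]+"
)

text = "Find me products where sales >= 200000 and profit > 20% by country"

# Lookbehinds are not capturing groups, so findall returns the whole matches.
values = VALUE_AFTER_OPERATOR.findall(text)
print(values)  # ['200000', '20%']
```

This only returns the values after the operators; the strings before them would still have to be collected separately, for example with the lookahead from the question.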

The lookbehind has the disadvantage that Python's re module needs to know the length of the pattern it matches beforehand, so quantifiers such as "+", "*" or "?" and alternations ("|") whose branches differ in length will not work within it. This leads to the slightly more cumbersome version, where one lookbehind is used to match the length = 2 operators, and one is used to match the length = 1 operators.
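The fixed-width restriction can be checked directly. The sketch below uses a small helper, compiles, which is a hypothetical name introduced here for illustration; it simply reports whether the standard re module accepts a pattern.

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles, False if the re module rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Variable-width lookbehind: "=?" makes the operator 1 or 2 characters long.
print(compiles(r"(?<=\s[<>]=?\s)[0-9,%]+"))            # False
# An alternation whose branches differ in width is rejected as well.
print(compiles(r"(?<=\s(?:[<>!]=|[<>=])\s)[0-9,%]+"))  # False
# Two fixed-width lookbehinds, alternated outside the lookbehind, are accepted.
print(compiles(r"(?<=\s[<>!]=\s)[0-9,%]+|(?<=\s[<>=]\s)[0-9,%]+"))  # True
```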

You need to give more details about the strings you are looking through. Based on your example:

import re
txt = 'sales >= 200,000 and profit > 20%'
match = re.match(r"(.*) ([<>=!]{1,2}) (.*) .* (.*) ([<>=!]{1,2}) (.*)", txt)
# The pattern has six capture groups, so iterate over groups 1 to 6.
for i in range(1, 7):
    print(match.group(i))

output:

sales
>=
200,000
profit
>
20%

EDIT: For a more general case, this function returns each match as a [left, operator, right] list. Note that \w+ does not match "," or "%", so 200,000 is captured as 200 and 20% as 20:

import re

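def split_txt(txt):
    # Find every "word operator word" fragment, e.g. "sales >= 200".
    lst = re.findall(r"\w+ [<>=!]{1,2} \w+", txt)
    out = []
    for sub_list in lst:
        # Split each fragment into left operand, operator and right operand.
        match = re.match(r"(\w+) ([<>=!]{1,2}) (\w+)", sub_list)
        out.append([match.group(1), match.group(2), match.group(3)])
    return out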


txt = 'bbl sales >= 200,000 and profit > 20% another text id != 25'
a = split_txt(txt)
print(a)

out: [['sales', '>=', '200'], ['profit', '>', '20'], ['id', '!=', '25']]
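If the full values such as 200,000 and 20% should be kept, one possible variation is a single re.findall with three capture groups. This is only a sketch and not part of the original answer: the name split_conditions is hypothetical, and it assumes that values consist of word characters, commas, dots and percent signs.

```python
import re

# Assumption: a value is a run of word characters, ".", "," or "%".
# Two-character operators are listed first so ">=" is not read as ">".
CONDITION = re.compile(r"(\w+)\s*(>=|<=|!=|==|[<>=])\s*([\w.,%]+)")

def split_conditions(text):
    """Return [left, operator, right] for every comparison found in `text`."""
    # With several capture groups, findall yields one tuple per match.
    return [list(groups) for groups in CONDITION.findall(text)]

txt = 'bbl sales >= 200,000 and profit > 20% another text id != 25'
print(split_conditions(txt))
# [['sales', '>=', '200,000'], ['profit', '>', '20%'], ['id', '!=', '25']]
```

Because the value class includes ",", a value followed directly by a comma (for example "200, and") would keep that trailing comma, so the character class may need adjusting for other inputs.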
