Python regular expression with groups to match the text before an amount

I am trying to write a Python regular expression that captures multiple values from a few columns in a DataFrame. The regular expression below attempts to do this. The string has 4 parts:

group 1: Date - month and day
group 2: Date - month and day
group 3: description text before amount i.e. group 4
group 4: amount  - this group is optional

Some peculiar conditions for group 3: (1) the text itself may contain characters such as "-" and "$", so we cannot use "-" or "$" as the boundary of the text; (2) the text (group 3) is sometimes not followed by an amount; (3) the whitespace between groups 3 and 4 is optional.

Below is the Python function, which takes a DataFrame with 4 columns c1, c2, c3, c4 and adds the columns dt, txt and amt to it after processing.

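import re
import pandas as pd

def parse_values(args):
    # args is one row of the DataFrame (a pandas Series); c1 holds the raw text
    re_1=r'(([JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC]{3}\s{0,}[\d]{1,2})\s{0,}){2}(.*[\s]|.*[^\$]|.*[^-]){1}([-+]?\$[\d|,]+(?:\.\d+)?)?'
    m = re.match(re_1, args['c1'])
    # leave the row untouched when the pattern does not match at the start
    if m is None:
        return args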
    args['dt']=m.group(1)
    args['txt']=m.group(3)
    args['amt']=m.group(4)
    if m.group(4) is None:
        if pd.isnull(args['c3']):
            args['amt']=args.c2
        else:
            args['amt']=args.c3
    return args

To test the results, I have the 6 rows below, each of which should come back with a properly formatted amt column.

tt=[{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL ','c2':'$16.84'},
        {'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL','c2':'$16.84'},
        {'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70'},
        {'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK-$2070.70'},
        {'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70'},
        {'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK $80,00,7770.70'}
    ]
t=pd.DataFrame(tt,columns=['c1','c2','c3','c4'])
t=t.apply(parse_values,1)
t

However, due to the error in my regular expression re_1, the amt and txt columns are not parsed properly: they return NaN or miss some words (as depicted in some rows of the output image below).

[image: output DataFrame]

How about this:

(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)

As seen at regex101.com.
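For a quick check outside regex101, the pattern can be run against the sample strings from the question. The sketch below is only illustrative: `text_and_amount` is a hypothetical helper, and `VARIANT` is not part of the answer. It is one possible adaptation that assumes the amount, when present, is the last thing in the string, so that group 4 can be optional as the question asks.

```python
import re

# The pattern as posted above, as a raw string: the amount (group 4) is required.
POSTED = re.compile(
    r'(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}'
    r'(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)'
)

# Hypothetical variant (an assumption, not from the answer): one capture group
# per date, an optional amount, and an end-of-string anchor so the lazy text
# group has to stretch up to the trailing amount (or to the end if none).
MONTH = r'(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)'
DATE = MONTH + r'\s*\d{1,2}'
VARIANT = re.compile(
    r'(' + DATE + r')\s*(' + DATE + r')\s*(.*?)\s*([-+]?\$[\d,]+(?:\.\d+)?)?\s*$'
)

def text_and_amount(pattern, row):
    """Return (txt, amt) from groups 3 and 4, or None if the row does not match."""
    m = pattern.match(row)
    return None if m is None else (m.group(3), m.group(4))

row_with_amount = 'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70'
row_without_amount = 'OCT 7 OCT 8 HURRY CURRY THORNHILL '

posted_with = text_and_amount(POSTED, row_with_amount)
posted_without = text_and_amount(POSTED, row_without_amount)
variant_with = text_and_amount(VARIANT, row_with_amount)
variant_without = text_and_amount(VARIANT, row_without_amount)
```

With the posted pattern, the row that has no amount does not match at all, so `parse_values` would return it unchanged. With the variant, the same row matches with group 4 set to `None`, which lets the fallback to c2 or c3 in `parse_values` run.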

Explanation:

First off, I've shortened the regex by changing a few minor details, such as using \s* instead of \s{0,}, which mean exactly the same thing.

The whole [JAN|...|DEC] part was using a character class, i.e. [], which only matches a single character from the entire set. As a result, [JAN|...|DEC]{3} accepts any three of those characters (including "|"), not just month names. Alternation inside a non-capturing group, (?:JAN|...|DEC), is the correct way of choosing between multi-letter alternatives, which in your case are months.
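The difference is easy to see on a string that is not a month at all. This is a minimal sketch; the variable names are only for illustration.

```python
import re

MONTH_NAMES = 'JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC'

# Character class: any 3 characters drawn from the letters above (or '|').
char_class = re.compile('[' + MONTH_NAMES + ']{3}')
# Alternation in a non-capturing group: exactly one of the listed month names.
alternation = re.compile('(?:' + MONTH_NAMES + ')')

class_accepts_bad = char_class.fullmatch('BAD') is not None    # True: B, A, D are all in the set
class_accepts_pipe = char_class.fullmatch('A|B') is not None   # True: '|' is a literal inside []
alt_accepts_bad = alternation.fullmatch('BAD') is not None     # False: 'BAD' is not a month
alt_accepts_mar = alternation.fullmatch('MAR') is not None     # True
```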

The meat of the regex: LOOKAHEADS

(?=[\-$]) tells the regex that the text captured by the (.*?) before it must end at a position followed by a dash or a dollar sign. Because (.*?) is lazy, it takes as little as it can and grows only until that position is followed by a complete amount, which is why dashes inside the description are skipped. Lookaheads don't actually consume whatever they're looking for; they just assert that the lookahead's pattern must follow that position.
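A small sketch of both points, using made-up strings shaped like the sample rows: the lookahead consumes nothing, and it is the amount pattern after it that pushes the lazy group past dashes inside the description.

```python
import re

# 1) The lookahead is zero-width: the match ends just before the dash.
text = 'EAST YORK -$20'
m = re.match(r'(.*?)\s*(?=[\-$])', text)
captured = m.group(1)        # 'EAST YORK'
consumed = m.group(0)        # 'EAST YORK ' (the \s* took the space, the lookahead took nothing)
next_char = text[m.end()]    # '-' is still unconsumed

# 2) On its own, the lazy group stops at the first dash...
text2 = 'INC - EAST YORK -$20'
first_dash = re.match(r'(.*?)\s*(?=[\-$])', text2).group(1)

# ...but once an amount must follow, it keeps growing until one does.
with_amount = re.match(r'(.*?)\s*(?=[\-$])([-+]?\$[\d,]+(?:\.\d+)?)', text2)
description = with_amount.group(1)
amount = with_amount.group(2)
```

Here `first_dash` is 'INC', while `description` is 'INC - EAST YORK' and `amount` is '-$20'.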
