
What regex will match these lines?

I'm not sure if this is the right place to post this, and sorry for the title, but I am parsing a PDF into a CSV, and because the format is erratic I've decided to use a regex for each line.

I've added commas (,) to denote where the matches should be; if you take them out, you get the raw string. The first line is the standard form, and the others show some of the ways missing columns can appear. Looking at the regex further down is also a good hint.

It needs to match:

12,      16:00:30,  P,  14,     ______________  ABC12345678,          N,     
JOE B'obby,                    MY COMPANY-23 / NAME,                  23,  2


212,      14:00:30,,    212,     ______________  ABC12345678,          NCh,     
BOB Joe Joe,                    MY NAME,                  300,    12,      


2,      13:00:30,  P,  2,     ______________  ABC12345678,,          BOB 
Joe °,,, 20    


3,      15:15:00,  P,  132,     ______________  ABC12345678,,          PHO
Guy Guy °,,,,    

This is what I have so far:

    sl_re = r'(\d+)' \
        r'[ ]+(\d+:\d+:\d+)' \
        r'[ ]+([P]*)' \
        r'[ ]+(\d+)' \
        r'[ ]+([_ ]+[A-Z]+\d+)' \
        r'[ ]+([A-Za-z]{,3}|[ ])' \
        r'[ ]+([\w\']+[ ][\w\'°]+[ ]{,1}[\w\'°]*[ ]{,1}[\w\'°]*)'\
        r'[ ]*([\w\-/ ]*|[ ])' \
        r'[ ]*(\d*|[ ])' \
        r'[ ]*(\d*$)'     

It matches everything up to the last three groups perfectly, but the third-to-last group is too greedy and swallows everything that remains.

Thanks to some help from @tripleee, I figured out a way to solve it. The fix, as he suggested, was simply to be more explicit.

Because there are many optional and unforeseeable group combinations that require * (0 or more), it was important to make those quantifiers non-greedy where possible. I use greedy matching only where I want to consume everything possible (the spaces between the groups) and non-greedy matching where I want a group to stop as soon as the rest of the pattern can match. Very basic, but it was a good learning opportunity!
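
To make the greedy versus non-greedy difference concrete, here is a minimal sketch. It uses a simplified tail of a line and two cut-down patterns that are my own illustration, not the full expression; the names `line`, `greedy`, and `lazy` are hypothetical. The only difference between the two patterns is the `?` after the `*` quantifiers of the first two groups.

```python
import re

# Simplified tail of a data line: a name followed by two numeric columns.
# (Illustrative input, not taken verbatim from the PDF.)
line = "MY NAME                  300    12"

# Greedy: the first group may contain word characters (which include digits)
# and spaces, so it consumes the whole line and leaves nothing for the rest.
greedy = re.compile(r'([\w ]*)[ ]*(\d*)[ ]*(\d*)$')

# Non-greedy: the first two groups take as little as possible, so each one
# stops as soon as the remainder of the pattern can match up to the end.
lazy = re.compile(r'([\w ]*?)[ ]*(\d*?)[ ]*(\d*)$')

greedy_groups = greedy.match(line).groups()
lazy_groups = lazy.match(line).groups()

print(greedy_groups)
print(lazy_groups)
```

The separators between groups stay greedy (`[ ]*`) on purpose, so that padding spaces end up between the captures rather than inside them.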

Only the last few lines changed, along with a few characters that I found were needed through test cases:

    sl_re = r'([\d\.]+)' \
        r'[ ]+(\d+:\d+:\d+)' \
        r'[ ]+([P]*)' \
        r'[ ]+(\d+)' \
        r'[ ]+([_ ]+[A-Z]+\d+)' \
        r'[ ]+([NWCSLh]{,3}|[ ])' \
        r'[ ]+([\w\'\-]+[ ]*?[\w©\'\-°]+[ ]*?[\w\'\-°]*' \
        r'[ ]*?[\w\'\-°]*[ ]*?[\w\'\-°]*)' \
        r'[ ]*([A-Z0-9,\'\-\/ \.]*?)' \
        r'[ ]*([\d\-]*?)' \
        r'[ ]*([\d\-]*$)'
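
For completeness, here is a sketch of how the pattern above might be applied to one line. The helper `parse_line` and the variable `raw_line` are hypothetical names, the group comments are only my reading of the sample data, and `raw_line` is the first sample from the question with the marker commas removed and joined onto a single line (I am assuming the line break in the post is just wrapping).

```python
import re

# The pattern from the answer, written inside parentheses so each piece
# can carry a comment. Adjacent string literals are concatenated.
SL_RE = re.compile(
    r'([\d\.]+)'                                      # 1: leading number
    r'[ ]+(\d+:\d+:\d+)'                              # 2: time
    r'[ ]+([P]*)'                                     # 3: optional P flag
    r'[ ]+(\d+)'                                      # 4: second number
    r'[ ]+([_ ]+[A-Z]+\d+)'                           # 5: underscores + code
    r'[ ]+([NWCSLh]{,3}|[ ])'                         # 6: up to three flag letters
    r'[ ]+([\w\'\-]+[ ]*?[\w©\'\-°]+[ ]*?[\w\'\-°]*'  # 7: name, up to
    r'[ ]*?[\w\'\-°]*[ ]*?[\w\'\-°]*)'                #    five words
    r'[ ]*([A-Z0-9,\'\-\/ \.]*?)'                     # 8: optional company (lazy)
    r'[ ]*([\d\-]*?)'                                 # 9: optional number (lazy)
    r'[ ]*([\d\-]*$)'                                 # 10: optional last number
)


def parse_line(line):
    """Return the ten captured fields as a list, or None if the line does not match."""
    match = SL_RE.match(line)
    return list(match.groups()) if match else None


# First sample line from the question, without the marker commas
# (assumed to be one physical line in the extracted PDF text).
raw_line = (
    "12      16:00:30  P  14     ______________  ABC12345678          N     "
    "JOE B'obby                    MY COMPANY-23 / NAME                  23  2"
)

fields = parse_line(raw_line)
print(fields)
```

Each resulting list can then be passed to `csv.writer(...).writerow(...)` to build the CSV. Lines that return None are worth logging, since they usually reveal a column combination the pattern does not cover yet.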
