
What regex will match these lines?

I'm not sure if this is the right place to post this, and sorry for the title, but I am parsing a PDF to a CSV and I've decided to go with a regex for each line due to the erratic format.

I've added commas to mark where the group boundaries should be; take them out and you have the raw string. The first line is the standard format, and the others show some of the ways missing columns can appear. Looking at the regex below is also a good hint.

It needs to match:

12,      16:00:30,  P,  14,     ______________  ABC12345678,          N,     
JOE B'obby,                    MY COMPANY-23 / NAME,                  23,  2


212,      14:00:30,,    212,     ______________  ABC12345678,          NCh,     
BOB Joe Joe,                    MY NAME,                  300,    12,      


2,      13:00:30,  P,  2,     ______________  ABC12345678,,          BOB 
Joe °,,, 20    


3,      15:15:00,  P,  132,     ______________  ABC12345678,,          PHO
Guy Guy °,,,,    

This is what I have so far.

    sl_re = r'(\d+)' \
        r'[ ]+(\d+:\d+:\d+)' \
        r'[ ]+([P]*)' \
        r'[ ]+(\d+)' \
        r'[ ]+([_ ]+[A-Z]+\d+)' \
        r'[ ]+([A-Za-z]{,3}|[ ])' \
        r'[ ]+([\w\']+[ ][\w\'°]+[ ]{,1}[\w\'°]*[ ]{,1}[\w\'°]*)'\
        r'[ ]*([\w\-/ ]*|[ ])' \
        r'[ ]*(\d*|[ ])' \
        r'[ ]*(\d*$)'     

It matches everything up to the last three groups perfectly, but the third-to-last group is too greedy and matches everything that remains.

Thanks to some help from @tripleee, I figured out a way to solve it. The fix, as he suggested, was just to be more explicit.

Because there are a lot of optional and unforeseeable group combinations that require * (0 or more), it was important to make those groups non-greedy where possible. I use greedy quantifiers only where I want them to match everything they possibly can (the spaces between the groups) and non-greedy ones where I want the match to stop at the next group. Very basic, but it was a good learning opportunity!
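To illustrate the difference, here is a minimal sketch on a made-up row tail (a company name followed by two numbers). The string and the two short patterns are simplified stand-ins chosen for the example, not the full expression from the question.

```python
import re

# Made-up tail of a row: company name, then two numeric columns.
tail = "MY NAME   300  12"

# Greedy: [\w ]* also matches digits and spaces, so the first group
# swallows the whole string and the two number groups come back empty.
greedy = re.match(r'([\w ]*)[ ]*(\d*)[ ]*(\d*)$', tail)

# Non-greedy: the first two groups take as little as they can, so the
# final (\d*)$ pins the end and each number lands in its own group.
lazy = re.match(r'([\w ]*?)[ ]*(\d*?)[ ]*(\d*)$', tail)

print(greedy.groups())  # ('MY NAME   300  12', '', '')
print(lazy.groups())    # ('MY NAME', '300', '12')
```

The spaces between the groups stay greedy in both versions; only the capturing groups change.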

Only the last few lines changed, plus a few characters that test cases showed were needed:

    sl_re = (
        r'([\d\.]+)'
        r'[ ]+(\d+:\d+:\d+)'
        r'[ ]+([P]*)'
        r'[ ]+(\d+)'
        r'[ ]+([_ ]+[A-Z]+\d+)'
        r'[ ]+([NWCSLh]{,3}|[ ])'
        r'[ ]+([\w\'\-]+[ ]*?[\w©\'\-°]+[ ]*?[\w\'\-°]*'
        r'[ ]*?[\w\'\-°]*[ ]*?[\w\'\-°]*)'
        # The last three groups are non-greedy so each stops at the next one.
        r'[ ]*([A-Z0-9,\'\-\/ \.]*?)'
        r'[ ]*([\d\-]*?)'
        r'[ ]*([\d\-]*$)'
    )
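For anyone who wants to try it, here is a rough sketch of applying the pattern to one raw line and writing the groups out as a CSV row. The names `SL_RE`, `parse_line` and `sample` are made up for the example, the sample is the first line from the question with the marker commas removed and joined onto one line, and stripping trailing whitespace before matching is my assumption about the input.

```python
import csv
import io
import re

# Same pattern as above, compiled once (SL_RE is a hypothetical name).
SL_RE = re.compile(
    r'([\d\.]+)'
    r'[ ]+(\d+:\d+:\d+)'
    r'[ ]+([P]*)'
    r'[ ]+(\d+)'
    r'[ ]+([_ ]+[A-Z]+\d+)'
    r'[ ]+([NWCSLh]{,3}|[ ])'
    r'[ ]+([\w\'\-]+[ ]*?[\w©\'\-°]+[ ]*?[\w\'\-°]*'
    r'[ ]*?[\w\'\-°]*[ ]*?[\w\'\-°]*)'
    r'[ ]*([A-Z0-9,\'\-\/ \.]*?)'
    r'[ ]*([\d\-]*?)'
    r'[ ]*([\d\-]*$)'
)


def parse_line(line):
    """Return the ten captured fields as a list, or None if there is no match."""
    # Assumption: trailing whitespace is noise, so strip it before the $ anchor.
    m = SL_RE.match(line.strip())
    return list(m.groups()) if m else None


# First sample line from the question, without the marker commas.
sample = ("12      16:00:30  P  14     ______________  ABC12345678"
          "          N     JOE B'obby                    "
          "MY COMPANY-23 / NAME                  23  2")

out = io.StringIO()
writer = csv.writer(out)
fields = parse_line(sample)
if fields is not None:
    writer.writerow(fields)
print(out.getvalue())
```

Lines that do not match come back as `None`, so they can be logged and checked by hand instead of being written out half-parsed.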
