Perl regular expression help (parse out column)

I'm stuck here. Not sure why my regex won't work. I have a pipe-delimited text file with a series of columns. I need to extract the 3rd column.

File:

A|B|C|D|E|F|G|H|I
2011-03-03 00:00:00.0|1|60510271|254735|27751|BBB|1|-0.1619023623|-0.009865904
2011-03-03 00:00:00.0|1|60510270|254735|27751|B|3|-0.0064786612|-0.0063739185
2011-03-03 00:00:00.0|1|60510269|254735|27751|B|3|-0.0084998226|-0.009244384

Regular expression:

$> head foo | perl -pi -e 's/^(.*)\|(.*)\|(.*)\|(.*)$/$3/g'

Output

-0.1619023623
-0.0064786612
-0.0084998226

Clearly, that is not the correct column.

Thoughts?

Normally, it's easier and simpler (KISS) NOT to use a regex for file formats that have structured delimiters. Just split the string on the "|" delimiter and take the 3rd field.

awk -F"|" '{print $3}' file

With Ruby (1.9+):

ruby -F"\|" -ane 'puts $F[2]' file

With Perl, it's similar to the Ruby one-liner above.

perl -F"\|" -ane 'print $F[2]."\n"' file

.* will by default match as much as it can, so your RE is picking out the last three columns (and everything before) rather than the first three (and everything after). You can avoid this in (at least) two ways: (1) instead of .* , look for [^|]* , or (2) make your repetition operators non-greedy: .*? instead of .* .

(Or you could explicitly split the string instead of matching the whole thing with a single RE. You might want to try both approaches and see which performs better, if it matters. Splitting is likely to give longer but clearer code.)

How about using a real parser instead of hacking together a regex? Text::CSV should do the job.

my $csv = Text::CSV->new({sep_char => "|"});

You need to make the patterns non-greedy, so:

's/^(.*?)\|(.*?)\|(.*?)\|(.*)$/$3/g'

My first thought was Text::CSV (mentioned by Matt B), but if the data looks like the example, I'd say split is the right choice.

Untested:

$> head foo | perl -le 'while (<>) { print +(split /\|/)[2]; }'

If you really want a regex I would use something like this:

s{^ [^\|]* \| [^\|]* \| ([^\|]*) \| .*$}{$1}gx;

(?<=\|)\d{8}

Maybe this would work: (?<=\|) is a positive lookbehind for a | character, and \d{8} then matches the 8 digits that follow it.
