Unix join two files with regular expressions using awk

I have one file (lookup.txt) containing a lookup table: a list of regular expressions, each with corresponding data (a category and a period), e.g.:

INTERNODE|household/bills/broadband|monthly
ORIGIN ENERGY|household/bills/electricity|quarterly
TELSTRA.*BILL|household/bills/phone|quarterly
OPTUS|household/bills/mobile|quarterly
SKYPE|household/bills/skype|non-periodic

I have another file (data.txt) which contains a list of expenses, e.g.:

2009-10-31,cc,-39.9,INTERNODE BROADBAND
2009-10-31,cc,-50,ORIGIN ENERGY 543546
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES
2009-10-31,cc,-90,TELSTRA MOBILE BILL
2009-11-02,cc,-320,TELSTRA HOME BILL
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN

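I want to join the two files: whenever the 4th column of a row in data.txt matches a regular expression from the first column of lookup.txt, the category and period of that lookup entry should be appended to the row. Rows with no match should be output unchanged.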

So the output would be:

2009-10-31,cc,-39.9,INTERNODE BROADBAND,household/bills/broadband,monthly
2009-10-31,cc,-50,ORIGIN ENERGY 543546,household/bills/electricity,quarterly
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES,household/bills/broadband,monthly
2009-10-31,cc,-90,TELSTRA MOBILE BILL,household/bills/phone,quarterly
2009-11-02,cc,-320,TELSTRA HOME BILL,household/bills/phone,quarterly
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN

I've achieved this using a bash loop that iterates over the lookup table, greps for each pattern and appends the extra columns with sed, but it is very slow. So I was wondering if there is a faster way of doing this, say using awk.

Any help would be appreciated.

$ awk -F'|' 'FNR==NR{a[$1]=$2","$3;next}{split($0,b,",");for(i in a){if(b[4]~i){print $0","a[i];next}}}1' lookup.txt data.txt
2009-10-31,cc,-39.9,INTERNODE BROADBAND,household/bills/broadband,monthly
2009-10-31,cc,-50,ORIGIN ENERGY 543546,household/bills/electricity,quarterly
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES,household/bills/broadband,monthly
2009-10-31,cc,-90,TELSTRA MOBILE BILL,household/bills/phone,quarterly
2009-11-02,cc,-320,TELSTRA HOME BILL,household/bills/phone,quarterly
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN

You can do it in Python:

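#!/usr/bin/env python3
import csv
import re

# Load the lookup table as (compiled regex, extra fields) pairs, in file order.
lookup = []
with open('lookup.txt', newline='') as f:
    for rec in csv.reader(f, delimiter='|'):
        lookup.append((re.compile(rec[0]), rec[1:]))

# Append the fields of the first matching lookup entry to each expense row.
with open('data.txt', newline='') as f:
    for rec in csv.reader(f, delimiter=','):
        for rexp, fields in lookup:
            # search() matches anywhere in the field, like awk's ~ operator
            if rexp.search(rec[3]):
                rec.extend(fields)
                break
        print(','.join(rec))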

For your files lookup.txt and data.txt it returns the following in less than 0.3s:

2009-10-31,cc,-39.9,INTERNODE BROADBAND,household/bills/broadband,monthly
2009-10-31,cc,-50,ORIGIN ENERGY 543546,household/bills/electricity,quarterly
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES,household/bills/broadband,monthly
2009-10-31,cc,-90,TELSTRA MOBILE BILL,household/bills/phone,quarterly
2009-11-02,cc,-320,TELSTRA HOME BILL,household/bills/phone,quarterly
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN
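
One detail worth checking when doing the lookup in Python: re.match only tests the pattern at the start of the string, whereas re.search (like awk's ~ operator) looks anywhere in it. The sample data gives the same result either way, because every description begins with its pattern. A small sketch of the difference follows; the "EFTPOS" description is made up for illustration and does not come from data.txt.

```python
import re

pattern = re.compile(r"TELSTRA.*BILL")

# Description that begins with the pattern (taken from the sample data).
starts_with = "TELSTRA MOBILE BILL"
# Hypothetical description with some text in front of the pattern.
has_prefix = "EFTPOS TELSTRA HOME BILL"

match_at_start = pattern.match(starts_with) is not None
match_with_prefix = pattern.match(has_prefix) is not None
search_with_prefix = pattern.search(has_prefix) is not None

print(match_at_start, match_with_prefix, search_with_prefix)
# prints: True False True
```

If the descriptions in the real data can carry a prefix, search is probably the safer choice.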

You can do it in Perl. The advantage of Perl (or Python) is that they have libraries for dealing with CSV files. Your examples are simple enough, but what happens if you have a comma inside double quotes? Or UTF-8 data?
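
To illustrate the quoted-comma problem, here is a minimal sketch in Python (its csv module plays the same role as the Perl CSV libraries). The record is hypothetical, not taken from data.txt: its description contains a comma and is therefore quoted.

```python
import csv
import io

# Hypothetical record: the description itself contains a comma, so it is quoted.
line = '2009-11-05,cc,-15,"SMITH, JONES AND CO"\n'

# Naive splitting on commas cuts the quoted description in two.
naive = line.rstrip("\n").split(",")

# A CSV parser honours the quotes and keeps the description as one field.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive[3])    # prints: 5 "SMITH
print(len(parsed), parsed[3])  # prints: 4 SMITH, JONES AND CO
```

A plain field split, as done with awk -F, or split on ",", would look at the wrong text in column 4 for such a row.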

The standard Perl library for this is Text::CSV_XS. However, it's a bit verbose, and I prefer Parse::CSV, which is a wrapper around Text::CSV_XS.

#!/usr/bin/perl

use strict;
use warnings;
use Parse::CSV;

my %lookup;
my $l = Parse::CSV->new(file => "lookup.txt", sep_char => '|');
while (my $row = $l->fetch) {
   my $key = qr/$row->[0]/;
   $lookup{$key} = [ @{$row}[1 .. $#{$row}] ];  # keep every field after the regex
}

my $d = Parse::CSV->new(file => "data.txt");
while (my $row = $d->fetch) {
   foreach my $regex (keys %lookup) {
      if ($row->[3] =~ $regex) {
         push @$row, @{$lookup{$regex}};
         last;
      }
   }
   print join(",", @$row), "\n";
}

If you didn't have the regexes, you could use join. How many regexes does lookup.txt have? If it's just that one, you could expand it by hand and drop the regex feature.

Awk is really designed to process a single stream of data one record at a time, so it isn't the right tool for this job. It would be a ten-minute exercise in Perl or another language that's more oriented toward general-purpose programming.

If you're bent on doing it all in awk, write one script to generate a second awk script from your lookup file that processes the data, then run the second script.
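
The generate-then-run idea can be sketched roughly as follows. This is only an illustration under assumptions: the generator is written in Python rather than awk, and the helper names (awk_regex, awk_string, build_awk_program) are invented here. Each lookup line becomes one awk rule that tests the 4th field, and a final catch-all rule prints unmatched rows unchanged.

```python
def awk_regex(pattern):
    """Escape '/' so the pattern can sit inside an awk /.../ literal."""
    return pattern.replace("/", "\\/")


def awk_string(text):
    """Escape backslashes and double quotes for an awk string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_awk_program(lookup_lines):
    """Turn 'REGEX|field|field...' lines into the text of an awk program."""
    rules = ['BEGIN { FS = "," }']
    for line in lookup_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        regex, *fields = line.split("|")
        suffix = awk_string("," + ",".join(fields))
        # One rule per lookup entry; 'next' stops at the first match.
        rules.append('$4 ~ /%s/ { print $0 "%s"; next }'
                     % (awk_regex(regex), suffix))
    rules.append("{ print }")  # no rule matched: print the row as-is
    return "\n".join(rules) + "\n"


program = build_awk_program([
    "INTERNODE|household/bills/broadband|monthly",
    "TELSTRA.*BILL|household/bills/phone|quarterly",
])
print(program)
```

For the two lookup lines above, this prints:

BEGIN { FS = "," }
$4 ~ /INTERNODE/ { print $0 ",household/bills/broadband,monthly"; next }
$4 ~ /TELSTRA.*BILL/ { print $0 ",household/bills/phone,quarterly"; next }
{ print }

Saved to a file (say generated.awk, a name chosen here), it could then be run as awk -f generated.awk data.txt. Because awk tries rules in the order they are written, the first matching lookup entry wins, in lookup-file order.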
