
Optimize grep operation on big files

I have two files, list.txt and purchaselist.txt, which are fairly large, and I am trying to get the latest purchase details (there are duplicates in purchaselist.txt).

Let's say the file contents are as follows:

list.txt

1111
2222
3333

purchaselist.txt

0001 1111 210.00 abcd 10 A 151234 181234 .... 
0011 1111 300.00 abcd 10 A 151000 181222 ....
0022 2222 110.00 abcd 10 E 151111 181000 ....
0099 2222 200.00 abcd 10 A 151222 181999 ....
0033 3333 110.00 abcd 10 A 151000 181222 ....
0044 0044 500.00 abcd 10 A 151999 181333 ....
8899 4444 800.00 abcd 10 A 153333 181777 ....

I am doing this using grep and a simple while loop. Here is my command:

while read line; do tac purchaselist.txt | grep -m1 $line; done < list.txt >> result.txt

My expected output, which I am already getting, looks like this:

0011 1111 300.00 abcd 10 A 151000 181222 ....
0099 2222 200.00 abcd 10 A 151222 181999 ....
0033 3333 110.00 abcd 10 A 151000 181222 ....

The above output is derived by picking the latest row from purchaselist.txt for each key, which is why I used tac. In my real data, the values from list.txt appear as column number 18 of purchaselist.txt. The problem is that the files are huge: list.txt contains 580k records, and I am looking each of them up in purchaselist.txt, which has ~1.7M records. The above script has been running for almost 20 hours and hasn't reached the halfway point. How can I optimize the processing time here?

The script is slow because for every word in list.txt it reads the whole purchaselist.txt, so that file ends up being read 580k times. In addition, bash loops are slow over large iteration counts.
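The quadratic cost disappears if each file is read only once: load the wanted keys into an associative array, then make a single pass over purchaselist.txt, remembering the last line seen per key. A minimal sketch on the sample data (keying on field 2 here; on the real data described in the question this would be $18):

```shell
# Recreate the sample inputs from the question
# (the real files are the 580k/1.7M-line originals).
printf '1111\n2222\n3333\n' > list.txt
cat > purchaselist.txt <<'EOF'
0001 1111 210.00 abcd 10 A 151234 181234 ....
0011 1111 300.00 abcd 10 A 151000 181222 ....
0022 2222 110.00 abcd 10 E 151111 181000 ....
0099 2222 200.00 abcd 10 A 151222 181999 ....
0033 3333 110.00 abcd 10 A 151000 181222 ....
0044 0044 500.00 abcd 10 A 151999 181333 ....
8899 4444 800.00 abcd 10 A 153333 181777 ....
EOF

# One pass over each file: remember the last full line seen per key ($2).
# "for (k in last)" iterates in unspecified order, so sort for a stable result.
awk 'NR==FNR {want[$1]; next}
     $2 in want {last[$2] = $0}
     END {for (k in last) print last[k]}' list.txt purchaselist.txt |
  sort > result.txt
```

This scans purchaselist.txt once instead of 580k times, at the cost of holding one line per wanted key in memory.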

If other methods are acceptable, you can use datamash:

datamash -t ' ' -g 1 last 2 < purchaselist.txt
  • -t ' ' — set the field delimiter to a space
  • -g 1 — group by field 1
  • last 2 — take the last value of field 2 within each group

BTW, 4444 is not in list.txt but is shown in the final output, so I assume that list.txt is not required. If that was a typo, you can filter by it: datamash -t ' ' -g 1 last 2 < purchaselist.txt | grep -f list.txt.

Furthermore, if datamash is not installed and you do not have the privileges to install packages, you can use awk instead:

awk 'ARGIND==1{a[$0]}ARGIND==2{b[$1]=$2}END{for(i in a)if(i in b)print i,b[i]}' list.txt purchaselist.txt

This command consists of three parts: ARGIND == 1, ARGIND == 2, and END:

  • ARGIND == 1 — the current file is the first argument (you may regard it as argv[1], i.e. list.txt)
  • a[$0] — $0 is the whole line; store it as a key in a dictionary
  • ARGIND == 2 — the current file is the second argument, purchaselist.txt
  • b[$1] = $2 — create another dictionary storing the price ($2, the second field) of each item ($1); existing values are overwritten, so the last occurrence wins
  • END — runs after both files have been processed
  • for (i in a) if (i in b) — keep only keys present in both list.txt and purchaselist.txt
  • print i, b[i] — print the key and its value

Edit: ARGIND is a GNU awk extension; for non-GNU awk, one may use

awk 'NR==FNR{a[$0];next}{b[$1]=$2}END{for(i in a)if(i in b)print i,b[i]}' list.txt purchaselist.txt
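Both awk commands above assume a two-column purchaselist (item price), as does the datamash version. On such assumed data, a quick check of the portable variant (with a sort appended, since "for (i in a)" iterates in unspecified order):

```shell
# Assumed two-column input: item id, price (not the question's full record format).
printf '1111\n2222\n3333\n' > list.txt
cat > purchaselist.txt <<'EOF'
1111 210.00
1111 300.00
2222 110.00
2222 200.00
3333 110.00
0044 500.00
4444 800.00
EOF

# b[$1]=$2 overwrites on each occurrence, so the last (latest) price wins;
# the final loop keeps only items that are also listed in list.txt.
awk 'NR==FNR{a[$0];next}{b[$1]=$2}END{for(i in a)if(i in b)print i,b[i]}' \
    list.txt purchaselist.txt | sort > result.txt
```

Note that 0044 and 4444 are dropped because they are not in list.txt.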

Edit: OK... if you have multiple fields:

tac purchaselist.txt | sort -suk2,2 | grep -f list.txt
  • tac — make the newest records come first
  • -s — stable sort, keeping the newest-first order among lines with equal keys
  • -u — keep only one line per -k2,2 key, i.e. only the first (newest) record for each key value
  • -k2,2 — use field 2 (and only field 2) as the sort key
  • grep — filter out unwanted items
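One caveat with the pipeline above: plain grep -f does substring matching, so on the sample data the pattern 3333 would also match 153333 in the 4444 line. Adding -w (whole-word matching, not in the original answer) avoids that:

```shell
printf '1111\n2222\n3333\n' > list.txt
cat > purchaselist.txt <<'EOF'
0001 1111 210.00 abcd 10 A 151234 181234 ....
0011 1111 300.00 abcd 10 A 151000 181222 ....
0022 2222 110.00 abcd 10 E 151111 181000 ....
0099 2222 200.00 abcd 10 A 151222 181999 ....
0033 3333 110.00 abcd 10 A 151000 181222 ....
0044 0044 500.00 abcd 10 A 151999 181333 ....
8899 4444 800.00 abcd 10 A 153333 181777 ....
EOF

# Newest first, one record per key (field 2), then filter to the wanted keys.
# -w stops pattern 3333 from matching the substring in 153333.
tac purchaselist.txt | sort -suk2,2 | grep -wf list.txt > result.txt
```

The output is sorted by key (a side effect of sort) rather than in original file order.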
$ tac purchaselist.txt | awk 'NR==FNR{a[$1]; next} $2 in a{print; delete a[$2]}' list.txt - | tac
0011 1111 300.00 abcd 10 A 151000 181222 ....
0099 2222 200.00 abcd 10 A 151222 181999 ....
0033 3333 110.00 abcd 10 A 151000 181222 ....

Change $2 to $18 if that's the matching field number in your real data. The above works on unsorted data and shouldn't have any memory issues, as the awk command only stores the 580k small key strings from list.txt in memory.

The following requires that the files be sorted on the column they are to be joined on. The examples are sorted, so it is not unreasonable to assume that the real files can be, or already are, sorted.

join -j 1 list.txt purchaselist.txt | tac | rev | uniq -f 1 | rev | tac

I don't know whether this performs better, but it at least does not contain two levels of nested loops. It correctly produces the desired output once the test inputs have been amended to include 4444 in the list.txt file.
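Like the earlier answers, this assumes a two-column purchaselist (item price). A runnable check of the pipeline on such assumed data (with a stable pre-sort added, since join needs its inputs sorted on the join field and -s preserves the newest-last file order within each key):

```shell
# list.txt amended to include 4444, as noted above.
printf '1111\n2222\n3333\n4444\n' > list.txt
cat > purchaselist.txt <<'EOF'
1111 210.00
1111 300.00
2222 110.00
2222 200.00
3333 110.00
0044 500.00
4444 800.00
EOF

# join matches on field 1 of both files; tac|rev|uniq -f 1|rev|tac keeps the
# last line per key: uniq -f 1 skips the (reversed) price field and compares
# only the key, and the surrounding tac pair makes "first kept" mean "latest".
sort -s -k1,1 purchaselist.txt |
  join -j 1 list.txt - |
  tac | rev | uniq -f 1 | rev | tac > result.txt
```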

1111 300.00
2222 200.00
3333 110.00
4444 800.00

Tip: https://unix.stackexchange.com/questions/113898/how-to-merge-two-files-based-on-the-matching-of-two-columns
