I have two files, list.txt and purchaselist.txt, which are fairly large, and I am trying to get the latest purchase details (there are duplicates in purchaselist.txt). Let's say that these are the file contents:
list.txt
1111
2222
3333
purchaselist.txt
0001 1111 210.00 abcd 10 A 151234 181234 ....
0011 1111 300.00 abcd 10 A 151000 181222 ....
0022 2222 110.00 abcd 10 E 151111 181000 ....
0099 2222 200.00 abcd 10 A 151222 181999 ....
0033 3333 110.00 abcd 10 A 151000 181222 ....
0044 0044 500.00 abcd 10 A 151999 181333 ....
8899 4444 800.00 abcd 10 A 153333 181777 ....
I am doing this using grep and a simple while loop. Here is my command:
while read line; do tac purchaselist.txt | grep -m1 $line; done < list.txt >> result.txt
My expected output, which I am already getting, looks like this:
0011 1111 300.00 abcd 10 A 151000 181222 ....
0099 2222 200.00 abcd 10 A 151222 181999 ....
0033 3333 110.00 abcd 10 A 151000 181222 ....
The above output is derived by picking the latest row from purchaselist.txt for each item, which is why I used tac. The values in list.txt appear as column number 18 in purchaselist.txt. The problem here is that the files are huge: list.txt contains 580k records, and I am looking for these records in purchaselist.txt, which has ~1.7M records. The above script has been running for almost 20 hours and hasn't reached halfway. How can I optimize the processing time?
The script is slow because, for every word in list.txt, you read the whole purchaselist.txt; in your case it will be read 580K times. In addition, bash doesn't run fast over that many iterations.
If other methods are acceptable, you can use datamash:
datamash -t ' ' -g 1 last 2 < purchaselist.txt
- -t ' ' : field delimiter = space
- -g 1 : group by field 1
- last 2 : last value of field 2

(Note that with the multi-field sample in the question, the item ID is in field 2, so you would group with -g 2 there.)

BTW, 4444 is not in list.txt but is shown in the final output, so I assume that list.txt is not required. If that was a typo, you can use:

datamash -t ' ' -g 1 last 2 < purchaselist.txt | grep -f list.txt
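For the record, here is how the command behaves on a simplified two-column purchaselist (item and price), which is the layout the command above targets; the scratch files and the `command -v` guard (so the sketch degrades gracefully where datamash is not installed) are demo scaffolding:

```shell
# Demo of datamash "last value per group" on two-column input.
cd "$(mktemp -d)"

cat > purchaselist.txt <<'EOF'
1111 210.00
1111 300.00
2222 110.00
2222 200.00
3333 110.00
4444 800.00
EOF

# -t ' '  : space as the field delimiter
# -g 1    : group on field 1 (the item ID in this layout)
# last 2  : keep the last seen value of field 2 per group
if command -v datamash >/dev/null 2>&1; then
    datamash -t ' ' -g 1 last 2 < purchaselist.txt
fi
```

Note that datamash expects the input to be grouped on the -g field already (as it is here); for unsorted input, add the -s option so datamash sorts first.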
Furthermore, if datamash is not installed and you do not have the privileges to install packages, you can use awk instead:
awk 'ARGIND==1{a[$0]}ARGIND==2{b[$1]=$2}END{for(i in a)if(i in b)print i,b[i]}' list.txt purchaselist.txt
This command consists of three parts: ARGIND == 1, ARGIND == 2, and END.

- ARGIND == 1 means argument index 1 (you may regard it as argv[1], i.e. list.txt):
  - a[$0] : $0 is the whole line; put it in a dictionary
- ARGIND == 2 is the second argument (purchaselist.txt):
  - b[$1] = $2 : create another dictionary storing the price ($2, the second field) of each item ($1); existing values are overwritten in this way, so the last one wins
- END runs after these two files are processed:
  - for (i in a) if (i in b) : if the key is in both list.txt and purchaselist.txt
  - print i, b[i] : print the key and the value

Edit: For non-GNU awk, one may use:
awk 'NR==FNR{a[$0];next}{b[$1]=$2}END{for(i in a)if(i in b)print i,b[i]}' list.txt purchaselist.txt
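As a quick sanity check, the portable NR==FNR version can be tried on throwaway files; this sketch assumes the simplified two-column purchaselist (item and price) that the commands above operate on, and the scratch directory is demo scaffolding:

```shell
# Demo of the two-pass awk approach on a simplified two-column
# purchaselist (item price). Runs in a throwaway directory.
cd "$(mktemp -d)"

printf '%s\n' 1111 2222 3333 > list.txt
cat > purchaselist.txt <<'EOF'
1111 210.00
1111 300.00
2222 110.00
2222 200.00
3333 110.00
4444 800.00
EOF

# Pass 1 (NR==FNR) records the wanted keys from list.txt; pass 2
# keeps overwriting b[$1], so only the last price per item survives.
awk 'NR==FNR{a[$0];next}{b[$1]=$2}END{for(i in a)if(i in b)print i,b[i]}' \
    list.txt purchaselist.txt | sort
```

The trailing sort is only there to make the demo output deterministic: awk's `for (i in a)` iterates in an unspecified order.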
Edit: OK, if you have multiple fields:
tac purchaselist.txt | sort -suk2,2 | grep -f list.txt

- tac : make the newest records come first
- sort -s : stable sort, to keep the original (reversed) order among records with the same key
- -u : take unique ones; that is, only keep the first record for a specific key value
- -k2,2 : use fields 2 through 2 (the second field, the item ID) as the key
- grep -f list.txt : filter out unwanted items (note that plain grep -f matches the patterns anywhere in the line, so e.g. 3333 would also match 153333; grep -wf avoids such false positives)

Alternatively, with awk alone:

$ tac purchaselist.txt | awk 'NR==FNR{a[$1]; next} $2 in a{print; delete a[$2]}' list.txt - | tac
0011 1111 300.00 abcd 10 A 151000 181222 ....
0099 2222 200.00 abcd 10 A 151222 181999 ....
0033 3333 110.00 abcd 10 A 151000 181222 ....
Change $2 to $18 if that's the matching field number in your real data. The above will work on unsorted data and shouldn't have any memory issues, as the awk command only stores the 580k small key strings from list.txt in memory.
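As a concrete check, here is that pipeline run against the sample data from the question; creating the scratch files is just demo scaffolding:

```shell
# Demo of the tac | awk | tac pipeline on the question's sample data.
cd "$(mktemp -d)"

printf '%s\n' 1111 2222 3333 > list.txt
cat > purchaselist.txt <<'EOF'
0001 1111 210.00 abcd 10 A 151234 181234 ....
0011 1111 300.00 abcd 10 A 151000 181222 ....
0022 2222 110.00 abcd 10 E 151111 181000 ....
0099 2222 200.00 abcd 10 A 151222 181999 ....
0033 3333 110.00 abcd 10 A 151000 181222 ....
0044 0044 500.00 abcd 10 A 151999 181333 ....
8899 4444 800.00 abcd 10 A 153333 181777 ....
EOF

# Reverse so the newest record per item comes first, print the first
# hit for each wanted key (deleting it so older duplicates are
# skipped), then restore the original order with the final tac.
tac purchaselist.txt \
  | awk 'NR==FNR{a[$1]; next} $2 in a{print; delete a[$2]}' list.txt - \
  | tac
```

This prints exactly the three expected lines, in the original file order.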
The following requires that the files be sorted on the column they are to be joined on. The examples were sorted, so it is not unreasonable to assume that the real files can be, or already are, sorted.
join -j 1 list.txt purchaselist.txt | tac | rev | uniq -f 1 | rev | tac
I don't know if this would perform better, but it at least does not contain two levels of nested loops. It correctly produces the desired output once the test inputs have been amended to include 4444 in the list.txt file.
1111 300.00
2222 200.00
3333 110.00
4444 800.00
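A runnable sketch of this pipeline, assuming the simplified two-column purchaselist (item and price) with both files already sorted on the join key, and list.txt amended to include 4444 as described above; the scratch files are demo scaffolding:

```shell
# Demo of the join-based pipeline on sorted two-column input.
cd "$(mktemp -d)"

printf '%s\n' 1111 2222 3333 4444 > list.txt
cat > purchaselist.txt <<'EOF'
1111 210.00
1111 300.00
2222 110.00
2222 200.00
3333 110.00
4444 800.00
EOF

# join keeps only the items present in list.txt; tac puts the newest
# record per item first; rev reverses each line so the key ends up at
# the end and uniq -f 1 can skip the price and deduplicate on the key;
# the trailing rev | tac restore the original orientation and order.
join -j 1 list.txt purchaselist.txt | tac | rev | uniq -f 1 | rev | tac
```

The rev trick works here because the price is a single field; uniq -f 1 skips exactly one field before comparing.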