
grep based on first column

I have a big data file called fileA having the following format

col1    0.1111,0.2222,0.33333,0.4444
col5    0.1111,0.2222,0.33333,0.4444
col3    0.1111,0.2222,0.33333,0.4444
col4    0.1111,0.2222,0.33333,0.4444

The separator between the 1st and 2nd columns is a tab (\t); the other separators are commas. I have another file, fileB, containing the names of the rows I am interested in, which looks like:

col3
col1
...

Neither file is sorted. I want to retrieve all the rows from fileA whose names appear in fileB. The command grep -f fileB fileA does the job, but I think it searches all fields in fileA, which takes a long time. How can I restrict the search to the 1st column of fileA?
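One way to keep plain grep but restrict it to the first column is to turn each name in fileB into a pattern anchored at the start of the line and followed by a tab. A minimal sketch, assuming the row names contain no regex metacharacters (patterns.txt is a scratch file introduced here, not part of the question):

```shell
# Sample data from the question (first column is tab-separated).
printf 'col1\t0.1111,0.2222\ncol5\t0.1111,0.2222\ncol3\t0.1111,0.2222\ncol4\t0.1111,0.2222\n' > fileA
printf 'col3\ncol1\n' > fileB

# Turn each row name into an anchored pattern "^name<TAB>" so grep
# can only match at the start of a line, i.e. in the first column.
# (Assumes the names contain no regex metacharacters like . or *.)
awk '{print "^" $0 "\t"}' fileB > patterns.txt

# Prints the col1 and col3 lines, in fileA's original order.
grep -f patterns.txt fileA
```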

join -t $'\t' <(sort -t $'\t' -k1,1 fileA) <(sort -t $'\t' -k1,1 fileB)

Sorting the files is O(n log n + p log p), then they're merged in O(n + p); I don't think we can do better than that.
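A quick run of the join approach on the question's sample data (note that join emits matches in sorted order, not fileA's original order, and that process substitution requires bash; the -t $'\t' on join keeps the original tab separator in the output):

```shell
# Same sample files as in the question.
printf 'col1\t0.1111,0.2222\ncol5\t0.1111,0.2222\ncol3\t0.1111,0.2222\ncol4\t0.1111,0.2222\n' > fileA
printf 'col3\ncol1\n' > fileB

# Sort both inputs on the join field, then merge them.
# Prints the col1 and col3 lines, tab-separated, in sorted order.
join -t $'\t' <(sort -t $'\t' -k1,1 fileA) <(sort fileB)
```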

EDIT: OK, we can do better with a hash table, which is O(n + p).

A linear-time O(n + p) solution without sorting (I didn't test it, hope there's no typo):

awk -F'\t' 'NR==FNR{a[$0]=7;next}a[$1]' fileB fileA

Note that a get operation on a hash table is considered O(1).
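Since the one-liner above was posted untested, here is a quick run on the question's sample data. The first pass (NR==FNR, i.e. while reading fileB) loads each row name into the array a; the second pass prints each fileA line whose first tab-separated field is a key:

```shell
# Sample files from the question.
printf 'col1\t0.1111,0.2222\ncol5\t0.1111,0.2222\ncol3\t0.1111,0.2222\ncol4\t0.1111,0.2222\n' > fileA
printf 'col3\ncol1\n' > fileB

# First pass loads fileB names as hash keys; second pass prints
# fileA lines whose first field is a key (any truthy value works,
# the 7 is arbitrary). Output keeps fileA's original line order.
awk -F'\t' 'NR==FNR{a[$0]=7;next}a[$1]' fileB fileA
```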
