I have 2 TSV files:
TSV file 1:
A B
hello 0.5
bye 0.4
TSV file 2:
C D
hello 1
country 5
I want to join the 2 TSV files together based on file1.A=file2.C
How can I do this with the join command in Linux?
Hoping to get this:
Text B D
hello 0.5 1
bye 0.4
country 5
Not getting any output with this:
join -j 1 <(sort -k1 file1.tsv) <(sort -k1 file2.tsv)
A little hairy, but here is a solution using awk and associative arrays.
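awk '
# First line of each file: remember its second-column header, then skip it
FNR == 1 {h[++n] = $2; next}
# Store column 2 by key: t1 for the first file, t2 for the second
FILENAME == ARGV[1] {t1[$1] = $2}
FILENAME == ARGV[2] {t2[$1] = $2}
END {print "Text\t" h[1] "\t" h[2]
     # Every key from the first file (t2[x] is empty when there is no match)
     for (x in t1) print x "\t" t1[x] "\t" t2[x]
     # Keys that occur only in the second file
     for (x in t2) if (!(x in t1)) print x "\t\t" t2[x]}' test1.tsv test2.tsv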
File1
$ cat file1
A B
hello 0.5
bye 0.4
File2
$ cat file2
C D
hello 1
country 5
Output
$ awk 'NR==1{print "Text","B","D"}FNR==1{next}FNR==NR{A[$1]=$2;next}{print $0,((f=($1 in A)) ? A[$1] : ""); if(f)delete A[$1]}END{for(i in A)print i,"",A[i]}' OFS='\t' file2 file1
Text B D
hello 0.5 1
bye 0.4
country 5
More readable version
awk '
# Print the header when NR == 1, i.e. only on the first line of the first file
NR==1{print "Text","B","D"}
# NR is the number of records read so far across all input files;
# FNR is the number of records read so far in the current input file.
# FNR == 1 is therefore the first line (the header) of each input file,
# so skip it and go to the next line
FNR==1{
next
}
# FNR==NR is only true while reading first file (file2)
FNR==NR{
# Build an associative array keyed on the first column of the file,
# where the array element is the second column
A[$1]=$2
# Skip all following blocks and process the next line
next
}
{
# Check whether the index ($1 = column 1) from the second argument (file1) exists in array A;
# variable f will be 1 (true) if it does, otherwise 0 (false)
# Print the current line followed by the element of array A at index $1,
# or by an empty field when there is no match
print $0,( (f = ($1 in A)) ? A[$1] : "" )
# Delete the array element for index $1 if f is true, so END does not print it again
if(f)delete A[$1]
}
# Finally, in the END block, print the remaining array elements one by one,
# i.e. the keys from file2 that do not exist in file1
END{
for(i in A)
print i,"",A[i]
}
' OFS='\t' file2 file1
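The NR/FNR distinction that the comments above rely on can be seen directly with a small sketch. It recreates the two sample files in a scratch directory (the directory and the output file name nr_fnr.txt are only illustrative) and prints both counters for every line:

```shell
# Work in a scratch directory so existing files are not overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Recreate the sample files from the question (tab-separated).
printf 'A\tB\nhello\t0.5\nbye\t0.4\n' > file1
printf 'C\tD\nhello\t1\ncountry\t5\n' > file2

# For every input line print: file name, NR (global count), FNR (per-file count).
awk '{print FILENAME, NR, FNR}' file2 file1 > nr_fnr.txt
cat nr_fnr.txt
# file2 1 1
# file2 2 2
# file2 3 3
# file1 4 1
# file1 5 2
# file1 6 3
```

NR and FNR are equal only while the first file (file2) is being read, which is why FNR==NR selects that file, and FNR==1 matches the header line of each file.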
In your title you state you want to perform an inner join. Your example output suggests you want an outer join.
If you want an inner join, as the title suggests, I recommend you use eBay's fabulous tsv-utils, particularly the tsv-join command, as follows:
tsv-join -H --filter-file 1.tsv --key-fields A --data-fields C --append-fields B 2.tsv
No awk magic needed, just a simple, well-documented command with easily understandable options.
The above produces a proper inner join; you'd just need to rename the join key to Text:
C D B
hello 1 0.5
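Since the question asked about join itself, here is a minimal sketch of the outer join from the expected output, assuming GNU coreutils join (the -o auto option is a GNU extension). The intermediate names file1.sorted, file2.sorted and joined.tsv are only illustrative, and the headers are stripped and re-added by hand because they would otherwise be sorted in with the data:

```shell
# Work in a scratch directory so existing files are not overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Recreate the sample files from the question (tab-separated).
printf 'A\tB\nhello\t0.5\nbye\t0.4\n' > file1.tsv
printf 'C\tD\nhello\t1\ncountry\t5\n' > file2.tsv

tab=$(printf '\t')

# Drop each header line, then sort on the key column as join requires.
tail -n +2 file1.tsv | sort -t "$tab" -k1,1 > file1.sorted
tail -n +2 file2.tsv | sort -t "$tab" -k1,1 > file2.sorted

# -t "$tab"  : use a tab as the field separator for input and output
# -a 1 -a 2  : also print unpairable lines from both files (full outer join)
# -e ''      : fill missing fields with an empty string
# -o auto    : pad every output line to the same number of fields (GNU only)
{
  printf 'Text\tB\tD\n'
  join -t "$tab" -a 1 -a 2 -e '' -o auto file1.sorted file2.sorted
} > joined.tsv

cat joined.tsv
```

Leaving out both -a options gives join's default behaviour, which prints only keys present in both files (here just the hello line), i.e. an inner join. The rows come out in sorted key order rather than in the order of the input files.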