
Join two TSV files with inner join

I have 2 TSV files:

TSV file 1:
    A      B 
  hello   0.5
  bye     0.4

TSV file 2:
 C        D
hello     1
country   5

I want to join the 2 TSV files together based on file1.A=file2.C

How can I do this with the join command on Linux?

Hoping to get this:

Text     B    D
hello    0.5  1
bye      0.4  
country       5

Not getting any output with this:

join -j 1 <(sort -k1 file1.tsv) <(sort -k1 file2.tsv) 

A little hairy, but here is a solution using awk and associative arrays.

awk 'FNR == 1 {h[length(h) + 1] = $2}
     FILENAME ~ /test1.tsv/ && FNR > 1 {t1[$1]=$2}
     FILENAME ~ /test2.tsv/ && FNR > 1 {t2[$1]=$2}
     END{print "Text\t"h[1]"\t"h[2];
         for(x in t1){print x"\t"t1[x]"\t"t2[x]}
         for(x in t2){print x"\t"t1[x]"\t"t2[x]}}' test1.tsv test2.tsv | 
  sort | uniq
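
One caveat with the pipeline above: sort treats the header like any other line, so depending on the locale the Text row may not come out first. A minimal sketch of one way around that, assuming tab-separated input with exactly one header line per file. The helper name outer_join_awk and the 0/1 sort prefix are illustrative choices, not part of the original answer:

```shell
# Recreate the sample data from the question as real tab-separated files.
printf 'A\tB\nhello\t0.5\nbye\t0.4\n' > test1.tsv
printf 'C\tD\nhello\t1\ncountry\t5\n' > test2.tsv

# outer_join_awk FILE1 FILE2 (hypothetical helper name)
outer_join_awk() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1 { n++; h[n] = $2; next }       # remember each value-column name
        n == 1   { t1[$1] = $2; keys[$1] = 1 }  # rows of the first file
        n == 2   { t2[$1] = $2; keys[$1] = 1 }  # rows of the second file
        END {
            # Prefix 0 keeps the header ahead of the data rows (prefix 1).
            print 0, "Text", h[1], h[2]
            for (k in keys) print 1, k, t1[k], t2[k]
        }' "$1" "$2" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2 |
    cut -f2-
}

outer_join_awk test1.tsv test2.tsv
```

Collecting every key in a single array also removes the need for uniq, since each key is printed exactly once.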

File1

$ cat file1
A      B 
hello   0.5
bye     0.4

File2

$ cat file2
C        D
hello     1
country   5

Output

$ awk 'NR==1{print "Text","B","D"}FNR==1{next}FNR==NR{A[$1]=$2;next}{f=($1 in A); print $0,(f ? A[$1] : ""); if(f)delete A[$1]}END{for(i in A)print i,"",A[i]}' OFS='\t' file2 file1
Text    B   D
hello   0.5 1
bye     0.4 
country     5

More readable version

awk '
     # Print the header once, on the very first record read (NR==1)
     NR==1{print "Text","B","D"}

     # NR counts records across all input files;
     # FNR restarts at 1 for each input file.
     # FNR==1 is therefore the header line of each file, which we skip.
     FNR==1{
             next
           }

     # FNR==NR is only true while reading the first file (file2)
     FNR==NR{
              # Build an associative array keyed on the first column,
              # with the second column as the value
              A[$1]=$2

              # Skip the remaining blocks and process the next line
              next 
            }
           {
              # f is 1 (true) if the key ($1) of this file1 line exists in A,
              # otherwise 0 (false)
              f=($1 in A)

              # Print the current line followed by the matching value from file2,
              # or an empty field when there is no match
              print $0,( f ? A[$1] : "" )

              # Delete the matched element so it is not printed again in END
              if(f)delete A[$1]
            }

         # Finally, in the END block, print the remaining array elements,
         # i.e. the keys from file2 that do not exist in file1
         END{
               for(i in A)
                   print i,"",A[i]
            }
    ' OFS='\t' file2 file1

In your title you state that you want to perform an inner join, but your example output suggests you actually want a full outer join.
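
For the outer join shown in the example output, the join command itself can probably do the job. A minimal sketch, assuming the files are genuinely tab-separated with one header line each; the helper name outer_join_sorted and the hard-coded Text/B/D header are illustrative assumptions:

```shell
# Recreate the sample data from the question as real tab-separated files.
printf 'A\tB\nhello\t0.5\nbye\t0.4\n' > file1.tsv
printf 'C\tD\nhello\t1\ncountry\t5\n' > file2.tsv

# outer_join_sorted FILE1 FILE2 (hypothetical helper name)
outer_join_sorted() {
    tab=$(printf '\t')
    tmp=$(mktemp -d)
    # join needs both inputs sorted on the key, without the header lines.
    tail -n +2 "$1" | LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/left"
    tail -n +2 "$2" | LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/right"
    printf 'Text\tB\tD\n'
    # -a 1 -a 2 keeps unpaired lines from both files; -o fixes the column
    # layout (key, B, D), so a missing value becomes an empty field.
    LC_ALL=C join -t "$tab" -a 1 -a 2 -o 0,1.2,2.2 "$tmp/left" "$tmp/right"
    rm -rf "$tmp"
}

outer_join_sorted file1.tsv file2.tsv
```

Sorting and joining under the same LC_ALL=C setting matters, because join expects its inputs in the collation order it uses itself.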

If you want an inner join, as the title suggests, I recommend eBay's fabulous tsv-utils, particularly the tsv-join command, as follows:

tsv-join -H --filter-file 1.tsv --key-fields A --data-fields C --append-fields B 2.tsv

No awk magic needed, just a simple, well-documented command with easily understandable options.

The above produces a proper inner join; you would just need to rename the join key to Text:

C       D       B
hello   1       0.5
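
If tsv-utils is not available, a similar inner join can be sketched with the standard sort and join tools. This assumes tab-separated input with one header line per file; the helper name inner_join_sorted and the hard-coded header are illustrative, and the column order (Text, B, D) differs from the tsv-join output above:

```shell
# Recreate the sample data from the question as real tab-separated files.
printf 'A\tB\nhello\t0.5\nbye\t0.4\n' > file1.tsv
printf 'C\tD\nhello\t1\ncountry\t5\n' > file2.tsv

# inner_join_sorted FILE1 FILE2 (hypothetical helper name)
inner_join_sorted() {
    tab=$(printf '\t')
    tmp=$(mktemp -d)
    # Drop the header lines and sort both inputs on the key column.
    tail -n +2 "$1" | LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/left"
    tail -n +2 "$2" | LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/right"
    printf 'Text\tB\tD\n'
    # Without -a, join prints only keys present in both files.
    LC_ALL=C join -t "$tab" "$tmp/left" "$tmp/right"
    rm -rf "$tmp"
}

inner_join_sorted file1.tsv file2.tsv
```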


 