
Count the number of rows with mismatches between two columns, looping over all columns pairwise, in awk

I have a tab-separated matrix (2D) with thousands of columns and tens of thousands of rows, and I want to compare the rows of two columns at a time: extract a pair of columns, compare them line by line, and count the lines where the two columns differ, then move on to the next pair. Comparisons have to be made for all pairs (columns 1-2, 1-3, 1-4, ..., 2-3, 2-4, and so on). The first row contains the header, which also needs to be printed so I can see which two columns were compared. I have tried this:

awk -vj=${array1[i]} -vk=${array2[i]} '$j !~ "NN" && $k !~ "NN" {print $j,$k}' Input.txt | awk '{if ($1 !~ $2) diff += 1; }END {print diff/NR, diff-1, NR-1}; NR==1 {print $1,$2}' >> Output.txt

where array1 and array2 are arrays holding the numbers of the columns to be compared, looped over in bash. This works for me, but it takes far too long, which is no surprise: awk has to re-read the file (~400 GB) again and again, once per pair. Is there a way to loop over every column and every row and compare them pairwise in a single pass? Note that any row containing 'NN' in either of the compared columns should be excluded. Here is the sample file, Input.txt:

MUN8-12 SAN1-3  SAN2-4
1   1   0
1   0   1
2   2   0
NN  0   0
0   0   NN
3   1   2
0   0   NN
0   0   0
1   NN  NN
1   2   1

The expected output, Output.txt, would be:

MUN8-12 SAN1-3
0.375   3   8
MUN8-12 SAN2-4
0.5 3   6
SAN1-3  SAN2-4
0.714285714 5   7

In the output, rows 1, 3 and 5 are the headers (the names of the two columns compared), while rows 2, 4 and 6 hold three numbers: the ratio of mismatching rows to rows compared (rows with "NN" in either column excluded); the number of mismatching rows (header excluded, hence the -1 in my code); and the number of rows compared (header excluded). For example, for MUN8-12 vs SAN1-3, 8 rows are free of "NN" in both columns and 3 of them differ, giving 3/8 = 0.375.

Thanks for your help in advance

Best

Akanksha

From what I understand, the following should resemble your original code:

$ awk -v n=1 -v m=2                                                 \
      '(FNR==1){ print $n, $m; next }        # header: print the column names
       ($n == "NN") || ($m == "NN") { next } # skip rows with NN in either column
       ($n != $m) { d++ }                    # count mismatching rows
       { c++ }                               # count rows actually compared
       END { print d/c, d, c }' file
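A self-contained way to sanity-check that idea against the sample (the printf just recreates Input.txt from the question; n=1, m=2 selects the first pair of columns):

```shell
# Recreate the tab-separated sample Input.txt from the question.
printf 'MUN8-12\tSAN1-3\tSAN2-4\n1\t1\t0\n1\t0\t1\n2\t2\t0\nNN\t0\t0\n0\t0\tNN\n3\t1\t2\n0\t0\tNN\n0\t0\t0\n1\tNN\tNN\n1\t2\t1\n' > Input.txt

# Compare columns 1 and 2; the header is printed, then skipped before counting,
# so no -1 adjustments are needed.
awk -v n=1 -v m=2 '
  FNR==1 { print $n, $m; next }      # names of the compared columns
  $n == "NN" || $m == "NN" { next }  # exclude rows with NN in either column
  $n != $m { d++ }                   # mismatching rows
  { c++ }                            # rows actually compared
  END { print d/c, d, c }            # ratio, mismatches, rows compared
' Input.txt > pair12.out
```

For the sample this should report 3 mismatches over 8 compared rows, matching the first pair in the expected Output.txt.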

If you want to do this for all the columns in a single go, you can do the following:

$ awk 'BEGIN{ FS=OFS="\t"; OFMT="%.9g" }    # tab-separated; print ratios with 9 digits
       (FNR==1) { h=$0; next }              # remember the header, keep it out of the counts
       { for(i=1;i<NF;++i) {
           if ($i == "NN") { continue }     # NN in column i: skip all pairs (i,*) for this row
           for(j=i+1;j<=NF;++j) {
              if ($j == "NN") { continue }  # NN in column j: skip pair (i,j) for this row
              c[i,j]+=1                     # rows compared for pair (i,j)
              d[i,j]+=($i != $j)            # mismatching rows for pair (i,j)
           }
         }
       }
       END { n=split(h,a)
             for(i=1;i<n;++i) {
               for(j=i+1;j<=n;++j) {
                 if (c[i,j]) { print a[i],a[j] ORS d[i,j]/c[i,j],d[i,j],c[i,j] }
               }
             }
       }' file

This code is only lightly tested against the sample input; verify it on a small slice of your data before running it on the full 400 GB file.

Split the file so you have one column per file, e.g. by transposing the file and saving each row of the transposed file to its own file.

Then write a program that takes 2 files and does the calculation for 2 files.

Finally, run that program in parallel over all combinations (run a+b, but not b+a):

parallel --plus compare_two {choose_k} ::: files* ::: files*

This will avoid reading the full 400 GB file again and again, but will only read the relevant columns. The first of the two columns will often be in cache.
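A minimal sketch of this pipeline, assuming the sample from the question; the col1, col2, ... file names and the compare_two helper are made up for illustration:

```shell
# Recreate the tab-separated sample Input.txt from the question.
printf 'MUN8-12\tSAN1-3\tSAN2-4\n1\t1\t0\n1\t0\t1\n2\t2\t0\nNN\t0\t0\n0\t0\tNN\n3\t1\t2\n0\t0\tNN\n0\t0\t0\n1\tNN\tNN\n1\t2\t1\n' > Input.txt

# 1. One file per column, in a single pass over the big file.
#    (With 1000s of columns, close() files periodically to stay under the fd limit.)
awk -F'\t' '{ for (i = 1; i <= NF; i++) print $i > ("col" i) }' Input.txt

# 2. Mismatch statistics for two single-column files (header on line 1).
compare_two() {
  paste "$1" "$2" |
  awk 'FNR==1 { print $1, $2; next }      # names of the compared columns
       $1 == "NN" || $2 == "NN" { next }  # exclude rows with NN in either column
       { c++; d += ($1 != $2) }
       END { if (c) print d/c, d, c }'
}

# 3. All unordered pairs, e.g. with GNU parallel (export -f so parallel sees the function):
#    export -f compare_two
#    parallel --plus compare_two {choose_k} ::: col* ::: col*
compare_two col1 col2 > pair.out
```

Each worker then reads only the two relevant column files instead of the whole matrix.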
