
Count number of rows having mismatches between two columns while looping over all columns in a pairwise manner (awk)

I have a 2D matrix with thousands of tab-separated columns and tens of thousands of rows, and I want to compare the rows of two columns at a time: extract a pair of columns, compare them line by line, and count the lines where the two values differ. Then proceed to the next pair of columns. Comparisons have to be made for all pairs (columns 1-2, 1-3, 1-4, ..., 2-3, 2-4, ... and so on). The first row contains the header, which also needs to be printed so it is clear which two columns were compared. I have tried this:

awk -vj=${array1[i]} -vk=${array2[i]} '$j !~ "NN" && $k !~ "NN" {print $j,$k}' Input.txt | awk '{if ($1 !~ $2) diff += 1; }END {print diff/NR, diff-1, NR-1}; NR==1 {print $1,$2}' >> Output.txt

where array1 and array2 are files containing the numbers of the columns to be compared, looped over in bash. This works for me, but it takes far too long, which is no surprise: awk has to re-read the file (~400 GB) for every pair. Is there any way to loop over every column and every row and compare them pairwise in a single pass? Kindly note that any row containing 'NN' in either of the compared columns should be excluded. Here is the sample file, Input.txt:

MUN8-12 SAN1-3  SAN2-4
1   1   0
1   0   1
2   2   0
NN  0   0
0   0   NN
3   1   2
0   0   NN
0   0   0
1   NN  NN
1   2   1

The expected output, Output.txt, would be:

MUN8-12 SAN1-3
0.375   3   8
MUN8-12 SAN2-4
0.5 3   6
SAN1-3  SAN2-4
0.714285714 5   7

In the output, rows 1, 3 and 5 are the headers (names of the two columns compared), while rows 2, 4 and 6 contain: the ratio of mismatching rows to the total number of rows compared (rows having no "NN" value); the number of rows that differ between the two columns (excluding the header, hence the -1); and the number of rows compared (excluding the header).
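As a quick sanity check (these helper commands are mine, not part of the question), the figures for the first pair can be reproduced from Input.txt with cut and awk:

```shell
# Recreate the sample Input.txt from the question (tab-separated).
{
    printf 'MUN8-12\tSAN1-3\tSAN2-4\n'
    printf '1\t1\t0\n1\t0\t1\n2\t2\t0\nNN\t0\t0\n0\t0\tNN\n'
    printf '3\t1\t2\n0\t0\tNN\n0\t0\t0\n1\tNN\tNN\n1\t2\t1\n'
} > Input.txt

# Extract columns 1 and 2, skip the header and any row containing NN,
# then count mismatches (d) against rows compared (n).
cut -f1,2 Input.txt | awk '
    NR == 1                  { next }   # header row
    $1 == "NN" || $2 == "NN" { next }   # excluded rows
    { n++; d += ($1 != $2) }
    END { print d/n, d, n }'            # prints: 0.375 3 8
```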

Thanks for your help in advance.

Best

Akanksha

From what I understand, the following should resemble your original code:

$ awk -v n=1 -v m=2                                                 \
      '(FNR==1){print $n,$m; next}
       ($n == "NN") || ($m == "NN") { next }
       ($n != $m) { d++ }
       { c++ }
       END { print d/c, d, c }' file

Here n and m select the two columns to compare (the first two of your sample). Since the header row is printed and then skipped with next, d and c already exclude it, so no -1 correction is needed.

If you want to do this for all the columns in a single pass, you can do the following:

$ awk 'BEGIN{FS=OFS="\t"}
       (FNR==1) { h=$0; next }
       { for(i=1;i<NF;++i) {
           if ($i == "NN") { continue }
           for(j=i+1;j<=NF;++j) {
              if ($j == "NN") { continue }
              c[i,j]+=1
              d[i,j]+=($i != $j)
           }
         }
       }
       END { n=split(h,a)
             for(i=1;i<n;++i) {
               for(j=i+1;j<=n;++j) {
                 print a[i],a[j] ORS d[i,j]/c[i,j],d[i,j],c[i,j]
               }
             }
       }' file

Note that the header row is skipped with next, so c and d count only data rows and no -1 correction is needed in the END block.
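One edge case worth guarding (my addition, not part of the original answer): if every row of some column pair contains NN, c[i,j] stays 0 and d[i,j]/c[i,j] divides by zero. A minimal sketch of the same pairwise pass with that guard:

```shell
# pairwise.awk: same logic, but print NA for pairs with no comparable rows.
cat > pairwise.awk <<'EOF'
BEGIN { FS = OFS = "\t" }
FNR == 1 { h = $0; next }            # remember the header, do not compare it
{
    for (i = 1; i < NF; ++i) {
        if ($i == "NN") continue
        for (j = i + 1; j <= NF; ++j) {
            if ($j == "NN") continue
            c[i, j]++                # rows compared for this pair
            d[i, j] += ($i != $j)    # rows that differ
        }
    }
}
END {
    n = split(h, a)
    for (i = 1; i < n; ++i)
        for (j = i + 1; j <= n; ++j)
            if (c[i, j])
                print a[i], a[j] ORS d[i, j] / c[i, j], d[i, j], c[i, j]
            else
                print a[i], a[j] ORS "NA", 0, 0   # no comparable rows
}
EOF

# A pair where every row is excluded: column B is all NN.
printf 'A\tB\n1\tNN\n2\tNN\n' > tiny.txt
awk -f pairwise.awk tiny.txt
```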

This code is not tested, since we don't have access to a simple input file.

Split the file so you have one column per file: e.g. transpose the file, then for each row of the transposed file, transpose the row back and save it to its own file.

Then write a program that takes two files and does the calculation for those two files.

Finally, run that program in parallel over all combinations (run a+b, but not b+a):

parallel --plus compare_two {choose_k} ::: files* ::: files*

This will avoid reading the full 400 GB file again and again; only the relevant columns are read, and the first of the two columns will often still be in the disk cache.
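The answer leaves compare_two and the column files unspecified; here is one possible sketch (the names compare_two, col_* and demo.txt are mine), assuming tab-separated input:

```shell
# Small demo input (stand-in for the real 400 GB file).
printf 'A\tB\n1\t1\n1\t0\nNN\t0\n' > demo.txt

# One file per column: col_1, col_2, ...  cut re-reads the input once per
# column; for thousands of columns, a single awk pass that writes and
# close()s each output file may be faster.
ncols=$(head -n1 demo.txt | awk -F'\t' '{print NF}')
for i in $(seq "$ncols"); do
    cut -f"$i" demo.txt > "col_$i"
done

# compare_two: paste the two single-column files side by side, skip the
# header and NN rows, and count mismatches as before.
compare_two() {
    paste "$1" "$2" | awk '
        NR == 1                  { print $1, $2; next }
        $1 == "NN" || $2 == "NN" { next }
        { n++; d += ($1 != $2) }
        END { print d/n, d, n }'
}

compare_two col_1 col_2
```

To use this with the parallel invocation above, the function would need to be visible to parallel, e.g. via `export -f compare_two` in bash, or by saving it as a standalone script.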
