Getting common lines from multiple files for specific fields only

Question

I am trying to understand the following code, which uses Bash to pull out the lines that overlap across multiple files.

awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indexed by the number of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) 
      dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s", rec[R], R) : \
        sprintf("\t%-20s -->\t%s", rec[R], R)
    }
  # loop over the dup array
  # and report the number and the names of the files 
  # containing the record   
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }  
  }
{  
  # build an array named rec (short for record), indexed by 
  # the content of the current record ($0), concatenating 
  # the filenames separated by / as values
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }' file[a-d]

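Once I understand what each block of code is doing, I would like to extend this code to find lines in which specific fields overlap, rather than entire lines. For example, I tried changing the line: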

n = split(rec[R], t, "/")

to

n = split(rec[R$1], t, "/")

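to find the lines whose first field is the same in all files, but this did not work. Ultimately I would like to extend this to check whether lines share the same fields 1, 2 and 4, and then print those lines.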

Specifically, for the files mentioned in the example in the link: if file 1 is:

chr1    31237964    NP_055491.1    PUM1    M340L
chr1    33251518    NP_037543.1    AK2    H191D

and file 2 is:

chr1    116944164    NP_001533.2    IGSF3    R671W
chr1    33251518    NP_001616.1    AK2    H191D
chr1    57027345    NP_001004303.2    C1orf168    P270S

I would like to pull out:

file1/file2 --> chr1    33251518    AK2    H191D

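I found this code at the following link: http://www.unix.com/shell-programming-and-scripting/140390-get-common-lines-multiple-files.html#post302437738 . Specifically, I would like to understand what R, rec, n, dup and D represent in terms of the files themselves. It is not clear to me from the comments provided, and the printf statements I added inside the loops failed.

Thank you very much for any insight into this!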

Answer 1

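The script works by building an auxiliary array whose indices are the lines of the input files (denoted by $0 in rec[$0]) and whose values are filename1/filename3/..., listing the files in which the given line $0 is present. You can hack it to use only $1, $2 and $4 as follows: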

awk 'END {
  # the END block is executed after
  # all the input has been read
  # loop over the rec array
  # and build the dup array indexed by the number of
  # filenames containing a given record
  for (R in rec) {
    n = split(rec[R], t, "/")
    if (n > 1) {
        split(R,R1R2R4,SUBSEP)
        dup[n] = dup[n] ? dup[n] RS sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3]) : \
          sprintf("\t%-20s -->\t%s\t%s\t%s", rec[R], R1R2R4[1],R1R2R4[2],R1R2R4[3])
      }
    }
  # loop over the dup array
  # and report the number and the names of the files 
  # containing the record   
  for (D in dup) {
    printf "records found in %d files:\n\n", D
    printf "%s\n\n", dup[D]
    }  
  }
{  
  # build an array named rec (short for record), indexed by 
  # the partial content of the current record
  # (special concatenation of $1, $2 and $4)
  # concatenating the filenames separated by / as values
  rec[$1,$2,$4] = rec[$1,$2,$4] ? rec[$1,$2,$4] "/" FILENAME : FILENAME
  }' file[a-d]

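This solution makes use of multidimensional arrays: we create rec[$1,$2,$4] instead of rec[$0]. This special awk syntax concatenates the indices with the SUBSEP character, which by default is non-printable ("\034" to be precise), so it is very unlikely to be part of any of the fields. In effect it does rec[$1 SUBSEP $2 SUBSEP $4]=... . Otherwise this part of the code is the same. Note that it would be more logical to move the second block to the beginning of the script and finish with the END block.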
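To make the SUBSEP mechanism concrete, here is a minimal sketch (not part of the original script) that feeds one line from the example files to awk. It checks that the comma form and the explicit SUBSEP form address the same array element, and then splits the combined index back into its parts. The names key and part are made up for this illustration.

```shell
# One sample record taken from the question's example data.
out=$(printf 'chr1 33251518 NP_037543.1 AK2 H191D\n' | awk '{
  rec[$1,$2,$4] = 1                    # comma-separated subscripts
  key = $1 SUBSEP $2 SUBSEP $4         # the same subscript, spelled out
  print ((key in rec) ? "same element" : "different element")

  n = split(key, part, SUBSEP)         # undo the concatenation
  printf "%d parts: %s %s %s\n", n, part[1], part[2], part[3]
}')
printf '%s\n' "$out"
```

If a field could ever contain the SUBSEP character itself, the split would return more than three parts, so this approach assumes ordinary text input.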

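The first part of the code also has to change: for (R in rec) now loops over these tricky concatenated indices, $1 SUBSEP $2 SUBSEP $4. This is fine for indexing, but you need to split R at the SUBSEP characters to get the printable fields $1, $2, $4 back. These are put into the array R1R2R4, which can be used to print the necessary output: instead of %s,...,R we now have %s\t%s\t%s,...,R1R2R4[1],R1R2R4[2],R1R2R4[3]. In effect we are doing sprintf ...%s,...,$1,$2,$4 with the previously saved fields $1, $2, $4. For your input example, this prints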

records found in 2 files:

    foo11.inp1/foo11.inp2 -->   chr1    33251518    AK2

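Note that the output lacks H191D, and rightly so: it is not in field 1, 2 or 4 (it is in field 5), so there is no guarantee that it is the same in the files being printed! You probably don't want to print it, or at any rate you have to specify how to treat columns that are not checked between files (and so might differ).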

Some explanation of the original code:

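rec is an array whose indices are the full lines of the input and whose values are slash-separated lists of the files in which those lines appear. For example, if file1 contains the line "foo bar", then initially rec["foo bar"]=="file1". If file2 also contains this line, then rec["foo bar"]=="file1/file2". Note that there is no check for multiplicity, so if file1 contains this line twice, you end up with rec["foo bar"]=="file1/file1/file2" and get 3 as the number of files containing this line.
R runs over the indices of the array rec once it has been fully built. This means that R eventually takes the value of every unique line of every input file, allowing us to loop over rec[R], which contains the names of the files in which the specific line R is present.
n is the return value of split, which splits the value of rec[R] (that is, the list of file names corresponding to line R) at each slash. In the end the array t is filled with the list of files, but we don't make use of it; we only use the length of t, i.e. the number of files in which line R is present (this is what is saved in the variable n). If n==1 we do nothing; we only act when there is multiplicity.
The loop over R sorts the lines into classes by multiplicity n: n==2 applies to lines present in exactly 2 files, n==3 to those appearing three times, and so on. What the loop does is build an array dup which, for each multiplicity class (i.e. for each n), holds the output strings "filename1/filename2/... --> R", one for each value of R that appears n times in the files, separated by RS (the record separator). So in the end dup[n] for a given n contains a number of strings of the form "filename1/filename2/... --> R", joined with the RS character (newline by default).
The loop over D in dup then goes through the multiplicity classes (i.e. the valid values of n larger than 1) and, for each D, prints the gathered output lines stored in dup[D]. Since dup[n] is only defined for n>1, the smallest possible D is 2 if there is any multiplicity (if there is none, dup is empty and the loop over D does nothing). Note that awk does not guarantee the order in which for (D in dup) visits the indices.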

Answer 2

First, you need to understand the 3 blocks in an AWK script:

BEGIN{
# code that is executed once, before data processing starts
}

{
# block without a name (default/main block)
# executed per line of input
# $0 contains all line data/columns
# $1 first column
# $2 second column, and so on..
}

END{
# code that is executed once, after all data processing has finished
}
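As a quick illustration of the order in which the three blocks run, here is a minimal sketch on two invented input lines; the variable count is just a name chosen for this example.

```shell
out=$(printf 'a b\nc d\n' | awk '
BEGIN { print "start" }              # runs once, before any input is read
{ count++; print NR ": " $1 }        # runs for every input line
END   { print "lines: " count }      # runs once, after the last line
')
printf '%s\n' "$out"
```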

So you may need to edit this part of the script:

  {  
  # build an array named rec (short for record), indexed by 
  # the content of the current record ($0), concatenating 
  # the filenames separated by / as values
  rec[$0] = rec[$0] ? rec[$0] "/" FILENAME : FILENAME
  }


2 answers

Answer 1
2, accepted, 2015-10-14 23:20:33

Answer 2
1, 2015-10-14 23:04:06
