awk: merging lines based on matches and mismatches across multiple fields

Question

We have a CSV file:

targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag 
51, cpd-7788990 ,9999,0,  IC50  ,,10, uM ,  2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM ,  2006-08-01 00:00:00  ,  Enzyme ,
51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,

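Our ultimate goal: when "cpd_number" ($2) is the same but "discipline" ($10) differs, merge each line whose "discipline" ($10) is "Cell" with each line whose "discipline" ($10) is not "Cell". (There are only 3 possible values for "discipline": Biochemical, Cell, Enzyme.) The ideal output is shown below.
(Note) The new "result_value" ($7) = the "result_value" ($7) of the line whose "discipline" ($10) is "Cell", divided by the "result_value" ($7) of the line whose "discipline" ($10) is "Biochemical" or "Enzyme".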

targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator,   result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline 
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
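
The result_value column in the ideal output above follows directly from the division rule. As a quick check of that arithmetic (a minimal sketch; the variable names `cell` and `ratios` are only illustrative):

```shell
# Cell result_value (100) divided by each non-Cell result_value from the sample rows
ratios=$(awk 'BEGIN {
    cell = 100          # result_value of the Cell row
    print cell / 10     # Biochemical row has result_value 10
    print cell / 5      # Enzyme row has result_value 5
}')
printf '%s\n' "$ratios"
```

This prints 10 and 20, the values in the two merged rows.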

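Doing this in one step looks complicated, so I first tried to merge whole lines: when "cpd_number" ($2) is the same but "discipline" ($10) differs, append the line whose "discipline" ($10) is "Cell" to each line whose "discipline" ($10) is not "Cell". After the merge, awk can be used to clean up and rename the headers. Could any guru suggest how to write such a one-liner? This is only a toy example. The real CSV file is huge, so anchoring on /^51/ is probably not ideal. Thanks!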

targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag, targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag 
51, cpd-7788990 ,9999,0,  IC50  ,,10, uM ,  2006-07-01 00:00:00 , Biochemical , 51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM ,  2006-08-01 00:00:00  ,  Enzyme , 51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,
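
One possible way to produce the intermediate whole-line merge shown above is a two-pass awk run over the same file. This is only a sketch under some assumptions: the first line is the header, "discipline" is always the 10th field, and the temporary file and the names `csv`, `merged`, `cell` and `n` are illustrative and not part of the original post.

```shell
# Sample rows copied from the question; the temp file stands in for the real CSV.
csv=$(mktemp)
cat > "$csv" <<'EOF'
targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0,  IC50  ,,10, uM ,  2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM ,  2006-08-01 00:00:00  ,  Enzyme ,
51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,
EOF

merged=$(awk '
BEGIN { FS = " *, *" }               # split on commas, ignoring the spaces around them
NR == FNR {                          # pass 1: remember every Cell line per cpd_number
    if ($10 == "Cell") {
        n[$2]++
        cell[$2, n[$2]] = $0
    }
    next
}
FNR == 1 { print $0 ", " $0; next }  # pass 2: duplicate the header line
$10 != "Cell" {                      # append each matching Cell line to a non-Cell line
    for (i = 1; i <= n[$2]; i++)
        print $0 " " cell[$2, i]
}
' "$csv" "$csv")
printf '%s\n' "$merged"
rm -f "$csv"
```

Because no field is assigned to, `$0` keeps the original spacing, so each output row is simply the non-Cell line followed by the Cell line. Non-Cell rows with no Cell counterpart produce no output.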

An additional example:

targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0,  IC50  ,,10, uM ,  2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM ,  2006-08-01 00:00:00  ,  Enzyme ,
51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,
51, cpd-7788990 ,8888,9999,  IC50  ,,200, uM ,  2006-09-01 00:00:00  , Cell ,

Output:

targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator,   result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline 
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
51,cpd-7788990,8888,9999, IC50 ,9999,0,IC50,,20,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,8888,9999, IC50 ,4444,5555,Ki,>,40,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme

Answer 1

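Here is an awk script based on your sample input and the final desired output. Feel free to tweak it to suit your needs; it should be enough to get you started. It makes two passes over your CSV file. In the first pass, it builds arrays keyed on the second column (together with columns 3 and 4) for the rows whose discipline is Cell. In the second pass, it formats the merged lines. Since you haven't said what to do with lines that have no matching Cell row, the solution below ignores them.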

Contents of script.awk:

BEGIN { 
    FS  = " *, *"                             # Set input field sep to this regex
    OFS = ","                                 # Set output field sep to comma
}

NR==FNR {                                     # In the first pass to the file
    if ($(NF-1) == "Cell") {                  # If the second last field is Cell
        flds[$2,$3,$4] = $3 OFS $4 OFS $5;    # Create an array to store col 3,4 and 5 separated by comma
        date[$2,$3,$4] = $9                   # Store date
        result[$2,$3,$4] = $7                 # Store col 7
    }
    next                                      # Move to the next record
} 

{                                             # For the second pass to the file
    for (fld in flds) {                       # For every entry in our array
        split (fld, tmp, SUBSEP);             # Split the composite key
        if ($(NF-1) != "Cell" && tmp[1] == $2) {   # If the second last field is not Cell and first piece of key is same as col 2
            line = $0                         # Save the current line in a variable
            $3 = flds[fld] OFS $3             # modify col3 to put the value from array in front of col3
            $7 = result[fld] / $7             # Calculate the new result value
            $9 = date[fld] OFS $9             # Add the date
            $(NF-1) = "Cell" OFS $(NF-1)      # Place the Cell text
            NF--                              # Remove the last field
            print                             # print the line
            $0 = line                         # Swap the modified line back
        }
    }
}

$(NF-1) == "Cell" {                           # If the second last field is Cell don't print it
    next
}

Run it like this:

$ awk -f script.awk file file
51,cpd-7788990,1212,2323,IC50,9999,0,IC50,,10,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,8888,9999,IC50,9999,0,IC50,,20,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,1212,2323,IC50,4444,5555,Ki,>,20,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme
51,cpd-7788990,8888,9999,IC50,4444,5555,Ki,>,40,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme

You can add a print statement for the header to the BEGIN block.
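
As a minimal sketch of that header suggestion, the BEGIN block of script.awk could look roughly like this. The header text is taken from the desired output in the question, with the spacing normalized to match the comma-only output rows (an assumption); the shell variable `header` is only there to show the result.

```shell
# BEGIN runs before any input is read, so the header is printed exactly once.
header=$(awk 'BEGIN {
    FS  = " *, *"     # same separators as in script.awk
    OFS = ","
    print "targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,type_desc,assay_id,alt_assay_id,type_desc,Operator,result_value,unit_value,Cell_experiment_date,experiment_date,Cell_discipline,discipline"
}')
printf '%s\n' "$header"
```

The header has 15 columns, the same number as each merged row the script prints.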

Answer 1 (score 2, accepted), 2014-07-01 02:49:17