How do I use awk/sed to merge a field across multiple rows based on matching column values?
awk merge lines based on multiple field matching/unmatching
We have a CSV:
targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme ,
51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
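Our ultimate goal: if "cpd_number" ($2) is the same but "discipline" ($10) differs, merge each row whose "discipline" ($10) is "Cell" with each row whose "discipline" is not "Cell". ("discipline" has only 3 options: Biochemical, Cell, Enzyme.) Below is the ideal output.
(Note) The new "result_value" ($7) = the "result_value" ($7) of the row whose "discipline" ($10) is "Cell", divided by the "result_value" ($7) of the row whose "discipline" ($10) is "Biochemical" or "Enzyme".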
targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator, result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
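To make the note on "result_value" concrete, here is a minimal sketch of the arithmetic alone, using the toy values from the sample rows. The function name `ratio` is made up for illustration and is not part of the question.

```shell
# ratio is a hypothetical helper: $1 = result_value of the Cell row,
# $2 = result_value of the Biochemical or Enzyme row.
ratio() {
    awk -v cell="$1" -v other="$2" 'BEGIN { print cell / other }'
}

ratio 100 10   # Cell 100 vs Biochemical 10 -> prints 10
ratio 100 5    # Cell 100 vs Enzyme 5       -> prints 20
```

These match the values 10 and 20 in the result_value column of the ideal output.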
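Doing it all in one go looks complicated, so I tried to merge the whole rows first: if "cpd_number" ($2) is the same but "discipline" ($10) differs, append the row whose "discipline" ($10) is "Cell" to each row whose "discipline" ($10) is not "Cell". After merging, we can use awk to clean up and rename the headers. Could any guru offer some ideas on how to write such a one-liner? This is just a toy example. The real csv file is huge, so anchoring on /^51/ may not be ideal. Thanks!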
targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag, targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical , 51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme , 51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
Additional example:
targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator, result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme ,
51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
51, cpd-7788990 ,8888,9999, IC50 ,,200, uM , 2006-09-01 00:00:00 , Cell ,
Output:
targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator, result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
51,cpd-7788990,8888,9999, IC50 ,9999,0,IC50,,20,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,8888,9999, IC50 ,4444,5555,Ki,>,40,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
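Here is an awk script based on your sample input and final desired output. Feel free to tweak it to suit your needs; it should be enough to get you started. It requires two passes over your csv file. In the first pass it builds arrays keyed on the second column (together with the two assay ID columns) for rows whose discipline is Cell, and in the second pass it formats the lines together. Since you haven't said what to do with rows that have no matching Cell row, the following solution ignores them.
Contents of script.awk: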
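For readers unfamiliar with the two-pass idiom: passing the same file twice makes awk read it twice, and `NR == FNR` is true only while the first copy is being read, because NR keeps counting across files while FNR restarts at 1 for each file. A small sketch to show this (the helper name `show_passes` and the two-line demo file are made up for illustration):

```shell
# show_passes is a hypothetical helper; $1 is any text file, read twice.
show_passes() {
    awk '{
        pass = (NR == FNR) ? "pass1" : "pass2"
        print pass, NR, FNR, $0
    }' "$1" "$1"
}

demo=$(mktemp)
printf 'a\nb\n' > "$demo"
show_passes "$demo"
# pass1 1 1 a
# pass1 2 2 b
# pass2 3 1 a
# pass2 4 2 b
```

One caveat: if the first file is empty, `NR == FNR` is also true while the second file is read, so the idiom assumes non-empty input.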
BEGIN {
    FS  = " *, *"                                  # Input field separator: a comma with optional spaces around it
    OFS = ","                                      # Output field separator: a plain comma
}
NR == FNR {                                        # First pass over the file
    if ($(NF-1) == "Cell") {                       # The second-to-last field (discipline) is Cell
        flds[$2,$3,$4]   = $3 OFS $4 OFS $5        # Store cols 3, 4 and 5 of the Cell row, comma-separated
        date[$2,$3,$4]   = $9                      # Store the Cell row's date
        result[$2,$3,$4] = $7                      # Store the Cell row's result_value
    }
    next                                           # Move to the next record
}
{                                                  # Second pass over the file
    for (fld in flds) {                            # For every Cell row stored (iteration order is unspecified)
        split(fld, tmp, SUBSEP)                    # Split the composite key
        if ($(NF-1) != "Cell" && tmp[1] == $2) {   # Row is not a Cell row and its cpd_number matches the key
            line = $0                              # Save the current line
            $3 = flds[fld] OFS $3                  # Put the Cell columns in front of col 3
            $7 = result[fld] / $7                  # New result_value: Cell result / this row's result
            $9 = date[fld] OFS $9                  # Put the Cell date in front of the date
            $(NF-1) = "Cell" OFS $(NF-1)           # Put the text Cell in front of the discipline
            NF--                                   # Drop the empty last field (works in gawk; not every awk rebuilds $0 here)
            print                                  # Print the merged line
            $0 = line                              # Restore the original line for the next key
        }
    }
}
$(NF-1) == "Cell" {                                # Second-to-last field is Cell: don't print the row
    next
}
Run it like this:
$ awk -f script.awk file file
51,cpd-7788990,1212,2323,IC50,9999,0,IC50,,10,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,8888,9999,IC50,9999,0,IC50,,20,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,1212,2323,IC50,4444,5555,Ki,>,20,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme
51,cpd-7788990,8888,9999,IC50,4444,5555,Ki,>,40,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme
You can include a print statement for the header in the BEGIN block.
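A minimal sketch of such a header statement, shown standalone so it can be run on its own. The column names are copied from the desired output in the question; the wrapper function `print_header` is made up for illustration. In script.awk the print line would go inside the existing BEGIN block, after OFS is set.

```shell
# print_header is a hypothetical wrapper around the BEGIN-block print.
print_header() {
    awk 'BEGIN {
        OFS = ","
        print "targetID", "cpd_number", "Cell_assay_id", "Cell_alt_assay_id", "type_desc", \
              "assay_id", "alt_assay_id", "type_desc", "Operator", "result_value", "unit_value", \
              "Cell_experiment_date", "experiment_date", "Cell_discipline", "discipline"
    }'
}

print_header
```

Because the items are separated by commas in the print statement, awk joins them with OFS, so the header uses the same delimiter as the data rows.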