Print the Search pattern in awk
I want to print the lines that match a search pattern and then compute an average over those lines. An example shows it best:
Input file:
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_02 282
chr17 41275978 41276294 BRCA1_ex02_03 283
chr17 41275978 41276294 BRCA1_ex02_03 284
chr17 41275978 41276294 BRCA1_ex02_03 285
chr17 41275978 41276294 BRCA1_ex02_04 286
chr17 41275978 41276294 BRCA1_ex02_04 287
chr17 41275978 41276294 BRCA1_ex02_04 288
In a bash loop, I want to extract the lines that share the same fourth column:
Output 1:
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
Output 2:
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_02 282
Output 3:
chr17 41275978 41276294 BRCA1_ex02_03 283
chr17 41275978 41276294 BRCA1_ex02_03 284
chr17 41275978 41276294 BRCA1_ex02_03 285
And so on. After that, computing the average of column 5 is easy:
awk '{sum += $5} END {if (NR > 0) print sum / NR}' in_file.txt
In my case there are thousands of BRCA1_exXX_XX lines, so does anyone have an idea how to split them?
Paul
I think this will do what you want.
awk '{
    # Keep a running sum of the fifth column, keyed by the fourth column.
    v[$4] += $5
    # Count the lines seen for each fourth-column value.
    n[$4]++
}
END {
    # For each fourth-column value seen, print it with the average of its fifth column.
    for (val in n) {
        print val ": " v[val] / n[val]
    }
}' "$file"
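One detail about the loop above: awk does not define the order in which `for (val in n)` visits the keys, so the groups may come out in any order. The following is a minimal sketch, not part of the original answer, that wraps the same idea in a shell function and pipes the result through `sort` for a stable order. The function name `group_avg` is hypothetical and chosen only for illustration.

```shell
# group_avg FILE
# Print "<4th column>: <average of 5th column>" for each distinct
# 4th-column value in FILE, sorted by the 4th column.
group_avg() {
    awk '
        NF >= 5 {                 # skip blank or short lines
            sum[$4] += $5         # running sum per group
            cnt[$4]++             # line count per group
        }
        END {
            for (key in cnt)      # iteration order is unspecified
                printf "%s: %g\n", key, sum[key] / cnt[key]
        }
    ' "$1" | sort                 # sort to get a stable order
}
```

On the sample input from the question, `group_avg in_file.txt` should print:

BRCA1_ex02_01: 279
BRCA1_ex02_02: 281.5
BRCA1_ex02_03: 284
BRCA1_ex02_04: 287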
Assuming the entries are sorted by the fourth column, as they are in the given data, it can be done like this:
awk '
$4 != prev {                    # 4th column differs from the previous line
    if (cnt > 0)                # a previous group exists
        print prev, sum / cnt   # print the average of that group
    prev = $4                   # remember the current 4th column
    sum = $5                    # start the sum with column 5
    cnt = 1                     # start the line count at 1
    next                        # go to the next line
}
{
    sum += $5                   # accumulate the 5th column
    ++cnt                       # count the line
}
END {
    if (cnt > 0)                # avoid dividing by zero on an empty file
        print prev, sum / cnt   # print the average for the last group
}
' file
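If the separate per-group outputs from the question are wanted as actual files, awk can also write each line to a file named after its fourth column. The following is a rough sketch under the same assumption as above (input already grouped by the fourth column); the function name `split_by_col4` and the `.txt` naming scheme are hypothetical. Closing each file once its group ends keeps the number of open files small even with thousands of groups.

```shell
# split_by_col4 INFILE OUTDIR
# Write the lines of INFILE into OUTDIR/<4th column>.txt, one file per group.
# Assumes INFILE is grouped by the 4th column; an ungrouped file would
# reopen a name with ">" and overwrite what was written earlier.
split_by_col4() {
    mkdir -p "$2"
    awk -v dir="$2" '
        NF < 4 { next }                 # skip blank or short lines
        $4 != prev {                    # a new group starts here
            if (out != "") close(out)   # finished with the previous file
            prev = $4
            out = dir "/" $4 ".txt"
        }
        { print > out }                 # append the line to the current group file
    ' "$1"
}

# Example usage (file and directory names are placeholders):
#   split_by_col4 in_file.txt groups
#   for f in groups/*.txt; do
#       awk '{ s += $5 } END { if (NR > 0) print FILENAME, s / NR }' "$f"
#   done
```

The single-pass answers above avoid the intermediate files entirely, so splitting is mainly useful when the per-group files are needed for other steps.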