AWK: replace full string in TABLE2 according to TABLE1
I have TABLE1, whose first column is the string that should be replaced in TABLE2 and whose second column is the value that should replace it.
TABLE1 looks like this:
g63. MYL9
g5990. PTC7
g6018. POLYUBQ
g17850. NAA50
TABLE2 looks, for example, like this:
PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . g63.
PIZI01000001v1 AUGUSTUS intron 751969 752021 1 - . transcript_id "g63.t1"; gene_id "g63.
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . g630.
PIZI01000001v1 AUGUSTUS intron 16680415 16683083 0.35 + . transcript_id "g630.t1"; gene_id "g630.
PIZI01000001v1 AUGUSTUS gene 16695081 16703546 0.93 + . g631.
PIZI01000001v1 AUGUSTUS gene 16730752 16735366 0.65 + . g632.
PIZI01000008v1 AUGUSTUS gene 1943857 1944177 0.71 - . g6299.
So I assembled this awk command:
awk 'FNR==NR { array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' TABLE1 TABLE2
It works, except that the value MYL9 replaces not only the string g63. but also strings like g630, g631, g632 ... g6300 and so on. So the final table looks like this:
PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . MYL9
PIZI01000001v1 AUGUSTUS intron 751969 752021 1 - . transcript_id "MYL9"; gene_id "MYL9
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . MYL9
PIZI01000001v1 AUGUSTUS intron 16680415 16683083 0.35 + . transcript_id "MYL9t1"; gene_id "MYL9
PIZI01000001v1 AUGUSTUS gene 16695081 16703546 0.93 + . MYL9
PIZI01000001v1 AUGUSTUS gene 16730752 16735366 0.65 + . MYL9
PIZI01000008v1 AUGUSTUS gene 1943857 1944177 0.71 - . g6299.
I need it to edit just g63. and not others like g630. and so on.
I have spent quite a long time on this and now have to take a break, so if anybody has an idea what is wrong there, I would appreciate it. Thanks.
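A likely cause of the over-matching described in the question is that gsub treats its first argument as a regular expression, so the dot in g63. matches any character and the pattern also matches inside g630., g631. and so on. The snippet below is only a minimal illustration on a made-up one-line input; the variable names regex_out and exact_out are hypothetical.

```shell
# The key "g63." is used as a dynamic regex, so "." matches any character.
regex_out=$(printf 'g630.\n' | awk '{ gsub("g63.", "MYL9") } 1')
printf '%s\n' "$regex_out"

# Comparing the whole field as a string does not treat "." specially.
exact_out=$(printf 'g630.\n' | awk '{ if ($1 == "g63.") $1 = "MYL9" } 1')
printf '%s\n' "$exact_out"
```

The first command rewrites g630. even though only g63. was meant; the second leaves it alone because the string comparison requires the whole field to be equal to the key.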
Your example doesn't really illustrate the problem, but perhaps this is what you're hoping to achieve?
head table*
==> table1.txt <==
g63. MYL9
g25. PTC7
g6018. POLYUBQ
g17850. NAA50
==> table2.txt <==
PIZI01000001v1 AUGUSTUS transcript 1 6991 0.4 - . g25.t1
PIZI01000001v1 AUGUSTUS intron 1 3122 0.71 - . transcript_id "g25.t1"; gene_id "g25.";
PIZI01000001v1 AUGUSTUS CDS 3123 3304 0.76 - 2 transcript_id "g25.t1"; gene_id "g25.";
PIZI01000001v1 AUGUSTUS intron 3305 4460 1 - . transcript_id "g25.t1"; gene_id "g25.";
PIZI01000001v1 AUGUSTUS CDS 4461 4598 1 - 2 transcript_id "g25.t1"; gene_id "g25.";
PIZI01000001v1 AUGUSTUS intron 4599 5201 1 - . transcript_id "g25.t1"; gene_id "g25.";
PIZI01000001v1 AUGUSTUS CDS 5202 5342 1 - 2 transcript_id "g25.t1"; gene_id "g25.";
PIZI01000001v1 AUGUSTUS intron 5343 6978 0.54 - . transcript_id "g25.t1"; gene_id "g25.";
PIZI01000001v1 AUGUSTUS CDS 6979 6991 0.54 - 0 transcript_id "g25.t1";
awk 'NR==FNR{a[$1]=$2; next} NR>FNR{unchanged=$0; gsub(/\"/, ""); gsub(/\;/, ""); if($NF in a) {print unchanged, a[$NF]}}' table1.txt table2.txt
PIZI01000001v1 AUGUSTUS intron 1 3122 0.71 - . transcript_id "g25.t1"; gene_id "g25."; PTC7
PIZI01000001v1 AUGUSTUS CDS 3123 3304 0.76 - 2 transcript_id "g25.t1"; gene_id "g25."; PTC7
PIZI01000001v1 AUGUSTUS intron 3305 4460 1 - . transcript_id "g25.t1"; gene_id "g25."; PTC7
PIZI01000001v1 AUGUSTUS CDS 4461 4598 1 - 2 transcript_id "g25.t1"; gene_id "g25."; PTC7
PIZI01000001v1 AUGUSTUS intron 4599 5201 1 - . transcript_id "g25.t1"; gene_id "g25."; PTC7
PIZI01000001v1 AUGUSTUS CDS 5202 5342 1 - 2 transcript_id "g25.t1"; gene_id "g25."; PTC7
PIZI01000001v1 AUGUSTUS intron 5343 6978 0.54 - . transcript_id "g25.t1"; gene_id "g25."; PTC7
I may have misunderstood the problem, though; please edit your question if this doesn't solve your issue.
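If the goal is to replace the IDs in place rather than append the name to the line, one possible approach is to look each ID up as a literal array key instead of looping over all keys with gsub. The sketch below assumes that every ID has the form g<digits>. (as in the samples above); the temporary files and the names name, out, rest and result are illustrative, not taken from the original posts.

```shell
#!/bin/sh
# Sketch only: assumes every gene ID has the form g<digits>. as in the samples.
tmp=$(mktemp -d)

cat > "$tmp/table1.txt" <<'EOF'
g63. MYL9
g5990. PTC7
EOF

cat > "$tmp/table2.txt" <<'EOF'
PIZI01000001v1 AUGUSTUS gene 751753 768572 0.06 - . g63.
PIZI01000001v1 AUGUSTUS intron 751969 752021 1 - . transcript_id "g63.t1"; gene_id "g63.";
PIZI01000001v1 AUGUSTUS gene 16680331 16688019 0.25 + . g630.
PIZI01000008v1 AUGUSTUS gene 1943857 1944177 0.71 - . g6299.
EOF

result=$(awk '
    # First file: map each ID (column 1) to its replacement (column 2).
    NR == FNR { name[$1] = $2; next }
    # Second file: find each ID-shaped token and look it up as a literal string.
    {
        out = ""; rest = $0
        while (match(rest, /g[0-9]+\./)) {
            id = substr(rest, RSTART, RLENGTH)
            # Keep the text before the token, then the mapped name or the token itself.
            out = out substr(rest, 1, RSTART - 1) ((id in name) ? name[id] : id)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }
' "$tmp/table1.txt" "$tmp/table2.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this input the g63. entries become MYL9, while g630. and g6299. are left unchanged. Note that g63.t1 becomes MYL9t1, because the trailing dot is part of the key in TABLE1; if the dot should be kept, the replacement value would need to include it.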