How to extract certain lines from a text file with a regex

Question

I have this complicated text file with over 22,000 lines:

>Cluster 35
0   2856nt, >tru_clu8_1_inde2_or1... *
>Cluster 36
0   1179nt, >gl_isotig07707... *
1   914nt, >un_isotig04557... at +/94.20%
2   1282nt, >cp_isotig06284... at -/92.43%
3   1137nt, >cp_isotig02981... at -/93.84%
>Cluster 37
0   2835nt, >yl_JTQ_com670_c0_seq1... *
>Cluster 38
0   2275nt, >pb_iso00211... at +/93.93%
1   2647nt, >yl_JTQ_com323_c0_seq1... at +/91.39%

I want only the clusters that have a single hit:

>Cluster 35
0     2856nt, >tru_clu8_1_inde2_or1... *
>Cluster 37
0     2835nt, >yl_JTQ_com670_c0_seq1... *

Then, if possible, output them in this format:

>Cluster 35   tru_clu8_1_inde2_or1
>Cluster 37   yl_JTQ_com670_c0_seq1

Answer 1

$ awk 'NR>2{if(/^>/ && b ~ /^>/) print b"\n"a} {b=a ; a=$0}' infile.txt
>Cluster 35
0   2856nt, >tru_clu8_1_inde2_or1... *
>Cluster 37
0   2835nt, >yl_JTQ_com670_c0_seq1... *

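Edit:

However, this will not work if the final cluster has only one hit, because a cluster is printed only once the next header line has been read. The following workaround may help, and it also produces the formatted output. It appends a lone ">" line to the input file as a sentinel, and it relies on gensub, which is specific to GNU awk: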

$ echo ">" >> infile.txt
$ awk 'NR>2{if(/^>/ && b ~ /^>/) {a=gensub(/^.*>(\w+).*/,"\\1", "g", a) ; print b,a} } {b=a ; a=$0}' infile.txt
>Cluster 35 tru_clu8_1_inde2_or1
>Cluster 37 yl_JTQ_com670_c0_seq1
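
The sentinel line in the workaround above changes the input file. A minimal sketch of an alternative, assuming a POSIX awk: count the hit lines per cluster and emit the pending cluster both when a new header arrives and in an END block, so a final single-hit cluster is reported without editing the file. The function name single_hit_clusters and the awk helper emit are hypothetical, and the three-space separator simply mirrors the format requested in the question.

```shell
# Hypothetical helper: print ">Cluster N   name" for clusters with exactly one hit.
# Usage: single_hit_clusters infile.txt
single_hit_clusters() {
  awk '
    # Print the pending cluster if exactly one hit line was seen for it.
    function emit() {
      if (hits == 1) {
        name = hit
        sub(/^[^>]*>/, "", name)     # drop everything up to and including ">"
        sub(/\.\.\..*$/, "", name)   # drop the trailing "..." and what follows
        print header "   " name
      }
    }
    /^>/ { emit(); header = $0; hits = 0; next }   # a new cluster header starts
    NF   { hits++; hit = $0 }                      # count each non-empty hit line
    END  { emit() }                                # handle the final cluster
  ' "$1"
}
```

Because the pending cluster is also checked in END, this should give the same result whether or not the sentinel ">" line has already been appended to the file.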

Answer 2

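The following regex works for me. It needs a flavor that supports \R (for example PCRE), with multiline mode enabled so that ^ and $ match at line boundaries: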

^>.*\d\R.*$\R(\D)

You can check it online here.
