How to get all the lines in between partial matches using awk or sed?

My file looks something like this:

>Cluster 0
0   58aa, >5XX8A... at 91.38%
1   58aa, >3LDMA... at 100.00%
2   58aa, >3BTHI... at 96.55%
3   65aa, >1F7ZI... *
4   58aa, >3LDJA... at 100.00%
>Cluster 1
0   57aa, >1ZJDB... at 94.74%
1   58aa, >1AAPA... at 91.38%
2   56aa, >5NX1D... at 92.86%
>Cluster 2
0   60aa, >4ISLB... at 98.33%
1   62aa, >4ISNB... at 95.16%
>Cluster 3
0   59aa, >3BYBA... *
1   59aa, >5ZJ3A... at 100.00%
2   59aa, >3UIRC... at 100.00%
3   57aa, >3D65I... at 100.00%

How can I use sed or awk to get the IDs after > (for example: 5XX8A) from the lines between the ">Cluster" headers? I want to be able to save them separately, in different files, one file per cluster. Alternatively, something more parsable would do, such as a single file with each ID right next to its cluster number.

As a first approach, I tried something like:

sed -n '/^\>/,/^\>/p' filename 

but that returns the whole file.

awk to the rescue!

$ awk '/^>Cluster /{close(f); f="Cluster."$2; next} {sub(/>/,"",$3); print $3 > f}' file
  
$ head Cluster*
==> Cluster.0 <==
5XX8A...
3LDMA...
3BTHI...
1F7ZI...
3LDJA...

==> Cluster.1 <==
1ZJDB...
1AAPA...
5NX1D...

==> Cluster.2 <==
4ISLB...
4ISNB...

==> Cluster.3 <==
3BYBA...
5ZJ3A...
3UIRC...
3D65I...
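
The question also asks for a single, easily parsable file with each ID next to its cluster number, and its example ID (5XX8A) has no trailing dots. A minimal sketch of that variant follows; the file names clusters.txt and ids_by_cluster.tsv and the tab-separated layout are assumptions made for illustration, not part of the answer above.

```shell
# Sample input: a few lines from the question's file.
cat > clusters.txt <<'EOF'
>Cluster 0
0   58aa, >5XX8A... at 91.38%
1   58aa, >3LDMA... at 100.00%
>Cluster 1
0   57aa, >1ZJDB... at 94.74%
EOF

awk '
# Header line: remember the cluster number and skip to the next line.
/^>Cluster /{ cluster = $2; next }
{
  id = $3
  sub(/^>/, "", id)     # drop the leading ">"
  sub(/\.+$/, "", id)   # drop the trailing dots
  print cluster "\t" id
}' clusters.txt > ids_by_cluster.tsv

cat ids_by_cluster.tsv
# 0	5XX8A
# 0	3LDMA
# 1	1ZJDB
```

Keeping the cluster number in the first column means the result can later be split or grouped with any tool that reads tab-separated data.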

This might work for you (GNU sed):

sed -En '/^>(Cluster) /{s//>\1./;:a;x;s/\n(.*)/ echo "\1"/e;x;h;d};s/.*>//;s/ .*//;H;$!d;ba' file

Gather each cluster's IDs in the hold space. When the next cluster header (or the end of the file) is reached, the evaluation flag (e) on the substitution command runs an echo that writes the collection to the file named by the first line of the collection.
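
To make the steps easier to follow, here is the same GNU sed script spread over several lines with comments. It is a sketch for reading purposes: the commands are unchanged from the one-liner, while the sample file clusters.txt is an assumption made for illustration.

```shell
# Sample input: the first lines of two clusters from the question.
cat > clusters.txt <<'EOF'
>Cluster 0
0   58aa, >5XX8A... at 91.38%
1   58aa, >3LDMA... at 100.00%
>Cluster 1
0   57aa, >1ZJDB... at 94.74%
EOF

# Requires GNU sed (the e flag is a GNU extension).
sed -En '
# Cluster header: ">Cluster 0" becomes ">Cluster.0".
/^>(Cluster) /{
  s//>\1./
  :a
  # Fetch the previous collection (header plus IDs) from the hold space.
  x
  # Turn it into the command: >Cluster.N echo "IDs" and run it via the e flag.
  s/\n(.*)/ echo "\1"/e
  # Bring the new header back and start a fresh collection with it.
  x
  h
  d
}
# ID line: keep only the text between ">" and the next space.
s/.*>//
s/ .*//
# Append the ID to the collection.
H
# Not the last line: move on to the next line.
$!d
# Last line: jump to the label to flush the final collection.
ba
' clusters.txt

head Cluster.*
```

Because the collected text is handed to a shell, this approach is best kept to input whose IDs are known not to contain quotes or other shell metacharacters.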

Alternative method, using sed and piping to sh:

sed '/^>Cluster/{s/ /./;h;d};s/..*>//;s/ .*//;G;x;s/>*/>>/;x;s/\n/ /;s/\S*/echo "&"/' file|sh
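
Before piping anything to sh, it can help to look at the commands sed generates. The sketch below runs the same sed script without the final "| sh" and saves the result; the file names clusters.txt and generated_commands.sh are assumptions made for illustration, and GNU sed is assumed because of \S.

```shell
# Sample input: the first lines of two clusters from the question.
cat > clusters.txt <<'EOF'
>Cluster 0
0   58aa, >5XX8A... at 91.38%
1   58aa, >3LDMA... at 100.00%
>Cluster 1
0   57aa, >1ZJDB... at 94.74%
EOF

# Same sed script as above, minus "| sh": the commands are written out, not run.
sed '/^>Cluster/{s/ /./;h;d};s/..*>//;s/ .*//;G;x;s/>*/>>/;x;s/\n/ /;s/\S*/echo "&"/' clusters.txt > generated_commands.sh

cat generated_commands.sh
# echo "5XX8A..." >Cluster.0
# echo "3LDMA..." >>Cluster.0
# echo "1ZJDB..." >Cluster.1
```

The first ID of each cluster is written with > (which truncates any existing file) and the following ones with >> (which appends), so rerunning the pipeline does not duplicate lines.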

Alternative method, using sed and csplit:

sed 's/^..*>//;s/ .*//' file | csplit -szf Cluster -b '.%d' --suppress-matched - '/>Cluster/' '{*}'

Manipulate the file into the desired format using sed and then split it into separate files using csplit.

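NB: This may not reproduce the filenames faithfully. csplit numbers its output files sequentially from 0, so the names match the cluster numbers only when the clusters themselves are numbered consecutively from 0.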
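
The filename caveat can be seen with a small hypothetical input whose cluster numbers are not consecutive. The file gappy.txt and its cluster number 5 are made up for illustration; GNU csplit is assumed.

```shell
# Hypothetical input in which the cluster numbers are 0 and 5.
cat > gappy.txt <<'EOF'
>Cluster 0
0   58aa, >5XX8A... at 91.38%
>Cluster 5
0   57aa, >1ZJDB... at 94.74%
EOF

# Same pipeline as above, applied to the hypothetical file.
sed 's/^..*>//;s/ .*//' gappy.txt | csplit -szf Cluster -b '.%d' --suppress-matched - '/>Cluster/' '{*}'

# The second cluster should land in Cluster.1, not Cluster.5.
ls Cluster.*
```

If the original cluster numbers matter, the awk approach, which builds each filename from the header line itself, is the safer choice.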
