
Split a large string into substrings

I have this file:

>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGCCCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACC
>second
CGGTAAT

My expected output is this:

>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGC
CCCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAACC
>second
CGGTAAT

Explanation: if the line starts with '>', print it unchanged; otherwise, if the string is longer than 60 characters, split it into substrings of 60 characters each.

My idea is something like this in awk, but bash solutions are also welcome:

gawk '/^>/ {print;next;} {len=length; if(len>60){DO SOMETHING HERE (LOOP?)} else {print}}'

Any help will be really appreciated! Thanks

You can use the standard fold utility in a Bash loop:

while IFS= read -r f; do
    # Print header lines as-is; wrap everything else at 60 columns
    if [[ $f == '>'* ]]; then printf '%s\n' "$f"; else printf '%s\n' "$f" | fold -w 60; fi
done < file
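
If none of the header lines is longer than 60 characters, the loop may not be needed at all, since fold leaves lines that already fit within the width untouched. A minimal sketch under that assumption (wrap_with_fold is a made-up helper name, and the optional width argument is my own addition):

```shell
# Hypothetical helper: wrap_with_fold FILE [WIDTH]
# Assumes every ">" header line is at most WIDTH columns long,
# otherwise the header itself would be wrapped too.
wrap_with_fold() {
    fold -w "${2:-60}" < "$1"
}
```

Called as wrap_with_fold file, this should behave like the loop above for files with short headers, without starting one fold process per line.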

Using awk you can do:

$ awk '!/^>/{gsub(/.{60}/,"&\n"); sub(/\n$/,"")}1' file
>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGC
CCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAACC
>second
CGGTAAT

Note: If you are using GNU awk v3.x, add --re-interval (awk --re-interval '..' file). GNU awk v4 or later and BSD awk do not require it.

What about this awk?

awk -v FS= '{for (i=0;i<NF/60;i++) {
          for (j=1;j<=60;j++)
               printf "%s", $(i*60 +j)
          print ""
          }
     }' file

See output:

$ awk -v FS= '{for (i=0;i<NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file
>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGC
CCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAACC
>second
CGGTAAT

You can make the > condition explicit with:

awk -v FS= '/^>/ {print; next} {for (i=0;i<NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file

Explanation

  • -v FS= sets the field separator to the empty string, so that every single character becomes a field (gawk and mawk behave this way; POSIX leaves an empty FS unspecified).
  • /^>/ {print; next}: if the line starts with >, print it and go to the next line.
  • {for (i=0;i<NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}: in all other cases, loop over blocks of 60 characters, printing each block followed by a newline, until the end of the line is reached. The strict bound i<NF/60 ensures that a line whose length is an exact multiple of 60 is not followed by an extra empty line.

Avoid splitting the line into fields at all and just print the substrings manually.

awk -v FS='\n' '!/^>/ {for (i=0; i<length($0)/60; i++) print substr($0, i*60+1, 60); next} 1' file
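
The same substr idea can be wrapped in a small shell function with a configurable width. This is only a sketch: wrap_fasta and its optional width argument are hypothetical names of mine, not something from the answers in this thread. It steps through start positions directly (awk's substr counts from 1), so a line whose length is an exact multiple of the width should not produce a trailing empty line, and blank lines are passed through unchanged.

```shell
# Hypothetical helper: wrap_fasta FILE [WIDTH]   (WIDTH defaults to 60)
wrap_fasta() {
    awk -v w="${2:-60}" '
        # Header lines and blank lines are printed unchanged
        /^>/ || $0 == "" { print; next }
        {
            # substr() positions start at 1, so chunks begin at 1, 1+w, 1+2w, ...
            for (start = 1; start <= length($0); start += w)
                print substr($0, start, w)
        }' "$1"
}
```

For example, wrap_fasta file wraps at 60 characters and wrap_fasta file 80 wraps at 80.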


 