Split a large string into substrings
I have this file:
>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGCCCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACC
>second
CGGTAAT
My expected output is this:
>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGC
CCCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAACC
>second
CGGTAAT
Explanation: if the line starts with '>', print it; otherwise, if the string is longer than 60 characters, split it into substrings of 60 characters.
My idea is something like this in awk, but bash solutions are also welcome:
gawk '/^>/ {print;next;} {len=length; if(len>60){DO SOMETHING HERE (LOOP?)} else {print}}'
Any help will be really appreciated! Thanks
You can use the standard fold utility in a Bash loop:
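while IFS= read -r f; do
    # header lines pass through unchanged; sequence lines are wrapped at 60 columns
    if [[ "$f" == '>'* ]]; then printf '%s\n' "$f"; else printf '%s\n' "$f" | fold -w 60; fi
done < file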
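If the header lines are known to be no longer than 60 characters, the loop may not be needed at all, because fold leaves short lines untouched. A minimal sketch under that assumption (the function name wrap_fasta and the sample input are made up for illustration):

```shell
# wrap_fasta: wrap every line longer than 60 characters.
# Assumes header lines (starting with '>') are at most 60 characters long;
# otherwise fold would wrap them as well.
wrap_fasta() {
    fold -w 60 "$@"
}

# Example: a 130-character sequence becomes lines of 60, 60 and 10 characters.
printf '>id\n%s\n' "$(printf 'ACGTACGTAC%.0s' 1 2 3 4 5 6 7 8 9 10 11 12 13)" | wrap_fasta
```

With a file on disk, this would be called as wrap_fasta file.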
Using awk you can do:
$ awk '!/^>/{gsub(/.{60}/,"&\n"); sub(/\n$/,"")}1' file
>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGC
CCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAACC
>second
CGGTAAT
Note: If you are using GNU awk v3.x, add --re-interval (awk --re-interval '..' file). For GNU awk v4 or later, as well as BSD awk, it is not required.
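To see what the gsub call does, it may help to run it on a short string with a smaller width. In the replacement, & stands for the matched text, so every full block gets a newline appended. When the length is an exact multiple of the width, the last block gets one too, which is why a trailing newline has to be guarded against. The width of 5, the function name split5 and the input strings are made up for this sketch:

```shell
# Insert a newline after every full block of 5 characters.
# "....." is .{5} written out, so no interval-expression support is needed;
# "&" in the replacement is the text that was matched.
split5() {
    awk '{ gsub(/...../, "&\n"); sub(/\n$/, "") } 1'
}

echo ABCDEFGHIJKL | split5    # 12 characters: blocks of 5, 5 and 2
echo ABCDEFGHIJ   | split5    # 10 characters: exactly two blocks, no blank line
```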
What about this awk?
awk -v FS= '{for (i=0;i<NF/60;i++) {
               for (j=1;j<=60;j++)
                   printf "%s", $(i*60 +j)
               print ""
            }
           }' file
See output:
$ awk -v FS= '{for (i=0;i<NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file
>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGC
CCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAACC
>second
CGGTAAT
You can make the > condition explicit with:
awk -v FS= '/^>/ {print; next} {for (i=0;i<NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file
-v FS= sets the field separator to the empty string, so that every single character becomes a field.
/^>/ {print; next}: if the line starts with >, print it and go to the next line.
The nested for loops: for all other lines, loop over blocks of 60 characters, printing each block followed by a newline, until the end of the line is reached.
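A quick way to check the effect of the empty field separator is to print NF and a single field for a short input. Splitting on an empty FS works in GNU awk and mawk, but POSIX leaves it unspecified, so this sketch assumes one of those implementations (the function name count_fields is made up here):

```shell
# With FS set to the empty string, each character is its own field.
# Assumes GNU awk or mawk; POSIX leaves an empty FS unspecified.
count_fields() {
    awk -v FS= '{ print NF, $2 }'
}

# Prints the number of fields and the second character of the input.
echo ACGT | count_fields
```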
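Avoid splitting the line into fields at all and just print the substrings manually. (The trailing 7 is simply a true condition, like the more common 1, so header lines are printed unchanged.)
awk -v FS='\n' '!/^>/ {for (i=0; i<(length($0)/60); i++) {print substr($0, i*60+1, 60)}; next}7' file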
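The same substr idea can be written with the width passed in as a variable, which makes it easy to try on short input. This is only a sketch: the variable name w and the function name wrap_seq are made up here, and it relies on substr positions being counted from 1 in awk.

```shell
# wrap_seq WIDTH: wrap non-header lines read from stdin at WIDTH characters.
wrap_seq() {
    awk -v w="$1" '
        /^>/ { print; next }    # header lines pass through unchanged
        {
            # substr positions start at 1, so chunk i begins at i*w + 1
            for (i = 0; i * w < length($0); i++)
                print substr($0, i * w + 1, w)
        }'
}

# Example with a width of 5 instead of 60.
printf '>id\nABCDEFGHIJKL\n' | wrap_seq 5
```

For the file in the question, the call would be wrap_seq 60 < file.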