简体   繁体   中英

split a large string into substrings

I have this file:

>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGCCCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACC
>second
CGGTAAT

My expected output is this:

>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGC
CCCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAACC
>second
CGGTAAT

Explanation: If (the line starts with '>') print; else if length of the string is greater than 60, split the string in substrings of 60.

My idea is something like this in awk, but also bash solutions are welcome:

gawk '/^>/ {print;next;} {len=length; if(len>60){DO SOMETHING HERE (LOOP?)} else {print}}'

Any help will be really appreciated! Thanks

You can use built in fold utility in a BASH loop:

while read -r f; do
    [[ "$f" == '>'* ]] && echo "$f" || echo "$f" | fold -w 60
done < file

Using awk you can do:

$ awk '!/^>/&&length($0)%60{gsub(/.{60}/,"&\n")}1' file
>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGC
CCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAACC
>second
CGGTAAT

Note: If you are using GNU awk v3.x then add --re-interval ( awk --re-interval '..' file ). For GNU awk v4 or later as well as BSD awk it is not required.

What about this awk ?

awk -v FS= 
    '{for (i=0;i<=NF/60;i++) {
          for (j=1;j<=60;j++)
               printf "%s", $(i*60 +j)
          print ""
          }
     }' file

See output:

$ awk -v FS= '{for (i=0;i<=NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file
>first
GTGAAGTGCGGCACCCCGTAGGTCAGACAAGGCGGTCACGCCGCATCCGACATCCAACGC
CCGAGCCGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAACC
>second
CGGTAAT

You can make explicit the > condition with:

awk -v FS= '/^>/ {print; next} {for (i=0;i<=NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file

Explanation

  • -v FS= set field separator to nothing, so that every single character will be a field.
  • '/^>/ {print; next} '/^>/ {print; next} if the line starts with > , print it and go to the next line.
  • {for (i=0;i<=NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}} {for (i=0;i<=NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}} on the rest of the cases, loop in blocks of 60 characters, printing all of them and then a new line, until the end of line is reached.

避免完全分开行,仅手动进行子字符串打印。

awk -v FS='\n' '!/^>/ {for (i=0; i<(length($0)/60); i++) {print substr($0, i*60, 60)}; next}7'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM