Concatenate selected rows into one row without spaces in R (using a for loop)

I'm trying to concatenate multiple rows into one.

Each row either starts with ">" followed by a gene identifier, or contains sequence information:

>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA

>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT

I show only two genes here, but hundreds more follow. I want to leave each gene identifier row as it is and concatenate the sequence rows wherever a sequence is split across multiple rows.

The final result should look like this, with each sequence combined into a single row and no spaces in between:

>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA

>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT

Using the paste function in R, I was able to do this manually, e.g.
paste(dat[2,1], dat[3,1], sep="")

However, I have a list of hundreds of genes, so I need a way to concatenate the rows automatically.

I was thinking of a for loop: if a row starts with ">", skip it; if it does not start with ">", concatenate it with the neighbouring sequence rows.

But I'm not an expert in bioinformatics or R, so I find it hard to write a script that does this.

Any help would be greatly appreciated!

When I pasted this into the answer box, the data lines got joined together, but they were separate lines in my R session, so this should work:

Lines <- 
readLines(textConnection(">*>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA*

>*>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT*
"))

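# TRUE for header lines (they contain "|"), FALSE for sequence and blank lines
geneIdx <- grepl("\\|", Lines)
# running count of headers: every line gets the number of the gene it belongs to
grp <- cumsum(geneIdx)
grp
#[1] 1 1 1 2 2 2

# per gene: keep the header, paste all remaining lines together without a separator
tapply(Lines, grp, FUN=function(x) c(x[1], paste(x[-1], collapse="")))
#----------------------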
$`1`
[1] ">*>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714"                      
[2] "GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA*"

$`2`
[1] ">*>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909"                                                                                                                                                                                                                                                        
[2] "GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT*"
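
Since the question asked about a for loop, here is a rough sketch of the same grouping logic written as an explicit loop that returns a plain character vector (header line, then one joined sequence line, per gene). It assumes every header line starts with ">" and that everything up to the next header is sequence. The function name join_fasta_loop, the toy data and the file names in the last comment are made up for illustration.

```r
# Hypothetical helper: explicit-loop version of the grouping idea.
join_fasta_loop <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]            # drop blank lines
  out <- character(length(lines))          # preallocate the upper bound
  n <- 0                                   # number of output lines used so far
  prev_was_seq <- FALSE
  for (line in lines) {
    if (startsWith(line, ">")) {           # assumption: headers begin with ">"
      n <- n + 1
      out[n] <- line                       # copy the header unchanged
      prev_was_seq <- FALSE
    } else if (prev_was_seq) {
      out[n] <- paste0(out[n], line)       # extend the current sequence line
    } else {
      n <- n + 1
      out[n] <- line                       # first sequence line after a header
      prev_was_seq <- TRUE
    }
  }
  out[seq_len(n)]
}

# Toy input (not real data): two genes with wrapped sequences
example <- c(">geneA|id1", "ACGT", "TTGA", "", ">geneB|id2", "GG", "CC", "AA")
res <- join_fasta_loop(example)
print(res)

# For a file on disk (placeholder file names):
# writeLines(join_fasta_loop(readLines("genes.fa")), "genes_joined.fa")
```

For long inputs the vectorised cumsum/tapply version is likely to be faster, because the loop rebuilds the growing sequence string on every iteration.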

Would regular expressions do the trick? The regular expression below deletes every newline (\n, written "\\n" in an R string) that is not followed by > ((?!>) is a negative lookahead).

text <- ">Zfyve21|ENSMUSG00000021286|ENSMUST00000021714
GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTC
AGGCGGAGA

>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909
GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGC
CACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGA
ATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGC
GGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCA
CATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT"

cat(text)

cat(gsub("\\n(?!>)", "", text, perl=TRUE))

Result

>Zfyve21|ENSMUSG00000021286|ENSMUST00000021714GCGGGCGGGGCGGGGTGGCGCCTTGTGTGGGCTCAGCGCGGGCGGTGGCGTGAGGGGCTCAGGCGGAGA
>Laptm4a|ENSMUSG00000020585|ENSMUST00000020909GCAGTGACAAAGACAACGTGGCGAAAGACAGCGCCAAAAATCTCCGTGCCCGCTGTCTGCCACCAACTCCGTCTTGTTTCACCCTTCTCCTCCTTGCGGAGCTCGTCTGGGAGACGGTGAATTACCGAGTTACCCTCAATTCCTACAGCCCCCGACAGCGAGCCCAGCCACGCGCACCGCGGTCAAACAGCGCCGGAGAGAGTTGAACTTTTGATTGGGCGTGATCTGTTTCAATCTCCACATCTTCTCCAATCAGAAGCCAGGTAGCCCGGCCTTCCGCTCTTCGTTGGTCTGT
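
Note that in the result above the newline after each header is removed as well, so the identifier runs straight into the sequence. If the header should stay on its own line, one possible variant (a sketch, not a tested drop-in) is to let the pattern match a whole header line including its newline first and put it back via \\1, and only otherwise delete a newline that is not followed by >. It assumes headers start with ">" and that ">" never occurs inside a sequence line. The function name join_keep_header and the toy string are hypothetical.

```r
# Hypothetical helper: join wrapped sequence lines but keep each header line intact.
join_keep_header <- function(text) {
  # Alternative 1: a header line plus its newline, captured and re-inserted as \\1.
  # Alternative 2: a newline not followed by ">", replaced by the empty group \\1.
  gsub("(>[^\n]*\n)|\n(?!>)", "\\1", text, perl = TRUE)
}

# Toy input (not real data)
toy <- ">g1|a\nAC\nGT\n\n>g2|b\nTT\nAA"
out <- join_keep_header(toy)
cat(out, "\n")
```

The blank line between records is consumed too, because its first newline is followed by another newline rather than by >.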
