Split a hyper-separated column with a differing number of entries from a prokka gff table into new columns with NA (splitstackshape / R)

Question

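I have a file that contains both tab-separated and semicolon-separated data (a prokka annotation file in .gff format). Unfortunately, the semicolon-separated part does not have a consistent number of entries.

Fortunately, the leading part of each semicolon-separated entry (e.g. ID= or gene=) is consistent. I would like to prepare this data as input for R (or within R) without differing column counts or empty fields. These are the first lines of the prokka file, with some columns removed: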

A1  contig_10   16  192 ID=PROKKA_00004;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00004;product=hypothetical protein
A1  contig_100  147 353 ID=PROKKA_00036;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00036;product=hypothetical protein
A1  contig_1000 60  434 ID=PROKKA_00892;inference=ab initio prediction:Prodigal:2.6,protein motif:Pfam:PF05893.8;locus_tag=PROKKA_00892;product=Acyl-CoA reductase (LuxC)
A1  contig_10000    132 434 ID=PROKKA_11822;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_11822;product=hypothetical protein
A1  contig_100003   368 784 ID=PROKKA_96005;gene=fusA_29;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A5VR09;locus_tag=PROKKA_96005;product=Elongation factor G
A1  contig_100026   38  355 ID=PROKKA_96016;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_96016;product=hypothetical protein
A1  contig_100027   38  493 ID=PROKKA_96018;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_96018;product=hypothetical protein
A1  contig_100028   121 1131    ID=PROKKA_96019;eC_number=3.1.-.-;gene=rnjA_3;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q45493;locus_tag=PROKKA_96019;product=Ribonuclease J 1
A1  contig_10003    1028    3307    ID=PROKKA_11824;eC_number=1.1.1.40;gene=maeB_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P76558;locus_tag=PROKKA_11824;product=NADP-dependent malic enzyme

The desired output is:

  V1            V2  V3  V4 eC_number    gene           ID                                                                 inference    locus_tag note                     product
1 A1     contig_10  16 192      <NA>    <NA> PROKKA_00004                                         ab initio prediction:Prodigal:2.6 PROKKA_00004 <NA>        hypothetical protein
2 A1    contig_100 147 353      <NA>    <NA> PROKKA_00036                                         ab initio prediction:Prodigal:2.6 PROKKA_00036 <NA>        hypothetical protein
3 A1   contig_1000  60 434      <NA>    <NA> PROKKA_00892            ab initio prediction:Prodigal:2.6,protein motif:Pfam:PF05893.8 PROKKA_00892 <NA>   Acyl-CoA reductase (LuxC)
4 A1  contig_10000 132 434      <NA>    <NA> PROKKA_11822                                         ab initio prediction:Prodigal:2.6 PROKKA_11822 <NA>        hypothetical protein
5 A1 contig_100003 368 784      <NA> fusA_29 PROKKA_96005 ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A5VR09 PROKKA_96005 <NA>         Elongation factor G

Answer 1

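One option is to combine the tidyverse with splitstackshape. First read the file with, say, read.table (with the argument sep = "\t"). Then split column V5 into separate columns with splitstackshape::cSplit. The data can then be reshaped to long format and processed.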

library(tidyverse)
library(splitstackshape)

# If the first 4 columns of "textdata" are separated by multiple spaces, read it as
df <- read.table(text = gsub("\\s{2,}","\t",textdata), stringsAsFactors = FALSE, sep = "\t")

# If the first 4 columns of "textdata" are separated by tabs, read it instead as
df <- read.table(text = textdata, stringsAsFactors = FALSE, sep = "\t")


# Now, process data (Based on feedback from `@crazysantaclaus`)
df %>% cSplit("V5", sep=";") %>%                  # one column per "key=value" entry
  gather(Key, value, -c(V1,V2,V3,V4)) %>%         # long format: one row per entry
  separate(value, c("Col","Value"), sep="=",
           extra = "merge") %>%                   # split at the first "=" only
  select(-Key) %>% 
  filter(!(is.na(Col) & is.na(Value))) %>%        # drop padding rows added by cSplit
  spread(Col, Value)                              # wide format: missing keys become NA

Result:

#     V1            V2  V3  V4     col1           col2                             col3   col4                                     col5    col6
#1    A1 something_101 789 910 STRING_2 string_integer string with whitespace and:colon STRING string with whitespace and special chars    <NA>
#2    A1 something_100 123 456 STRING_1           <NA> string with whitespace and:colon STRING string with whitespace and special chars  string
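
The same reshaping can also be sketched with the newer tidyr verbs, since gather() and spread() are superseded as of tidyr 1.0.0. This is a minimal sketch, not part of the original answer: it assumes tidyr >= 1.0.0, that the fields are tab-separated, and that no key occurs twice within one row (otherwise pivot_wider() would return list-columns). The names `key`, `value` and `wide` are chosen here for illustration only.

```r
library(tidyr)

# Two example rows in the same layout as the answer's sample data,
# written with explicit tab escapes between the first five fields
textdata <- paste(
  "A1\tsomething_100\t123\t456\tcol1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string",
  "A1\tsomething_101\t789\t910\tcol1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars",
  sep = "\n"
)

# quote = "" and comment.char = "" keep apostrophes and "#" in the text literal
df <- read.table(text = textdata, sep = "\t", quote = "", comment.char = "",
                 stringsAsFactors = FALSE)

wide <- df %>%
  separate_rows(V5, sep = ";") %>%                      # one row per key=value pair
  separate(V5, into = c("key", "value"), sep = "=",
           extra = "merge") %>%                         # split at the first "=" only
  pivot_wider(names_from = key, values_from = value)    # absent keys become NA

print(wide)
```

Compared with the cSplit route, this skips the intermediate wide table and the filtering of padding rows, because separate_rows() never creates empty entries in the first place.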

Data:

textdata <- "A1\tsomething_100\t123\t456\tcol1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string
A1\tsomething_101\t789\t910\tcol1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars"

Data 2:

A second data set. Here the first 4 columns are separated by multiple spaces instead of \t.

textdata <- "A1      something_100   123     456     col1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string
A1      something_101   789     910     col1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars"
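
The excerpt in the question had some columns removed. A complete Prokka .gff file typically starts with "##" header lines and ends with a "##FASTA" section holding the sequences, both of which would trip up read.table. The following is a minimal sketch of one way to read only the annotation rows. The miniature file content and the names `gff_path` and `gff` are made up for illustration; the nine column names follow the GFF3 column order.

```r
# Hypothetical miniature Prokka-style GFF, written to a temporary file
gff_path <- tempfile(fileext = ".gff")
writeLines(c(
  "##gff-version 3",
  "##sequence-region contig_10 1 500",
  "contig_10\tProdigal:2.6\tCDS\t16\t192\t.\t+\t0\tID=PROKKA_00004;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00004;product=hypothetical protein",
  "##FASTA",
  ">contig_10",
  "ATGCATGC"
), gff_path)

lines <- readLines(gff_path)

# Keep only the lines before the FASTA section, if there is one
fasta_start <- which(lines == "##FASTA")
if (length(fasta_start) > 0) {
  lines <- lines[seq_len(fasta_start[1] - 1)]
}

# Drop "#" header and directive lines
lines <- lines[!startsWith(lines, "#")]

# Parse the remaining tab-separated annotation rows
gff <- read.table(text = lines, sep = "\t", quote = "", comment.char = "",
                  stringsAsFactors = FALSE,
                  col.names = c("seqid", "source", "type", "start", "end",
                                "score", "strand", "phase", "attributes"))
```

With a table read this way, the semicolon-separated column to split is `attributes` rather than V5.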

Answer 1: 2 votes, accepted, 2018-06-04 18:18:27

從prokka gff表中將帶有不同條目數的超分隔列拆分為具有NA的新列（splitstackshape / R）

問題描述

1 個解決方案

解決方案1 2 已采納 2018-06-04 18:18:27

解決方案1
2 已采納 2018-06-04 18:18:27