
Split a data.frame column into other columns

I have a big data.frame with several columns, but my 9th column is made of data separated by semicolons:

    gtf$V9
1                 gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
2  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
3  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
4  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;

So I would like to split this column into other columns and later merge them with the rest of the data.frame (the columns before the 9th column).

I've tried some code without success:

head(gtf$V9, sep = ";",stringsAsFactors = FALSE) 

or

new_df <- matrix(gtf$V9, ncol=7, byrow=TRUE) # sep = ";"

The same thing happens with as.data.frame, data.frame or as.matrix.

I have also tried to write.csv and re-import the file with sep=";", but the data.frame is too big and my computer lags.

Any advice?

Another option is to use the splitstackshape package (which also loads data.table). Using:

library(splitstackshape)
cSplit(cSplit(df, 'V9', sep = ';', direction = 'long'),
       'V9', sep = ' ')[, dcast(.SD, cumsum(V9_1 == 'gene_id') ~ V9_1)]

gives:

   V9_1  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
1:    1 9.805420 4.347062 25.616962          NA 7.0762407256 1.000000  CUFF.1      CUFF.1.1
2:    2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
3:    3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
4:    4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1

You could do strsplit() within a sapply().
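As a rough illustration of that idea, here is a minimal base-R sketch. It assumes every field in V9 has the form "key value" and ends with a semicolon, as in the question's sample. The names v9, parse_row, rows, keys and parsed are made up for this example, and v9 is a shortened stand-in for gtf$V9.

```r
# Toy stand-in for gtf$V9 (shortened rows from the question)
v9 <- c(
  "gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256;",
  "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256;"
)

# Turn one V9 string into a named character vector (names = keys)
parse_row <- function(s) {
  fields <- trimws(strsplit(s, ";", fixed = TRUE)[[1]])  # "key value" pieces
  fields <- fields[nzchar(fields)]                       # drop empty pieces
  kv <- strsplit(fields, " ", fixed = TRUE)              # split key from value
  setNames(sapply(kv, `[`, 2), sapply(kv, `[`, 1))
}
rows <- lapply(v9, parse_row)

# Align every row on the union of keys; a key missing from a row becomes NA
keys <- unique(unlist(lapply(rows, names)))
parsed <- as.data.frame(
  do.call(rbind, lapply(rows, function(r) unname(r[keys]))),
  stringsAsFactors = FALSE
)
names(parsed) <- keys
```

All columns of parsed are character at this point, so numeric fields such as FPKM would still need as.numeric() before any arithmetic.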

If you know how many objects can be in V9, then you can do a for loop over it:

for (i in 1:number_of_max_objects_in_V9) {
  # take the i-th ';'-separated field of each row (NA if the row has fewer fields)
  gtf[ncol(gtf)+1] = sapply(1:nrow(gtf), function(x) trimws(strsplit(gtf$V9[x], ';')[[1]][i]))
}

If you don't know how many objects V9 can have, then just run a str_count on ; in gtf$V9 like this:

library(stringr)
# every field ends with ';', so the number of ';' in a row is its number of fields
number_of_max_objects_in_V9 <- max(str_count(gtf$V9, ';'))

# example dataset (only variable of interest included)
df = data.frame(V9=c("gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;"),
                stringsAsFactors = FALSE)

library(dplyr)
library(tidyr)

df %>%
  mutate(id = row_number()) %>%                  # flag row ids (will need those to reshape data later)                
  separate_rows(V9, sep="; ") %>%                # split strings and create new rows
  separate(V9, c("name","value"), sep=" ") %>%   # separate column name from value
  mutate(value = gsub(";","",value)) %>%         # remove ; when necessary
  spread(name, value)                            # reshape data

#   id  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
# 1  1 9.805420 4.347062 25.616962        <NA> 7.0762407256 1.000000  CUFF.1      CUFF.1.1
# 2  2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
# 3  3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
# 4  4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1

You can join this dataset back to your initial dataset using the row ids (id). You need to create an id in your original dataset as well.
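A minimal sketch of that join, written in base R with merge() rather than dplyr so it runs without extra packages. The two data frames below are made-up toy stand-ins: gtf plays the original data (the V1 and V3 values are invented for illustration) and parsed plays the reshaped output with its id column.

```r
# Hypothetical stand-in for the original data (V1/V3 values are made up)
gtf <- data.frame(
  V1 = c("chr1", "chr1"),
  V3 = c("transcript", "exon"),
  V9 = c("gene_id CUFF.1; FPKM 7.0762407256;",
         "gene_id CUFF.1; exon_number 1; FPKM 7.0762407256;"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the reshaped V9 data, keyed by row id
parsed <- data.frame(
  id          = 1:2,
  exon_number = c(NA, "1"),
  FPKM        = c("7.0762407256", "7.0762407256"),
  gene_id     = c("CUFF.1", "CUFF.1"),
  stringsAsFactors = FALSE
)

# Create the same row-number id in the original data
gtf$id <- seq_len(nrow(gtf))

# Drop the raw V9 column, then join the parsed columns on id
merged <- merge(gtf[, setdiff(names(gtf), "V9")], parsed,
                by = "id", all.x = TRUE)
```

The id must be created before any filtering or reordering of the original data, otherwise the row numbers no longer line up with those used when V9 was split.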

