Splitting a data.frame column into other columns

Question

I have a big data.frame with several columns, but the 9th column contains data separated by semicolons:

    gtf$V9
1                 gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
2  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
3  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;
4  gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;

So I would like to split this column into several columns and later merge them with the rest of the data.frame (the columns before the 9th).

I've tried some code without success:

head(gtf$V9, sep = ";",stringsAsFactors = FALSE)

or

new_df <- matrix(gtf$V9, ncol=7, byrow=TRUE) # sep = ";"

I get the same result with as.data.frame, data.frame or as.matrix.

I have also tried to write the data out with write.csv and re-import it with sep=";", but the data.frame is too big and my computer lags.

Any advice?

Answer 1

Another option is to use the splitstackshape package (which also loads data.table). Using:

library(splitstackshape)
cSplit(cSplit(df, 'V9', sep = ';', direction = 'long'),
       'V9', sep = ' ')[, dcast(.SD, cumsum(V9_1 == 'gene_id') ~ V9_1)]

gives:

   V9_1  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
1:    1 9.805420 4.347062 25.616962          NA 7.0762407256 1.000000  CUFF.1      CUFF.1.1
2:    2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
3:    3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
4:    4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1

Answer 2

You could call strsplit() within sapply().

If you know the maximum number of fields V9 can contain, then you can loop over them with a for loop:

# Add one new column per ';'-separated field of V9
for (i in 1:number_of_max_objects_in_V9) {
  gtf[ncol(gtf) + 1] <- sapply(1:nrow(gtf), function(x) trimws(strsplit(as.character(gtf$V9[x]), ";")[[1]][i]))
}

If you don't know how many fields V9 can have, count the ; separators in gtf$V9 with str_count, like this:

library(stringr)
# Every field in V9 ends with ';', so the count of ';' is the number of fields
number_of_max_objects_in_V9 <- max(str_count(gtf$V9, ";"))
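
One caveat with splitting by position: the rows in the question do not all carry the same attributes (row 1 has no exon_number), so the i-th field is not the same attribute in every row. Below is a minimal base R sketch that keys on the attribute name instead. It assumes each field has the form "name value" with the name ending at the first space; parse_attributes and the shortened sample strings are hypothetical, made up for illustration.

```r
# Two shortened sample strings; the first one lacks exon_number
v9 <- c("gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256;",
        "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256;")

# Hypothetical helper: turn one attribute string into a named character vector
parse_attributes <- function(s) {
  fields <- trimws(strsplit(s, ";", fixed = TRUE)[[1]])
  fields <- fields[nzchar(fields)]        # drop empty pieces, if any
  keys   <- sub(" .*$", "", fields)       # text before the first space
  values <- sub("^[^ ]+ +", "", fields)   # text after the first space
  setNames(values, keys)
}

parsed   <- lapply(v9, parse_attributes)
all_keys <- unique(unlist(lapply(parsed, names)))

# One row per input string, one column per attribute name;
# attributes missing from a row become NA
attrs <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) setNames(p[all_keys], all_keys))),
  stringsAsFactors = FALSE
)
```

All resulting columns are character; convert the numeric ones with as.numeric() if needed.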

Answer 3

# example dataset (only variable of interest included)
df = data.frame(V9=c("gene_id CUFF.1; transcript_id CUFF.1.1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 1; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 2; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;",
                "gene_id CUFF.1; transcript_id CUFF.1.1; exon_number 3; FPKM 7.0762407256; frac 1.000000; conf_lo 4.347062; conf_hi 9.805420; cov 25.616962;"),
                stringsAsFactors = F)

library(dplyr)
library(tidyr)

df %>%
  mutate(id = row_number()) %>%                  # flag row ids (will need those to reshape data later)                
  separate_rows(V9, sep="; ") %>%                # split strings and create new rows
  separate(V9, c("name","value"), sep=" ") %>%   # separate column name from value
  mutate(value = gsub(";","",value)) %>%         # remove ; when necessary
  spread(name, value)                            # reshape data

#   id  conf_hi  conf_lo       cov exon_number         FPKM     frac gene_id transcript_id
# 1  1 9.805420 4.347062 25.616962        <NA> 7.0762407256 1.000000  CUFF.1      CUFF.1.1
# 2  2 9.805420 4.347062 25.616962           1 7.0762407256 1.000000  CUFF.1      CUFF.1.1
# 3  3 9.805420 4.347062 25.616962           2 7.0762407256 1.000000  CUFF.1      CUFF.1.1
# 4  4 9.805420 4.347062 25.616962           3 7.0762407256 1.000000  CUFF.1      CUFF.1.1

You can join this dataset back to your initial dataset using the row ids (id). You need to create an id in your original dataset as well.
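
A minimal sketch of that join, written in base R with merge() rather than dplyr so that it runs without extra packages. The gtf and parsed data frames here are small hand-written stand-ins (the V1 column and the shortened values are assumptions for illustration); in practice parsed would be the reshaped result from the pipeline above.

```r
# Toy stand-in for the original data (V1 is an illustrative extra column)
gtf <- data.frame(
  V1 = c("chr1", "chr1"),
  V9 = c("gene_id CUFF.1; FPKM 7.07;",
         "gene_id CUFF.1; exon_number 1; FPKM 7.07;"),
  stringsAsFactors = FALSE
)

# Hand-written stand-in for the reshaped attribute table
parsed <- data.frame(
  id          = 1:2,
  exon_number = c(NA, "1"),
  FPKM        = c("7.07", "7.07"),
  gene_id     = c("CUFF.1", "CUFF.1"),
  stringsAsFactors = FALSE
)

# Create the same row id in the original data, before any reordering
gtf$id <- seq_len(nrow(gtf))

# Join on id and leave out the now-redundant V9 column
result <- merge(gtf[, names(gtf) != "V9"], parsed, by = "id")
```

With dplyr, the equivalent of the last step would be left_join() on the id column.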

Answer 1: score 3, accepted, 2017-12-08 11:34:02
Answer 2: score 1, 2017-12-08 11:18:13
Answer 3: score 1, 2017-12-08 11:20:03
