[英]Split multiple columns into multiple columns using r
I have a txt file which contains 20 columns and 300 rows. 我有一个包含20列和300行的txt文件。 The sample of my data is given below. 我的数据样本如下。
id sub A1 A2 B1 B2 C1
96 AAA 01:01:01:01/01:01:01:02N 29:02:01 08:01:01/08:19N 44:03:01/44:03:03/44:03:04 07:01:01/07:01:02
97 AAA 03:01:01:01/03:01:01:02N 30:08:01 09:02:01/08:19N 44:03:01/44:03:03/44:03:04 07:01:01/07:01:02
98 AAA 01:01:01:01/01:01:01:02N/01:22N 29:02:01 08:01:01/08:19N 44:03:01/44:03:03/44:03:04 07:09:01/07:01:02
99 AAA 03:01:01:01 30:08:01 09:02:01/08:19N 44:03:01/44:03:03/44:03:04 07:08:01/07:01:02
I need to seperate the columns (A1,A2,B1....) with the seperator "/" using r. 我需要使用r分隔列(A1,A2,B1 ....)和分隔符“/”。 The output would be: 输出将是:
id sub A1_1 A1_2 A2 B1_1 B1_2 B2_1 B2_2 ..
96 AAA 01:01:01:01 01:01:01:02N 29:02:01 08:01:01 08:19N 44:03:01 44:03:03 44:03:04 ...
I could find functions to split one columns into multiple columns. 我可以找到将一列拆分成多列的函数。 But I could not find a solution to achieve this. 但我无法找到解决方案来实现这一目标。
Here is a tidyverse
solution. 这是一个tidyverse
解决方案。
library(tidyverse)
df %>%
gather(key, value, -c(1:2)) %>%
separate_rows(value, sep = "/") %>%
group_by(key, id) %>%
mutate(key2 = paste0(key, "_", seq_along(key))) %>%
ungroup() %>%
select(-key) %>%
spread(key2, value)
# A tibble: 4 x 13
# id sub A1_1 A1_2 A1_3 A2_1 B1_1 B1_2 B2_1 B2_2 B2_3 C1_1 C1_2
#* <fct> <fct> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 96 AAA 01:01:01:01 01:01:01:02N <NA> 29:02:01 08:01:01 08:19N 44:03:01 44:03:03 44:03:04 07:01:01 07:01:02
#2 97 AAA 03:01:01:01 03:01:01:02N <NA> 30:08:01 09:02:01 08:19N 44:03:01 44:03:03 44:03:04 07:01:01 07:01:02
#3 98 AAA 01:01:01:01 01:01:01:02N 01:22N 29:02:01 08:01:01 08:19N 44:03:01 44:03:03 44:03:04 07:09:01 07:01:02
#4 99 AAA 03:01:01:01 <NA> <NA> 30:08:01 09:02:01 08:19N 44:03:01 44:03:03 44:03:04 07:08:01 07:01:02
After gather
ing columns all columns except the first and the second ( -c(1:2)
), I used tidyr::separate_rows
to separate the values in newly created column value
by "/"
. 在gather
除第一个和第二个( -c(1:2)
)之外的所有列之后,我使用tidyr::separate_rows
将新创建的列value
除以"/"
。 After creating a new column key2
which is column key
with the extension _1:number of separators
, I unselected column key
and spread
column key2
by value
. 创建一个新列key2
,它是列key
,扩展名为_1:number of separators
,我取消选择列key
并按value
spread
列key2
。
data 数据
df <- structure(list(id = structure(1:4, .Label = c("96", "97",
"98", "99"), class = "factor"), sub = structure(c(1L,
1L, 1L, 1L), .Label = "AAA", class = "factor"), A_A1 = structure(c(1L,
4L, 2L, 3L), .Label = c("01:01:01:01/01:01:01:02N", "01:01:01:01/01:01:01:02N/01:22N",
"03:01:01:01", "03:01:01:01/03:01:01:02N"), class = "factor"),
A_A2 = structure(c(1L, 2L, 1L, 2L), .Label = c("29:02:01",
"30:08:01"), class = "factor"), B_B1 = structure(c(1L,
2L, 1L, 2L), .Label = c("08:01:01/08:19N", "09:02:01/08:19N"
), class = "factor"), B_B2 = structure(c(1L, 1L, 1L, 1L
), .Label = "44:03:01/44:03:03/44:03:04", class = "factor"),
C1 = structure(c(1L, 1L, 3L, 2L), .Label = c("07:01:01/07:01:02",
"07:08:01/07:01:02", "07:09:01/07:01:02"), class = "factor")), .Names = c("id",
"sub", "A_A1", "A_A2", "B_B1", "B_B2", "C_C1"), class = "data.frame", row.names = c(NA,
-4L))
I suggest a reshape2 solution taking care of not knowing the number of parts: 我建议使用reshape2解决方案,不知道零件数量:
> dput(pz1)
structure(list(id = c("HG00096", "HG00097", "HG00098", "HG00099"
), sub = c("GBR", "GBR", "GBR", "GBR"), HLA_A1 = c("01:01:01:01/01:01:01:02N",
"03:01:01:01/03:01:01:02N", "01:01:01:01/01:01:01:02N/01:22N",
"03:01:01:01"), HLA_A2 = c("29:02:01", "30:08:01", "29:02:01",
"30:08:01"), HLA_B1 = c("08:01:01/08:19N", "09:02:01/08:19N",
"08:01:01/08:19N", "09:02:01/08:19N"), HLA_B2 = c("44:03:01/44:03:03/44:03:04",
"44:03:01/44:03:03/44:03:04", "44:03:01/44:03:03/44:03:04", "44:03:01/44:03:03/44:03:04"
), HLA_C1 = c("07:01:01/07:01:02", "07:01:01/07:01:02", "07:09:01/07:01:02",
"07:08:01/07:01:02")), .Names = c("id", "sub", "HLA_A1", "HLA_A2",
"HLA_B1", "HLA_B2", "HLA_C1"), row.names = c(NA, -4L), class = "data.frame")
add this function: 添加此功能:
library("reshape2", lib.loc="~/R/win-library/3.3")
getIt <- function(df,col) {
x <- max(sapply(strsplit(df[,col],split="/"),length)) ### get the max parts for column
q <- colsplit(string = df[,col],pattern="/",names = paste0(names(df)[col],"_",LETTERS[1:x]))
return(q) }
after you have this function you can easily do: 拥有此功能后,您可以轻松完成:
> getIt(pz1,3)
HLA_A1_A HLA_A1_B HLA_A1_C
1 01:01:01:01 01:01:01:02N
2 03:01:01:01 03:01:01:02N
3 01:01:01:01 01:01:01:02N 01:22N
4 03:01:01:01
and a simple cbind with the original dataframe (with or without the original columns) : 和一个简单的cbind与原始数据帧(有或没有原始列):
> cbind(pz1[,1:2],getIt(pz1,3),getIt(pz1,4),getIt(pz1,5),getIt(pz1,6))
id sub HLA_A1_A HLA_A1_B HLA_A1_C HLA_A2_A HLA_B1_A HLA_B1_B HLA_B2_A HLA_B2_B HLA_B2_C
1 HG00096 GBR 01:01:01:01 01:01:01:02N 29:02:01 08:01:01 08:19N 44:03:01 44:03:03 44:03:04
2 HG00097 GBR 03:01:01:01 03:01:01:02N 30:08:01 09:02:01 08:19N 44:03:01 44:03:03 44:03:04
3 HG00098 GBR 01:01:01:01 01:01:01:02N 01:22N 29:02:01 08:01:01 08:19N 44:03:01 44:03:03 44:03:04
4 HG00099 GBR 03:01:01:01 30:08:01 09:02:01 08:19N 44:03:01 44:03:03 44:03:04
I'd take an approach like the following: 我采取如下方法:
library(data.table)
setDT(df) # convert to a data.table
# identify the columns you want to split
cols <- grep("^HLA", names(df), value = TRUE)
# loop through them and split them
# assign them back to the data.table, by reference
for (i in cols) {
temp <- tstrsplit(df[[i]], "/")
set(df, j = sprintf("%s_%d", i, seq_along(temp)), value = temp)
set(df, j = i, value = NULL)
}
Here's the result: 这是结果:
df[]
# id sub HLA_A1_1 HLA_A1_2 HLA_A1_3 HLA_A2_1 HLA_B1_1 HLA_B1_2 HLA_B2_1 HLA_B2_2 HLA_B2_3 HLA_C1_1 HLA_C1_2
# 1: HG00096 GBR 01:01:01:01 01:01:01:02N NA 29:02:01 08:01:01 08:19N 44:03:01 44:03:03 44:03:04 07:01:01 07:01:02
# 2: HG00097 GBR 03:01:01:01 03:01:01:02N NA 30:08:01 09:02:01 08:19N 44:03:01 44:03:03 44:03:04 07:01:01 07:01:02
# 3: HG00098 GBR 01:01:01:01 01:01:01:02N 01:22N 29:02:01 08:01:01 08:19N 44:03:01 44:03:03 44:03:04 07:09:01 07:01:02
# 4: HG00099 GBR 03:01:01:01 NA NA 30:08:01 09:02:01 08:19N 44:03:01 44:03:03 44:03:04 07:08:01 07:01:02
Aside from being easier to scale than the accepted answer (things aren't really hard-coded), this is at least twice as fast as that approach, and a lot faster than the "tidyverse" approach, which is quite inefficient because it first makes the data very long before going back into a wide format. 除了比接受的答案更容易扩展(事情并非真正的硬编码),这至少是该方法的两倍,并且比“tidyverse”方法快得多 ,因为它首先是非常低效的在回到宽格式之前,数据会很长。
To get a sense of the performance difference, try the following: 要了解性能差异,请尝试以下操作:
myfun <- function(df) {
cols <- grep("^HLA", names(df), value = TRUE)
for (i in cols) {
temp <- tstrsplit(df[[i]], "/")
set(df, j = sprintf("%s_%d", i, seq_along(temp)), value = temp)
set(df, j = i, value = NULL)
}
df[]
}
tidyfun <- function(df) {
df %>%
gather(key, value, -c(1:2)) %>%
separate_rows(value, sep = "/") %>%
group_by(key, id) %>%
mutate(key2 = paste0(key, "_", seq_along(key))) %>%
ungroup() %>%
select(-key) %>%
spread(key2, value)
}
getIt <- function(df,col) {
x <- max(sapply(strsplit(as.character(df[,col]),split="/"),length))
q <- colsplit(string = as.character(df[,col]),pattern="/",
names = paste0(names(df)[col],"_",LETTERS[1:x]))
return(q)
}
reshape2fun <- function(dfdf) {
cbind(dfdf[,1:2], getIt(dfdf,3), getIt(dfdf,4), getIt(dfdf,5), getIt(dfdf,6))
}
library(microbenchmark)
dfdf <- as.data.frame(df)
microbenchmark(myfun(copy(df)), reshape2fun(dfdf), tidyfun(df))
# Unit: microseconds
# expr min lq mean median uq max neval
# myfun(copy(df)) 241.55 272.5965 625.7634 359.148 380.0395 28632.94 100
# reshape2fun(dfdf) 5076.24 5368.3835 5841.8784 5539.577 5639.8765 34176.13 100
# tidyfun(df) 37864.68 39435.1915 41152.5916 39801.499 40489.7055 70019.04 100
biggerdf <- rbindlist(replicate(2500, df, FALSE)) # nrow = 10,000
dfdf <- as.data.frame(biggerdf)
microbenchmark(myfun(copy(biggerdf)), reshape2fun(dfdf), tidyfun(biggerdf), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# myfun(copy(biggerdf)) 50.87452 52.0059 54.59288 53.03503 53.79347 68.69892 10
# reshape2fun(dfdf) 120.90291 124.3893 137.54154 126.06213 157.50532 159.15069 10
# tidyfun(biggerdf) 1312.75422 1350.6651 1394.93082 1358.21612 1373.86793 1732.86521 10
BIGGERdf <- rbindlist(replicate(100, biggerdf, FALSE)) # nrow = 1,000,000
dfdf <- as.data.frame(BIGGERdf)
system.time(tidyfun(BIGGERdf)) # > 2 minutes!
# user system elapsed
# 141.373 1.048 142.403
microbenchmark(myfun(copy(BIGGERdf)), reshape2fun(dfdf), times = 5)
# Unit: seconds
# expr min lq mean median uq max neval
# myfun(copy(BIGGERdf)) 5.180048 5.574677 6.026515 5.764467 6.498967 7.114415 5
# reshape2fun(dfdf) 8.858202 9.095027 9.629969 9.264896 10.192161 10.739560 5
I second @Sotos advice, it is important to write a reproducible example so the focus is only on the problem at hand. 我的第二个@Sotos建议,重要的是写一个可重复的例子,所以重点只在于手头的问题。
I came up with this fake data to try to answer your question: 我想出了这些假数据,试图回答你的问题:
> df <- data.frame(
+ id = c(1:5),
+ sub = sample(c("GBR", "BRA"), size = 5, replace = T),
+ HLA_A = paste0(rep("01:01", 5), "/", rep("01:02N")),
+ HLA_B = paste0(rep("01:03", 5), "/", "01:42N", "/", "32:20"),
+ HLA_C = paste0(rep("01:03", 5)), stringsAsFactors = F)
>
>
> df
id sub HLA_A HLA_B HLA_C
1 1 GBR 01:01/01:02N 01:03/01:42N/32:20 01:03
2 2 BRA 01:01/01:02N 01:03/01:42N/32:20 01:03
3 3 GBR 01:01/01:02N 01:03/01:42N/32:20 01:03
4 4 GBR 01:01/01:02N 01:03/01:42N/32:20 01:03
5 5 BRA 01:01/01:02N 01:03/01:42N/32:20 01:03
You can use strsplit()
to split the column by a given character (in this case "/"
). 您可以使用strsplit()
按给定字符(在本例中为"/"
)拆分列。 Use do.call(rbind, .)
to bind the lists in column format. 使用do.call(rbind, .)
以列格式绑定列表。 Repeat this process for the columns you wish to target and them bind them all with the id
and sub
columns. 对要定位的列重复此过程,然后将它们与id
和sub
列绑定在一起。 Here is the solution: 这是解决方案:
Without using any dependencies: 不使用任何依赖项:
> col.ind <- grep(x = names(df), pattern = "HLA", value = T, ignore.case = T) # your target columns
>
> # lapply to loop the column split process, output is a list, so you need to columb-bind the resulting objects
>
> cols.list <- lapply(seq_along(col.ind), function(x){
+
+ p1 <- do.call(rbind, strsplit(df[[col.ind[[x]]]], split = "/")) # split col by "/"
+
+ p2 <- data.frame(p1, stringsAsFactors = F) # make it into a data.frame
+
+ i <- ncol(p2) # this is an index placeholder that will enable you to rename the recently split columns in a sequential manner
+
+ colnames(p2) <- paste0(col.ind[[x]], c(1:i)) # rename columns
+
+ return(p2) # return the object of interest
+ }
+ )
>
>
> new.df <- cbind(df[1:2], do.call(cbind, cols.list)) # do.call once again to bind the lapply object and column-bind those with the first two columns of your initial data.frame
> new.df
id sub HLA_A1 HLA_A2 HLA_B1 HLA_B2 HLA_B3 HLA_C1
1 1 GBR 01:01 01:02N 01:03 01:42N 32:20 01:03
2 2 BRA 01:01 01:02N 01:03 01:42N 32:20 01:03
3 3 GBR 01:01 01:02N 01:03 01:42N 32:20 01:03
4 4 GBR 01:01 01:02N 01:03 01:42N 32:20 01:03
5 5 BRA 01:01 01:02N 01:03 01:42N 32:20 01:03
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.