Split values of different lengths and bind to columns

I've got a rather large (around 100k observations) data set, similar to this:

data <- data.frame(
                 ID = seq(1, 5, 1),
                 Values = c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"),
                 stringsAsFactors = FALSE)
data
  ID        Values
1  1         1,2,3
2  2             4
3  3              
4  4 4,1,6,5,1,1,6
5  5           0,0

I want to split the Values column by "," with NA for missing cells:

ID v1 v2 v3 v4 v5 v6 v7
1  1  2  3  NA NA NA NA
2  4  NA NA NA NA NA NA
3  NA NA NA NA NA NA NA
4  4  1  6  5  1  1  6
5  0  0  NA NA NA NA NA
...

My best attempt was strsplit + rbind:

df <- data.frame(do.call(
                        "rbind",
                        strsplit(as.character(data$Values), split = "," , fixed = FALSE)
                        ))

But the rbind function just recycles all the 'short' rows instead of filling them with NA. I have found a similar problem.
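To make the recycling concrete, here is a minimal illustration on two hand-made vectors (the names `long` and `short` exist only in this sketch and are not part of the data above):

```r
# Two split results of unequal length, as strsplit() might return them
long  <- c("1", "2", "3")
short <- c("4")

# rbind() recycles the shorter vector across the row instead of padding with NA
m <- rbind(long, short)
m["short", ]  # every cell in this row holds "4"
```

When the longer length is not a multiple of the shorter one, R still recycles but also emits a warning.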

Many thanks, Leo

I would suggest looking at my cSplit function or approaching the problem manually.

The cSplit approach would simply be:

library(splitstackshape) # provides cSplit
cSplit(data, "Values", ",")
#    ID Values_1 Values_2 Values_3 Values_4 Values_5 Values_6 Values_7
# 1:  1        1        2        3       NA       NA       NA       NA
# 2:  2        4       NA       NA       NA       NA       NA       NA
# 3:  3                NA       NA       NA       NA       NA       NA
# 4:  4        4        1        6        5        1        1        6
# 5:  5        0        0       NA       NA       NA       NA       NA

Approaching the problem manually would look like:

## Split up the values
Split <- strsplit(data$Values, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),
            ncol = max(Ncol), 
            dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol), 
        sequence(Ncol))] <- unlist(Split, use.names = FALSE)
## Bind the values back together, here as a "data.table" (faster)
library(data.table)
data.table(ID = data$ID, M)

That's pretty much what goes on in cSplit, but the function has a few other options and some basic error checking, which might make it a little slower than a purely manual approach (or a function written to address your specific problem).
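The matrix-indexing step in the manual approach may be easier to follow on a toy input. The sketch below uses a hand-made list in place of the strsplit() result and names the index matrix `idx` purely for illustration:

```r
# Stand-in for the output of strsplit(): one character vector per row
Split <- list(c("1", "2", "3"), "4", c("0", "0"))
Ncol  <- lengths(Split)  # number of pieces per row: 3 1 2

# Two-column index: each piece gets a (row, col) target cell
idx <- cbind(row = rep(seq_along(Split), Ncol),
             col = sequence(Ncol))

# Start from an all-NA matrix and fill only the targeted cells
M <- matrix(NA_character_, nrow = length(Split), ncol = max(Ncol))
M[idx] <- unlist(Split, use.names = FALSE)
M
```

Cells that never appear in `idx` are left as NA, which is what pads the short rows.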

Both of these approaches would be faster than a "data.table" + "reshape2" approach. Also, since each row is treated individually, you shouldn't have any problems even if you have duplicated ID values: your output should have the same number of rows as your input.


Benchmarks

I've done benchmarks on more rows and on data that would give "wider" results (since that's implied in your comments to David's answer).

Here is the sample data:

set.seed(1)
a <- sample(0:100, 100000, TRUE)
Values <- vapply(a, function(x) 
  paste(sample(0:100, x, TRUE), collapse = ","), character(1L))
Values[sample(length(Values), length(Values) * .15)] <- ""
ID <- c(1:80000, 1:20000)
data <- data.frame(ID, Values, stringsAsFactors = FALSE)
DT <- as.data.table(data)

Here are the functions to test:

fun1a <- function(inDT) {
  data2 <- inDT[, list(Values = unlist(
    strsplit(Values, ","))), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID] 
  dcast.data.table(data2, ID ~ Var, 
                   fill = NA_character_, 
                   value.var = "Values")
}

fun1b <- function(inDT) {
  data2 <- inDT[, list(Values = unlist(
    strsplit(Values, ",", fixed = TRUE), 
    use.names = FALSE)), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID] 
  dcast.data.table(data2, ID ~ Var, 
                   fill = NA_character_, 
                   value.var = "Values")
}

fun2 <- function(inDT) {
  cSplit(inDT, "Values", ",")
}

fun3 <- function(inDF) {
  Split <- strsplit(inDF$Values, ",", fixed = TRUE)
  Ncol <- vapply(Split, length, 1L)
  M <- matrix(NA_character_, nrow = nrow(inDF),
              ncol = max(Ncol), 
              dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
  M[cbind(rep(1:nrow(inDF), Ncol), 
          sequence(Ncol))] <- unlist(Split, use.names = FALSE)
  data.table(ID = inDF$ID, M)
}

Here are the results:

library(microbenchmark)
microbenchmark(fun2(DT), fun3(data), times = 20)
# Unit: seconds
#        expr      min       lq   median       uq      max neval
#    fun2(DT) 4.810942 5.173103 5.498279 5.622279 6.003339    20
#  fun3(data) 3.847228 3.929311 4.058728 4.160082 4.664568    20

## Didn't want to microbenchmark here...
system.time(fun1a(DT))
#    user  system elapsed 
#   16.92    0.50   17.59
system.time(fun1b(DT))  # fixed = TRUE & use.names = FALSE
#    user  system elapsed 
#   11.54    0.42   12.01

NOTE: The results of fun1a and fun1b would not be the same as those of fun2 and fun3 because of the duplicated IDs.
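A small sketch of why duplicated IDs matter. It uses base R only, with tapply() as a stand-in for the `by = ID` grouping (an assumption made to keep the example dependency-free), and a hypothetical three-row data frame called `toy`:

```r
# Hypothetical input in which ID 1 appears twice
toy <- data.frame(ID = c(1, 2, 1),
                  Values = c("1,2", "3", "4,5,6"),
                  stringsAsFactors = FALSE)

# Row-wise splitting: one list element per input row
row_wise <- strsplit(toy$Values, ",", fixed = TRUE)
length(row_wise)

# Grouping by ID: the two rows for ID 1 are pooled into a single group
by_id <- tapply(toy$Values, toy$ID, function(v)
  unlist(strsplit(v, ",", fixed = TRUE), use.names = FALSE),
  simplify = FALSE)
length(by_id)
by_id[["1"]]
```

The row-wise result keeps three entries, while the grouped result has only two, with the pieces from both ID 1 rows concatenated.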

Here's an approach that combines data.table with reshape2 (should be very efficient):

library(data.table) # Loading `data.table` package
data2 <- setDT(data)[, list(Values = unlist(strsplit(Values, ","))), by = ID] # splitting the values by `,` for each `ID`
data2[, Var := paste0("v", seq_len(.N)), by = ID] # Adding the `Var` variable

library(reshape2) # Loading `reshape2` package
dcast.data.table(data2, ID ~ Var, fill = NA_character_, value.var = "Values") # decasting

#    ID v1 v2 v3 v4 v5 v6 v7
# 1:  1  1  2  3 NA NA NA NA
# 2:  2  4 NA NA NA NA NA NA
# 3:  3    NA NA NA NA NA NA
# 4:  4  4  1  6  5  1  1  6
# 5:  5  0  0 NA NA NA NA NA
