
R: simpler ways of splitting columns and getting sums of category values

I have a data frame in which each row is an observation. The last column is called 'overlaps' and shows observations from a different dataset that occur at the same time as the observations in this data frame.

The results I have come from a question I previously asked about how to get overlapping data out of a data frame.

All of these overlapping observations have been concatenated together into a single column, like so:

 [1] "1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272"
 [2] "1_hands:CH2:4.021,2_body:N:14.272"
 [3] "1_hands:N:1.862,2_body:N:4.825"
 [4] "1_hands:CH2:1.978,2_body:N:4.825,2_body:CH1:1.075"
 [5] "1_hands:CH1:0.821,1_hands:N:1.417,1_hands:N:2.213,2_body:N:5.485"
 [6] "1_hands:CH1:3.557,2_body:N:3.519"
 [7] "1_hands:CH1:3.557,1_hands:N:1.249,2_body:N:3.519"
 [8] "1_hands:CH1:4.896,2_body:CH1:3.308"
 [9] "1_hands:CH1:4.896,2_body:CH1:3.308,2_body:N:1.67"
[10] "1_hands:CH1:4.896,2_body:N:1.67,2_body:CH1:5.288"

Each observation is separated by ",". The ":" separates the different elements of an observation. For example, the observation:

1_hands:N:1.768

would be divided up as such:

1_hands = category

N = value

1.768 = duration
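As a small illustration of that breakdown, here is a minimal base R sketch that splits a single observation into its three parts. The names `obs`, `parts` and `fields` are made up for the example and are not from the original data frame.

```r
# Split one observation on ":" into category, value and duration.
obs <- "1_hands:N:1.768"

# strsplit() returns a list with one element per input string;
# fixed = TRUE treats ":" as a literal character rather than a regex.
parts <- strsplit(obs, ":", fixed = TRUE)[[1]]

fields <- list(
  category = parts[1],
  value    = parts[2],
  duration = as.numeric(parts[3])  # durations arrive as text, so convert
)

str(fields)
```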

What I want to do is get the total duration for each combination of category and value. Essentially, I want to add up the durations of every "1_hands:N:X".

One way to do this is with the stringr package: I can use the various str_split functions to repeatedly break down the observations by the delimiters "," and ":", until I finally have a column of just the duration values for a particular category and value, which I can then sum.

However, it's monstrously inefficient, and I have to do this for multiple data sets.

Is there an easier way to do this? Is it possible to loop through the data as it is and just get the totals I need, without breaking it down into multiple sets of data frames?
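For reference, the stringr approach described above might look roughly like the sketch below. This is an assumption about the general shape, not the code actually used: the vector name `overlaps` and the two sample strings are taken from the example data purely for illustration.

```r
library(stringr)

# Two sample strings in the same format as the 'overlaps' column.
overlaps <- c(
  "1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272",
  "1_hands:CH2:4.021,2_body:N:14.272"
)

# Step 1: split every string on "," and flatten to one observation per element.
obs <- unlist(str_split(overlaps, ","))

# Step 2: split each observation on ":" into a 3-column character matrix.
parts <- str_split_fixed(obs, ":", 3)

# Step 3: keep only the rows for one category/value pair and sum the durations.
is_hands_n <- parts[, 1] == "1_hands" & parts[, 2] == "N"
total <- sum(as.numeric(parts[is_hands_n, 3]))
total
```

Doing this by hand means repeating step 3 for every category and value pair, which is what a grouped sum avoids.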

Not sure what exact efficiency you are trying to achieve, but this solution should be reasonably fast:

library(dplyr)
library(data.table)
library(stringr)
library(purrr)

df1 <- your_data[1:5,1]
df2 <- your_data[6:10,1]

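myFun <- function(data){
  # Split each string on ","; unlist() flattens to one observation per row, na.omit() drops the NA padding
  temp <- data.table(vars = data)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE)))] %>% na.omit()
  # Split each observation on ":" into its three named fields
  temp <- setDT(tstrsplit(temp$vars, ":", fixed = TRUE, names = c("category", "value", "duration")))
  # Return the result visibly (a trailing assignment would return it invisibly)
  temp
}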

dt <- list(df1, df2) %>%
  purrr::map(~ myFun(.x)) %>%
  rbindlist()
dt <- dt[, duration := as.numeric(duration)]

dt_sum <- dt[,.(durSum = sum(duration)), by = c("category", "value")]
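Since the question mentions several data sets, here is a hedged variant of the same idea that keeps a per-data-set total instead of pooling everything. The function name `sum_overlaps`, the list `datasets` and its names `a` and `b` are hypothetical; only `tstrsplit()`, `as.data.table()` and `rbindlist()` come from data.table.

```r
library(data.table)

# Hypothetical helper: turn one character vector of concatenated
# observations into total durations per category/value pair.
sum_overlaps <- function(x) {
  # One observation per element
  obs <- unlist(strsplit(x, ",", fixed = TRUE))
  # Three named columns: category, value, duration
  dt <- as.data.table(
    tstrsplit(obs, ":", fixed = TRUE, names = c("category", "value", "duration"))
  )
  dt[, .(durSum = sum(as.numeric(duration))), by = .(category, value)]
}

# Hypothetical input: a named list with one character vector per data set.
datasets <- list(
  a = c("1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272"),
  b = c("1_hands:N:1.862,2_body:N:4.825")
)

# idcol adds a column holding the list names, so totals stay separated.
per_dataset <- rbindlist(lapply(datasets, sum_overlaps), idcol = "dataset")
per_dataset
```

Dropping `idcol` and summing again by category and value would give pooled totals like `dt_sum` above.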

Please check the code below.

data

df <- data.frame(string=c("1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272",
                          "1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272",                                                                                                                                  
                          "1_hands:CH2:4.021,2_body:N:14.272",                                                                                                                                                
                          "1_hands:N:1.862,2_body:N:4.825",                                                                                                                                                   
                          "1_hands:CH2:1.978,2_body:N:4.825,2_body:CH1:1.075",                                                                                                                                
                          "1_hands:CH1:0.821,1_hands:N:1.417,1_hands:N:2.213,2_body:N:5.485",                                                                                                                 
                          "1_hands:CH1:3.557,2_body:N:3.519",                                                                                                                                                 
                          "1_hands:CH1:3.557,1_hands:N:1.249,2_body:N:3.519",                                                                                                                                 
                          "1_hands:CH1:4.896,2_body:CH1:3.308",                                                                                                                                               
                          "1_hands:CH1:4.896,2_body:CH1:3.308,2_body:N:1.67",                                                                                                                                 
                          "1_hands:CH1:4.896,2_body:N:1.67,2_body:CH1:5.288"))

code

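library(dplyr)

df %>% 
  tidyr::separate_rows(string, sep = ',') %>%  # one comma-separated observation per row
  tidyr::extract(string, into = c('category','value','duration'), regex = '(.*):(.*):(.*)') %>% 
  group_by(category, value) %>% summarise(duration=sum(as.numeric(duration)))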

Created on 2023-01-27 with reprex v2.0.2

output

# A tibble: 5 × 3
# Groups:   category [2]
  category value duration
  <chr>    <chr>    <dbl>
1 1_hands  CH1      22.6 
2 1_hands  CH2       6.00
3 1_hands  N        17.0 
4 2_body   CH1      13.0 
5 2_body   N        68.3 
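If adding packages is not an option, the same totals can probably be reproduced in base R alone. The sketch below is one possible way, not part of the answer above; it reuses the same 11 strings as `df` (the first string appears twice there), and the names `strings`, `obs`, `parts`, `overlaps` and `totals` are chosen only for this example.

```r
# Same strings as in the 'data' block above (first entry is repeated there too).
strings <- c(
  "1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272",
  "1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272",
  "1_hands:CH2:4.021,2_body:N:14.272",
  "1_hands:N:1.862,2_body:N:4.825",
  "1_hands:CH2:1.978,2_body:N:4.825,2_body:CH1:1.075",
  "1_hands:CH1:0.821,1_hands:N:1.417,1_hands:N:2.213,2_body:N:5.485",
  "1_hands:CH1:3.557,2_body:N:3.519",
  "1_hands:CH1:3.557,1_hands:N:1.249,2_body:N:3.519",
  "1_hands:CH1:4.896,2_body:CH1:3.308",
  "1_hands:CH1:4.896,2_body:CH1:3.308,2_body:N:1.67",
  "1_hands:CH1:4.896,2_body:N:1.67,2_body:CH1:5.288"
)

# One observation per element, then a 3-column character matrix.
obs   <- unlist(strsplit(strings, ",", fixed = TRUE))
parts <- do.call(rbind, strsplit(obs, ":", fixed = TRUE))

overlaps <- data.frame(
  category = parts[, 1],
  value    = parts[, 2],
  duration = as.numeric(parts[, 3])
)

# Grouped sum of duration for every category/value combination.
totals <- aggregate(duration ~ category + value, data = overlaps, FUN = sum)
totals <- totals[order(totals$category, totals$value), ]
totals
```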
