Conditional grouping and summarizing data frame in R
I have a data frame like this:
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))
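|ID | time| intensity|
|:--|----:|---------:|
|A  |  3.1|        10|
|A  |  3.2|        20|
|B  |  6.5|        30|
|B  | 12.3|        40|
|C  |  3.2|        50|
|C  |  3.4|        60|

I want to aggregate the values by ID (summing the intensity) only when the time difference within an ID is less than 0.3. First I calculated this time difference: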
library(dplyr)
df.2 <- df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time))
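...which results in:

|ID | time| intensity| time.diff|
|:--|----:|---------:|---------:|
|A  |  3.1|        10|       0.1|
|A  |  3.2|        20|       0.1|
|B  |  6.5|        30|       5.8|
|B  | 12.3|        40|       5.8|
|C  |  3.2|        50|       0.2|
|C  |  3.4|        60|       0.2|

To be clear, the output I am hoping for is:

|ID | time| intensity| time.diff|
|:--|----:|---------:|---------:|
|A  | 3.15|        30|       0.1|
|B  |  6.5|        30|       5.8|
|B  | 12.3|        40|       5.8|
|C  |  3.3|       110|       0.2|

Here, time is the mean of the combined observations and intensity is their sum. ID "B" keeps both observations because its time difference is greater than 0.3. I have tried dplyr, but summarise always collapses "B" into a single observation, and I want to keep both. I don't know how to do a conditional _group_by_.
I appreciate any ideas!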
A possible option with data.table:
library(data.table)
unique(setDT(df)[, time.diff := max(time) - min(time), ID][
  time.diff <= 0.3,
  c('time', 'intensity') := list(mean(time), sum(intensity)), ID])
# ID time intensity time.diff
#1: A 3.15 30 0.1
#2: B 6.50 30 5.8
#3: B 12.30 40 5.8
#4: C 3.30 110 0.2
Or using dplyr:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time), indx = all(time.diff <= 0.3),
         intensity = ifelse(indx, sum(intensity), intensity),
         time = ifelse(indx, mean(time), time)) %>%
  filter(!indx | row_number() == 1) %>%
  select(-indx)
# ID time intensity time.diff
#1 A 3.15 30 0.1
#2 B 6.50 30 5.8
#3 B 12.30 40 5.8
#4 C 3.30 110 0.2
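The "conditional group_by" asked about in the question can also be sketched with an auxiliary grouping key. This is only an illustration, not part of the answer above: the column name `grp` is made up here, the threshold is taken as `<= 0.3` to match the other answers, and `.groups = "drop"` assumes dplyr 1.0.0 or later. IDs whose time span is within the threshold share one key, so they collapse to a single row, while every other row gets its own key and survives `summarise()` unchanged.

```r
library(dplyr)

df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

res <- df %>%
  group_by(ID) %>%
  mutate(time.diff = max(time) - min(time),
         # hypothetical helper key: 0 for the whole ID when its span is small,
         # otherwise the row number within the ID, so each row is its own group
         grp = if (time.diff[1] <= 0.3) 0L else row_number()) %>%
  group_by(ID, grp, time.diff) %>%
  summarise(time = mean(time), intensity = sum(intensity), .groups = "drop") %>%
  select(ID, time, intensity, time.diff)

print(res)
```

Because single-row groups have a mean and sum equal to their own values, the rows for "B" pass through untouched.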
Another variant of the data.table solution:
setDT(df)[, time.diff := max(time) - min(time), by = ID
  ][, if (time.diff <= 0.3)
        .(time = mean(time), intensity = sum(intensity))
      else .SD, by = .(ID, time.diff)]
# ID time.diff time intensity
# 1: A 0.1 3.15 30
# 2: B 5.8 6.50 30
# 3: B 5.8 12.30 40
# 4: C 0.2 3.30 110
# get time.diff
df$time.diff <- ave(x = df$time,df$ID,FUN = function(x){max(x)-min(x)})
# new split variable to use with ID (assumes the rows of each ID are contiguous)
df$cut <- cumsum(df$time.diff > .3)
# aggregate everything you need and ignore the cut variable
require(plyr)
ddply(df,c('cut','ID'),summarize,
time = mean(time),
intensity = sum(intensity),
time.diff = mean(time.diff))[2:5]
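For comparison, a base R sketch that avoids plyr and mirrors the overwrite-then-deduplicate idea of the first data.table answer. The names `out` and `tight` are illustrative, and `<= 0.3` is assumed as the threshold: group-level values are written into the rows of every ID with a small time span, and `unique()` then drops the resulting duplicate rows.

```r
df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"),
                 time = c(3.1, 3.2, 6.5, 12.3, 3.2, 3.4),
                 intensity = c(10, 20, 30, 40, 50, 60))

out <- df
# time span within each ID, repeated on every row of that ID
out$time.diff <- ave(df$time, df$ID, FUN = function(x) max(x) - min(x))

# rows belonging to IDs that should be collapsed
tight <- out$time.diff <= 0.3

# overwrite those rows with the per-ID mean time and summed intensity
out$time[tight] <- ave(df$time, df$ID)[tight]                   # ave() defaults to mean
out$intensity[tight] <- ave(df$intensity, df$ID, FUN = sum)[tight]

# rows of a collapsed ID are now identical, so unique() keeps one of each
res <- unique(out)
print(res)
```

Unlike the `cumsum` approach, this does not depend on the row order of `df`, although the original row names of the kept rows are retained in the result.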
Using sqldf:
library(sqldf)
sqldf('SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif FROM df
GROUP BY ID
HAVING (MAX(time)-MIN(time))<0.3
UNION
SELECT ID, df.time, df.intensity, df2.dif
FROM (SELECT ID, AVG(time) time, SUM(intensity) intensity, (MAX(time)-MIN(time)) dif
FROM df
GROUP BY ID
HAVING (MAX(time)-MIN(time))>=0.3) as df2
LEFT JOIN df USING (ID)')
Output:
ID time intensity dif
1 A 3.15 30 0.1
2 B 6.50 30 5.8
3 B 12.30 40 5.8
4 C 3.30 110 0.2