Summarise data between occurrences of a given value of a categorical variable
I am looking for a clever and fast way to summarise data in a data frame. The data and the desired output are shown below:
categoriesVector <- c("A", "A", "B", "A", "B", "B", "B", "A", "B")
propertyVector <- 1:length(categoriesVector)
dataVector <- 100 * rev(propertyVector)
df <- data.frame(categoriesVector, propertyVector, dataVector, stringsAsFactors = FALSE)
df
desiredData <- c(700, sum(500, 400, 300), 100)
desiredProperty1 <- c(3, 5, 9)
desiredProperty2 <- c(3, 7, 9)
desiredDF <- data.frame(desiredData, desiredProperty1, desiredProperty2)
desiredDF
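Basically, I need to sum data and keep the first and last property between two occurrences of Category A. After a lot of head-banging I came up with a clumsy solution, which I would like to improve in terms of clarity and performance, preferably using dplyr: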
numRows <- dim(df)[1]
.groupedID <- rep(NA, numRows)
ID <- 1
# Rows with category "A" get ID 0; each run of "B" rows shares one non-zero ID
.groupedID[[1]] <- ifelse(df$categoriesVector[[1]] == "A", 0, ID)
for(i in 2:numRows)
{
  if(df$categoriesVector[i] == "B")
  {
    .groupedID[i] <- ID
    if(df$categoriesVector[i - 1] == "B")
    {
      # Continue the run started by the previous "B" row
      .groupedID[i] <- .groupedID[i - 1]
    }
    ID <- ID + 1
  } else
  {
    .groupedID[i] <- 0
  }
}
library(dplyr)

tempDF <-
  df %>%
  mutate(ID = .groupedID) %>%
  filter(ID != 0) %>%
  group_by(ID) %>%
  summarise(desiredProperty1 = head(propertyVector, 1),
            desiredProperty2 = tail(propertyVector, 1),
            desiredData = sum(dataVector)) %>%
  select(desiredData, desiredProperty1, desiredProperty2)
tempDF
You can use cumsum() to build the groups and then process them as follows:
df %>%
  mutate(Agroups = cumsum(categoriesVector == "A")) %>%
  filter(categoriesVector == "B") %>%
  group_by(Agroups) %>%
  summarise(propertyStart = min(propertyVector),
            propertyEnd = max(propertyVector),
            dataTotal = sum(dataVector))
# A tibble: 3 x 4
  Agroups propertyStart propertyEnd dataTotal
    <int>         <dbl>       <dbl>     <dbl>
1       2             3           3       700
2       3             5           7      1200
3       4             9           9       100
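To see why the cumsum() grouping works, it can help to look at the grouping vector on its own. The following is a minimal base R sketch that re-creates categoriesVector from the question. Agroups is the name used in the answer above, while bGroups is a hypothetical name introduced here only for illustration.

```r
categoriesVector <- c("A", "A", "B", "A", "B", "B", "B", "A", "B")

# The counter goes up by one at every "A", so all "B" rows that follow
# the same "A" end up sharing one value.
Agroups <- cumsum(categoriesVector == "A")
Agroups
# [1] 1 2 2 3 3 3 3 4 4

# Keeping only the "B" rows leaves one distinct value per span of "B"s.
bGroups <- Agroups[categoriesVector == "B"]
bGroups
# [1] 2 3 3 3 4
```

The group labels (2, 3, 4) are not consecutive from 1 because every "A" advances the counter, including "A"s that are not followed by a "B". That does not matter for the grouping itself.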
Here is my take using data.table. First create a spanNumber variable that identifies each span of "B"s bounded by "A"s, then compute the variables you specified:
library(data.table)
setDT(df)
df[, catShiftConcat := paste0(categoriesVector, shift(categoriesVector, fill = "A"))]
df[categoriesVector == "B", spanNumber := cumsum(catShiftConcat == "BA")]
df[!is.na(spanNumber), .(desiredData = sum(dataVector),
                         desiredProperty1 = propertyVector[1],
                         desiredProperty2 = propertyVector[.N]), by = spanNumber]
##    spanNumber desiredData desiredProperty1 desiredProperty2
## 1:          1         700                3                3
## 2:          2        1200                5                7
## 3:          3         100                9                9
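The span-numbering idea can also be traced without data.table. The sketch below is a base R analogue, not the code from the answer: previousCategory and startsSpan are hypothetical helper names, and the lagged vector built with head() is assumed to stand in for shift(categoriesVector, fill = "A").

```r
categoriesVector <- c("A", "A", "B", "A", "B", "B", "B", "A", "B")

# Previous element of the vector, padded with "A" at the start.
previousCategory <- c("A", head(categoriesVector, -1))

# A span starts wherever a "B" directly follows an "A".
startsSpan <- categoriesVector == "B" & previousCategory == "A"

# Number the spans and keep the numbers only on "B" rows.
spanNumber <- ifelse(categoriesVector == "B", cumsum(startsSpan), NA)
spanNumber
# [1] NA NA  1 NA  2  2  2 NA  3
```

Padding with "A" means a vector that begins with "B" still opens span 1 on its first row.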
Another data.table approach, which uses rleid to group runs of the category vector, is
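library(data.table)
setDT(df)[, .(categoriesVector = categoriesVector[1],
              desiredData = sum(dataVector),
              desiredProperty1 = propertyVector[1],
              desiredProperty2 = propertyVector[.N]),
          by = rleid(categoriesVector)
          ][categoriesVector == "B", ][, c("rleid", "categoriesVector") := NULL][]
The first [] computes the desired summaries, aggregated over runs of the category vector. Taking categoriesVector[1] instead of the whole column keeps one row per run; returning the full column would repeat the summary row once for every observation in the run. The second [] in the chain keeps only the runs where the category vector is B. The third [] removes the two helper variables, and the last [] is there to print the result to the screen.
This returns
   desiredData desiredProperty1 desiredProperty2
1:         700                3                3
2:        1200                5                7
3:         100                9                9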
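For readers unfamiliar with run-length ids, here is a small base R sketch of the kind of grouping vector that rleid produces for this data. It uses rle() rather than data.table, so it is only an approximation of what rleid does internally; runs, runId and bRuns are hypothetical names.

```r
categoriesVector <- c("A", "A", "B", "A", "B", "B", "B", "A", "B")

# rle() splits the vector into runs of identical consecutive values.
runs <- rle(categoriesVector)

# Give every element the index of the run it belongs to.
runId <- rep(seq_along(runs$lengths), runs$lengths)
runId
# [1] 1 1 2 3 4 4 4 5 6

# Runs made up of "B" are the ones the answer keeps.
bRuns <- which(runs$values == "B")
bRuns
# [1] 2 4 6
```

Every change of value starts a new id, so both the "A" runs and the "B" runs are numbered, which is why the answer filters on categoriesVector == "B" after aggregating.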