Aggregating regardless of the order of columns
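I want to aggregate a data frame over two columns so that each pair of values appears only once, regardless of which column each value is in. The value column should be combined with an aggregate function such as max() or sum().
Data: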
itemID1 |itemID2 |value
---------|---------|-------
B0001 |B0001 |1
B0002 |B0001 |1
B0001 |B0002 |2
B0002 |B0002 |0
The result could be:
itemID1 |itemID2 |value
----------|----------|---------
B0001 |B0001 |1
B0001 |B0002 |3 #itemIDs could also be ordered in the other way
B0002 |B0002 |0
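So far I have implemented this in SQL so that I can run it through the sqldf library, but sqldf does not support the WITH clause.
Is it possible to aggregate a data frame like this directly in R?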
In base R. Note that this duplicates the data, because I work on a copy and leave the original unchanged.
dat2 <- dat
# apply() returns one column per input row, so transpose back to one row per pair
dat2[1:2] <- t(apply(dat2[1:2], 1, sort))
aggregate(value ~ itemID1 + itemID2, dat2, sum)
# itemID1 itemID2 value
#1 B0001 B0001 1
#2 B0001 B0002 3
#3 B0002 B0002 0
Now you can run rm(dat2) to tidy up.
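To see why the orientation of the sorted result matters, here is a small sketch on a made-up three-row character matrix (the matrix `m` and its IDs are illustrative only, not part of the question's data). apply() over rows hands back each row's result as a column, so the sorted pairs arrive as a 2 x n matrix and need t() to become one row per pair again.

```r
# Hypothetical ID pairs, one pair per row (illustration only).
m <- cbind(c("B0002", "B0001", "B0003"),
           c("B0001", "B0003", "B0002"))

# Sorting each row: the result has one COLUMN per input row.
sorted_raw <- apply(m, 1, sort)
dim(sorted_raw)   # 2 rows, 3 columns

# Transposing restores the original layout: one row per pair.
sorted <- t(sorted_raw)
dim(sorted)       # 3 rows, 2 columns
sorted
```

After the transpose, the first column always holds the smaller ID of each pair, which is what makes the two orderings of a pair fall into the same group.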
Data:
dat <-
structure(list(itemID1 = structure(c(1L, 2L, 1L, 2L), .Label = c("B0001",
"B0002"), class = "factor"), itemID2 = structure(c(1L, 1L, 2L,
2L), .Label = c("B0001", "B0002"), class = "factor"), value = c(1L,
1L, 2L, 0L)), .Names = c("itemID1", "itemID2", "value"), class = "data.frame", row.names = c(NA,
-4L))
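If the extra copy is a concern, one possible alternative is to build the order-independent keys on the fly and pass them to aggregate() through its `by` argument. This is only a sketch under the assumption that the ID columns can be converted to character; the names `key`, `res_sum` and `res_max` are made up for illustration. It also shows that switching from sum() to max(), as mentioned in the question, only means changing `FUN`.

```r
# Same values as the data above, with factor ID columns.
dat <- data.frame(
  itemID1 = c("B0001", "B0002", "B0001", "B0002"),
  itemID2 = c("B0001", "B0001", "B0002", "B0002"),
  value   = c(1L, 1L, 2L, 0L),
  stringsAsFactors = TRUE
)

# Order-independent grouping keys: smaller ID first, larger ID second.
# as.character() is used because unordered factors do not support the
# "<" comparison that pmin()/pmax() rely on.
key <- list(
  itemID1 = pmin(as.character(dat$itemID1), as.character(dat$itemID2)),
  itemID2 = pmax(as.character(dat$itemID1), as.character(dat$itemID2))
)

# Aggregate the value column by the computed keys; dat itself is not modified.
res_sum <- aggregate(dat["value"], by = key, FUN = sum)
res_max <- aggregate(dat["value"], by = key, FUN = max)

res_sum
res_max
```

Only the two key vectors are created here; the original `dat` stays untouched and no second data frame needs to be removed afterwards.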
Using dplyr and pmin / pmax:
library(dplyr)
df1 %>%
  mutate(ItemID1_ = pmin(itemID1, itemID2),
         ItemID2_ = pmax(itemID1, itemID2)) %>%
  group_by(ItemID1_, ItemID2_) %>%
  summarize_at("value", sum) %>%
  ungroup()
# # A tibble: 3 x 3
# ItemID1_ ItemID2_ value
# <chr> <chr> <int>
# 1 B0001 B0001 1
# 2 B0001 B0002 3
# 3 B0002 B0002 0
Following @A5C1D2H2I1M1N2O1R2T1's comment, you can skip the mutate step and get the same result:
df1 %>%
  group_by(itemID1_ = pmin(itemID1, itemID2),
           itemID2_ = pmax(itemID1, itemID2)) %>%
  summarise_at("value", sum) %>%
  ungroup()
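Both variants depend on pmin() and pmax() comparing the two ID columns element by element. A minimal base R sketch of that step, using plain character vectors (the names `id1` and `id2` are made up for illustration, and the tibble output above suggests df1 holds character rather than factor columns):

```r
# The two ID columns from the question, as character vectors.
id1 <- c("B0001", "B0002", "B0001", "B0002")
id2 <- c("B0001", "B0001", "B0002", "B0002")

# Element-wise minimum and maximum: position i compares id1[i] with id2[i].
lo <- pmin(id1, id2)   # "B0001" "B0001" "B0001" "B0002"
hi <- pmax(id1, id2)   # "B0001" "B0002" "B0002" "B0002"

# Rows 2 and 3 now share the same (lo, hi) pair, so they land in one group.
paste(lo, hi)
```

If the columns were factors instead, converting them with as.character() first would probably be needed before this comparison makes sense.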
If you want to stick with sqldf, here is another solution:
library(sqldf)
sqldf("select itemID1, itemID2, sum(value) as value
       from (select case when itemID1 <= itemID2 then itemID1 else itemID2 end as itemID1,
                    case when itemID1 > itemID2 then itemID1 else itemID2 end as itemID2,
                    value
             from df)
       group by itemID1, itemID2")
Result:
itemID1 itemID2 value
1 B0001 B0001 1
2 B0001 B0002 3
3 B0002 B0002 0
Data:
df = structure(list(itemID1 = structure(c(1L, 2L, 1L, 2L), .Label = c("B0001",
"B0002"), class = "factor"), itemID2 = structure(c(1L, 1L, 2L,
2L), .Label = c("B0001", "B0002"), class = "factor"), value = c(1L,
1L, 2L, 0L)), .Names = c("itemID1", "itemID2", "value"), class = "data.frame", row.names = c(NA,
-4L))
For the sake of completeness, here is a data.table solution as well:
library(data.table)
setDT(DT)[, .(value = sum(value)),
by = .(itemID1 = pmin(itemID1, itemID2), itemID2 = pmax(itemID1, itemID2))]
   itemID1 itemID2 value
1:   B0001   B0001     1
2:   B0001   B0002     3
3:   B0002   B0002     0
Data:
DT <- fread("itemID1 |itemID2 |value
B0001 |B0001 |1
B0002 |B0001 |1
B0001 |B0002 |2
B0002 |B0002 |0", sep = "|")