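Group rows in dataframe by common elements using R
I have a dataset in which different rows have different combinations of elements, and I want to extract the groups of rows that share the same combination of elements. For this example dataset: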
id <- c("A", "B", "C", "D")
X1 <- c(NA,NA,NA,"X1")
X2 <- c(NA,NA,"X2","X2")
X3 <- c("X3","X3","X3","X3")
X4 <- c("X4", "X4", "X4", "X4")
df <- data.frame(id,X1,X2,X3,X4)
> df
id X1 X2 X3 X4
1 A <NA> <NA> X3 X4
2 B <NA> <NA> X3 X4
3 C <NA> X2 X3 X4
4 D X1 X2 X3 X4
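I would like to be able to pull out the rows that share the same combination of elements.
I tried splitting the data frame into a list and removing the empty cells, so that each id gets its own data.frame in the list: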
df.list <- split(df, seq(nrow(df)))
dfComplete.list <- lapply(df.list, function(remNA) remNA[,colSums(is.na(remNA)) < nrow(remNA)])
This gives me:
> dfComplete.list
$`1`
id X3 X4
1 1 X3 X4
$`2`
id X3 X4
2 2 X3 X4
$`3`
id X2 X3 X4
3 3 X2 X3 X4
$`4`
id X1 X2 X3 X4
4 4 X1 X2 X3 X4
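I'm stuck on where to go from here. Is there a way to group the data frames in the list by the elements/columns they have in common?
The actual dataset I'm working with has elements/columns X7 through X17, and each id has between 1 and 4 elements, so an ideal solution would identify every combination of elements present in the data.
Finally, before I reshaped the data into the format above, it was originally in the following long format, in case there is an easier way to reach a solution from the original format: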
id <- c("A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D")
elements <- c("X3", "X4", "X3", "X4", "X2", "X3", "X4", "X1", "X2", "X3", "X4")
dataLong <- data.frame(id, elements)
> dataLong
id elements
1 A X3
2 A X4
3 B X3
4 B X4
5 C X2
6 C X3
7 C X4
8 D X1
9 D X2
10 D X3
11 D X4
Thanks in advance for your help!
The reshape2::dcast function can help convert the data from long format to the format the OP expects.
#Data
id <- c("A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D")
elements <- c("X3", "X4", "X3", "X4", "X2", "X3", "X4", "X1", "X2", "X3", "X4")
dataLong <- data.frame(id, elements, stringsAsFactors = FALSE)
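library(reshape2)
# Use dcast to get the result. The call is written without %>% because
# reshape2 does not provide the pipe, and value.var is named explicitly
# so dcast does not have to guess the value column.
dcast(dataLong, id ~ elements, value.var = "elements")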
# id X1 X2 X3 X4
# 1 A <NA> <NA> X3 X4
# 2 B <NA> <NA> X3 X4
# 3 C <NA> X2 X3 X4
# 4 D X1 X2 X3 X4
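Once the data is in this wide format, the ids still have to be grouped by which columns are filled in. Here is a minimal base R sketch of one way to do that, assuming every column other than id is an element column. The names element_cols, combo, and groups are illustrative and not part of the original answer.

```r
# Wide example data from the question
id <- c("A", "B", "C", "D")
X1 <- c(NA, NA, NA, "X1")
X2 <- c(NA, NA, "X2", "X2")
X3 <- c("X3", "X3", "X3", "X3")
X4 <- c("X4", "X4", "X4", "X4")
df <- data.frame(id, X1, X2, X3, X4, stringsAsFactors = FALSE)

# Assumption: every column except id holds an element
element_cols <- setdiff(names(df), "id")

# Build one key per row from the names of its non-NA columns
combo <- apply(df[element_cols], 1, function(row) {
  paste(names(row)[!is.na(row)], collapse = "+")
})

# Collect the ids that share each key
groups <- split(df$id, combo)
groups
```

Each element of groups is named after a combination and holds the ids that have exactly that combination, so lengths(groups) gives the count per combination.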
As I understand it, you want to count the unique combinations. This is how I would do it:
library(dplyr)
library(tidyr)
dataLong %>% mutate(value=1) %>%
spread(elements, value) %>%
select(-id) %>%
group_by_all() %>%
summarise(count=n()) %>% ungroup()
#> # A tibble: 3 x 5
#> X1 X2 X3 X4 count
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 1 1 1 1
#> 2 NA 1 1 1 1
#> 3 NA NA 1 1 2
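If you also need to know which ids belong to each combination, the long format can be used directly without reshaping. Here is a rough base R sketch, assuming dataLong as defined in the question. The names combo_by_id and ids_by_combo are illustrative.

```r
# Long example data from the question
id <- c("A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D")
elements <- c("X3", "X4", "X3", "X4", "X2", "X3", "X4", "X1", "X2", "X3", "X4")
dataLong <- data.frame(id, elements, stringsAsFactors = FALSE)

# One key per id: its sorted, de-duplicated elements pasted together
combo_by_id <- vapply(
  split(dataLong$elements, dataLong$id),
  function(x) paste(sort(unique(x)), collapse = "+"),
  character(1)
)

# Ids grouped by their combination key
ids_by_combo <- split(names(combo_by_id), unname(combo_by_id))

# Number of ids per combination
combo_counts <- lengths(ids_by_combo)
ids_by_combo
```

Sorting the elements before pasting makes the key independent of the order in which an id's rows appear in the long data.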
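You can use the tidyverse for this! The use of arrange() is somewhat redundant, but I wanted to show you that option because it orders your data frame to reflect the groupings you are interested in (you can think of it as a kind of nested sort). That may be all you need.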
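If you need the actual counts, along with a column telling you which ids correspond to which combinations, just run the full code below. Note that you will have to add all of your variables (X7:X17) to the full code. You will also need to set stringsAsFactors = FALSE when declaring your data, which is generally good practice.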
# Your example dataframe. Make sure to set stringsAsFactors = FALSE
id <- c("A", "B", "C", "D")
X1 <- c(NA,NA,NA,"X1")
X2 <- c(NA,NA,"X2","X2")
X3 <- c("X3","X3","X3","X3")
X4 <- c("X4", "X4", "X4", "X4")
df <- data.frame(id,X1,X2,X3,X4, stringsAsFactors = FALSE)
# We group rows by all unique combinations and then collapse those rows,
# while recording which ids belong to which grouping, and how many there are
# in each.
library(tidyverse)
ndf <- arrange(df, X1,X2,X3,X4) %>%
group_by(X1,X2,X3,X4) %>%
summarise(num = n(), id = paste(id, collapse=","))
# Output:
# A tibble: 3 x 6
# Groups: X1, X2, X3 [?]
# X1 X2 X3 X4 num id
# <chr> <chr> <chr> <chr> <int> <chr>
# 1 X1 X2 X3 X4 1 D
# 2 <NA> X2 X3 X4 1 C
# 3 <NA> <NA> X3 X4 2 A,B
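For the real data with columns X7 through X17, you can avoid typing every column name. The sketch below assumes dplyr 1.0.0 or later, where across() can be used inside group_by(), and assumes that every column other than id is an element column. It is shown on the example data and has not been tested on the real dataset.

```r
library(dplyr)

# Example data frame from above
id <- c("A", "B", "C", "D")
X1 <- c(NA, NA, NA, "X1")
X2 <- c(NA, NA, "X2", "X2")
X3 <- c("X3", "X3", "X3", "X3")
X4 <- c("X4", "X4", "X4", "X4")
df <- data.frame(id, X1, X2, X3, X4, stringsAsFactors = FALSE)

# Group by every column except id, then count the rows in each group
# and record which ids fall into it
ndf <- df %>%
  group_by(across(-id)) %>%
  summarise(num = n(), id = paste(id, collapse = ","), .groups = "drop")
ndf
```

Because the grouping columns are selected by exclusion, the same code should work unchanged when the element columns are X7 to X17.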