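Select ids that appear at least n times in each group in R
I am working with a huge dataset containing ids and times. Suppose I want to select the ids that appear at least twice in month 1 and in month 2 (in the real data, I want to select ids that appear at least 15 times in the specified months). How can I do this in R? Here is the dataset: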
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3),
month = c(1,1,1,2,2,1,1,2,2,2,3,3))
This is what I want:
df_1 <- data.frame(id = c(1,1,1,1,1,2,2,2,2),
month = c(1,1,1,2,2,1,1,2,2))
Thanks in advance!
tidyverse:
library(dplyr)
df %>%
group_by(id, month) %>%
filter(month %in% 1:2, n() >= 2) %>%
ungroup()
# # A tibble: 9 x 2
# id month
# <dbl> <dbl>
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 2
# 5 1 2
# 6 2 1
# 7 2 1
# 8 2 2
# 9 2 2
data.table:
library(data.table)
DT <- as.data.table(df)
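# both columns are grouping columns, so .SD would have no columns here;
# build the per-group row count n explicitly and filter on it instead
DT[, .(n = rep(.N, .N)), by = .(month, id)][month < 3 & n >= 2]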
# month id n
# 1: 1 1 3
# 2: 1 1 3
# 3: 1 1 3
# 4: 2 1 2
# 5: 2 1 2
# 6: 1 2 2
# 7: 1 2 2
# 8: 2 2 2
# 9: 2 2 2
A quick and dirty base R solution. It can certainly be optimized.
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3),
month = c(1,1,1,2,2,1,1,2,2,2,3,3))
# select months 1 and 2 only
sel.months <- c(1, 2)
df2 <- df[df$month %in% sel.months,]
# count ids
tb <- as.matrix(table(df2))
# get ids whose count is >= 2 in every selected month
ids <- as.integer(rownames(tb)[apply(tb >= 2, 1, all)])
# filter data
df.filtered <- df[df$id %in% ids,]
df.filtered
A prettier solution using dplyr:
library(dplyr)
df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3),
month = c(1,1,1,2,2,1,1,2,2,2,3,3))
sel.months <- c(1, 2)
df.filtered <- df %>%
filter(month %in% sel.months) %>%
group_by(id, month) %>%
# n() is the number of rows in the current (id, month) group
mutate(count = n()) %>%
filter(count >= 2)
If you want to select the ids that appear at least twice across months 1 and 2 combined:
library(dplyr)
df %>% group_by(id) %>% filter(sum(month %in% 1:2) >= 2)
# id month
# <dbl> <dbl>
#1 1 1
#2 1 1
#3 1 1
#4 1 2
#5 1 2
#6 2 1
#7 2 1
#8 2 2
#9 2 2
The equivalent data.table:
library(data.table)
setDT(df)[, .SD[sum(month %in% 1:2) >= 2], id]
And a base R solution:
subset(df, ave(month %in% 1:2, id, FUN = sum) >= 2)
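Note that the three one-liners above count an id's rows across months 1 and 2 combined. If the requirement is that the threshold be met in each selected month separately, something like the following base R sketch might help. The helper name `filter_ids_by_month` and its arguments `months` and `min_n` are hypothetical (they do not come from the question or the answers), and the sketch assumes the data has the columns `id` and `month` as in the example.

```r
# Hypothetical helper (not from the answers above): keep the rows of the
# selected months for ids that have at least `min_n` rows in EVERY selected month.
filter_ids_by_month <- function(df, months, min_n) {
  # restrict to the months of interest
  sub <- df[df$month %in% months, , drop = FALSE]
  # id x month count table; factor() forces one column per selected month,
  # so a month in which an id never appears is counted as 0
  counts <- table(sub$id, factor(sub$month, levels = months))
  # an id qualifies only if it reaches min_n in all selected months
  keep <- rownames(counts)[rowSums(counts >= min_n) == length(months)]
  sub[as.character(sub$id) %in% keep, , drop = FALSE]
}

df <- data.frame(id = c(1,1,1,1,1,2,2,2,2,3,3,3),
                 month = c(1,1,1,2,2,1,1,2,2,2,3,3))
res <- filter_ids_by_month(df, months = 1:2, min_n = 2)
res
```

For the real data described in the question, the call would presumably use `min_n = 15` and whichever months are of interest.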