Subset dataframe in R, dplyr filter row values of column A not NA in row of column B
I have a dataset from a time series study. Some participants didn't show up on certain days, so they have NA values in the rest of the data frame for those days. Certain study days were crucial, so I am trying to subset my data to participants who are not missing those crucial days. My dataset is actually very large, but here's the general structure:
fakedat <- data.frame(ID = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C",
                             "D", "D", "D", "D", "E", "E", "E", "E", "F", "F", "F", "F"),
                      StudyDay = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
                                   1, 2, 3, 4),
                      Ab = c(10, NA, 15, 10, 10, 20, 10, NA, 10, 10, NA, 30, NA, NA, 15, NA, 10, 20,
                             10, 30, NA, 10, NA, 20))
Now let's say it was crucial that they show up on days 2 and 4. I tried subsetting with dplyr filtering like this:
library(dplyr)  # attaches the %>% pipe used below

fakedat2 <- fakedat %>%
  dplyr::group_by(ID) %>%
  dplyr::filter(StudyDay %in% c(2, 4) & !is.na(Ab)) %>%
  dplyr::ungroup()
EDIT: But the output of this is only the rows of IDs that have a day 2 or day 4 value that's not NA. I need to find (in my real data) subjects who have NA Ab values at 4 specific Study Days. The answer I accepted below works, but I'm still curious about performing conditional filtering. In SAS you could code "IF Ab.=NA at (StudyDay=2 AND StudyDay=4) THEN ID....", or something like that.
Maybe this will achieve your goal. If all participants have all StudyDay timepoints, and you just want to check that Ab is not missing on day 2 or day 4, you can check the Ab values at those time points inside filter. In this case, an ID is omitted only if Ab is NA on both days 2 and 4 (in this example, "D").
Alternatively, if you want to require that values are available for both days 2 and 4, you can use & (AND) instead of | (OR).
library(dplyr)
fakedat %>%
  group_by(ID) %>%
  filter(!is.na(Ab[StudyDay == 2]) | !is.na(Ab[StudyDay == 4]))
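The & (AND) variant mentioned above could look like the following minimal sketch. It assumes every ID has exactly one row for each of days 2 and 4, and it rebuilds the question's fakedat compactly with rep(); the name both_days is only illustrative.

```r
library(dplyr)

# Same values as the question's fakedat, rebuilt compactly with rep()
fakedat <- data.frame(
  ID = rep(c("A", "B", "C", "D", "E", "F"), each = 4),
  StudyDay = rep(c(1, 2, 3, 4), times = 6),
  Ab = c(10, NA, 15, 10, 10, 20, 10, NA, 10, 10, NA, 30,
         NA, NA, 15, NA, 10, 20, 10, 30, NA, 10, NA, 20)
)

# Keep an ID only when Ab is present on day 2 AND on day 4
both_days <- fakedat %>%
  group_by(ID) %>%
  filter(!is.na(Ab[StudyDay == 2]) & !is.na(Ab[StudyDay == 4])) %>%
  ungroup()

unique(both_days$ID)
```

With this data, only IDs C, E, and F remain, because A is missing day 2, B is missing day 4, and D is missing both.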
If you have multiple days that must not be missing, you can use all() and check for NA in the Ab values where StudyDay is %in% a vector of required days, as follows:
required_vals <- c(2, 4)
fakedat %>%
  group_by(ID) %>%
  filter(all(!is.na(Ab[StudyDay %in% required_vals])))
Output (from the first filter, using |; the all() version is stricter and keeps only IDs C, E, and F)
   ID    StudyDay    Ab
   <chr>    <dbl> <dbl>
 1 A            1    10
 2 A            2    NA
 3 A            3    15
 4 A            4    10
 5 B            1    10
 6 B            2    20
 7 B            3    10
 8 B            4    NA
 9 C            1    10
10 C            2    10
11 C            3    NA
12 C            4    30
13 E            1    10
14 E            2    20
15 E            3    10
16 E            4    30
17 F            1    NA
18 F            2    10
19 F            3    NA
20 F            4    20
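The approaches above rely on every participant having a row for every StudyDay. If a required day can be absent entirely (no row at all, rather than a row with NA), one possible adjustment is to test whether each required day appears among the non-NA rows. This is a sketch under that assumption; fakedat_gap and complete_ids are hypothetical names, and the dropped row is invented purely for illustration.

```r
library(dplyr)

# Same values as the question's fakedat, rebuilt compactly with rep()
fakedat <- data.frame(
  ID = rep(c("A", "B", "C", "D", "E", "F"), each = 4),
  StudyDay = rep(c(1, 2, 3, 4), times = 6),
  Ab = c(10, NA, 15, 10, 10, 20, 10, NA, 10, 10, NA, 30,
         NA, NA, 15, NA, 10, 20, 10, 30, NA, 10, NA, 20)
)

# Hypothetical variant: participant C has no row at all for day 4
fakedat_gap <- fakedat[!(fakedat$ID == "C" & fakedat$StudyDay == 4), ]

required_vals <- c(2, 4)

# Keep an ID only if every required day appears among its non-NA rows
complete_ids <- fakedat_gap %>%
  group_by(ID) %>%
  filter(all(required_vals %in% StudyDay[!is.na(Ab)])) %>%
  ungroup()

unique(complete_ids$ID)
```

On fakedat_gap, the all(!is.na(Ab[StudyDay %in% required_vals])) form would still keep C, because the only required-day value it finds for C (day 2) is not NA. The check above drops C, since day 4 never appears among its observed rows.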
In base R, we can do
subset(fakedat, ID %in% ID[StudyDay %in% c(2, 4) & !is.na(Ab)])
-output
# ID StudyDay Ab
#1 A 1 10
#2 A 2 NA
#3 A 3 15
#4 A 4 10
#5 B 1 10
#6 B 2 20
#7 B 3 10
#8 B 4 NA
#9 C 1 10
#10 C 2 10
#11 C 3 NA
#12 C 4 30
#17 E 1 10
#18 E 2 20
#19 E 3 10
#20 E 4 30
#21 F 1 NA
#22 F 2 10
#23 F 3 NA
#24 F 4 20
Or a similar option in dplyr
library(dplyr)
fakedat %>%
  filter(ID %in% ID[StudyDay %in% c(2, 4) & !is.na(Ab)])
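For the follow-up in the question's edit (finding subjects whose Ab is NA on specific study days), one possible sketch is to count the missing required days per ID with summarise() and then select on that count. It assumes each ID has a row for every required day; missing_by_id, n_missing, and all_missing_ids are illustrative names, not part of the answers above.

```r
library(dplyr)

# Same values as the question's fakedat, rebuilt compactly with rep()
fakedat <- data.frame(
  ID = rep(c("A", "B", "C", "D", "E", "F"), each = 4),
  StudyDay = rep(c(1, 2, 3, 4), times = 6),
  Ab = c(10, NA, 15, 10, 10, 20, 10, NA, 10, 10, NA, 30,
         NA, NA, 15, NA, 10, 20, 10, 30, NA, 10, NA, 20)
)

required_vals <- c(2, 4)

# One row per ID: how many of the required days have a missing Ab value
missing_by_id <- fakedat %>%
  group_by(ID) %>%
  summarise(n_missing = sum(is.na(Ab[StudyDay %in% required_vals])))

# IDs whose Ab is NA on every required day
all_missing_ids <- missing_by_id$ID[missing_by_id$n_missing == length(required_vals)]
all_missing_ids
```

Here only D is NA on both required days. Selecting n_missing > 0 instead would list every ID that misses at least one required day (A, B, and D).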