
Subsetting data by multiple values in multiple variables in R

Let's say I have this dataset:

data1 = sample(1:250, 250)
data2 = sample(1:250, 250)
data <- data.frame(data1,data2)

If I want to subset 'data' by 30 values in both 'data1' and 'data2', what would be the best way to do that? For example, from 'data' I want to select all rows where data1 = 4 or 12 or 13 or 24 and data2 = 4 or 12 or 13 or 24. I want rows where both conditions are true.

I wrote this out like:

subdata <- subset(data, data1 == 4 |data1 == 12 |data1 == 13 |data1 == 24 & data2 == 4 |data2 == 12 |data2 == 13 |data2 == 24)

But this doesn't seem to meet both conditions; rather, it's one or the other.

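Note that in your original subset, you didn't wrap the | tests for data1 and data2 in brackets. In R, & binds more tightly than |, so the expression is evaluated as a chain of ORs in which only data1 == 24 & data2 == 4 is grouped together. This returns roughly "data1 = 4 or 12 or 13 or 24 OR data2 = 4 or 12 or 13 or 24" instead of the rows where both conditions hold. You actually want: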

subdata <- subset(data, (data1 == 4 | data1 == 12 | data1 == 13 | data1 == 24) & (data2 == 4 | data2 == 12 | data2 == 13 | data2 == 24))
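To see the effect of the brackets in isolation, here is a minimal sketch on a single made-up row (the values 4 and 100 are only an illustration, not taken from your data). data1 is in the list but data2 is not, so the row should be dropped:

```r
# One hypothetical row: data1 matches the list, data2 does not.
row <- data.frame(data1 = 4, data2 = 100)

# Without brackets, & binds tighter than |, so only
# `data1 == 24 & data2 == 4` is grouped; everything else is joined by OR.
no_brackets <- with(row,
  data1 == 4 | data1 == 12 | data1 == 13 | data1 == 24 &
  data2 == 4 | data2 == 12 | data2 == 13 | data2 == 24)

# With brackets, both groups of tests must be TRUE.
with_brackets <- with(row,
  (data1 == 4 | data1 == 12 | data1 == 13 | data1 == 24) &
  (data2 == 4 | data2 == 12 | data2 == 13 | data2 == 24))

print(no_brackets)    # TRUE: the row is (wrongly) kept
print(with_brackets)  # FALSE: the row is dropped, as intended
```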

Since you have many values to test, %in% is much more compact than a chain of == tests. Here is the same subset call written with %in%:

subdata <- subset(data, (data1 %in% c(4, 12, 13, 24)) & (data2 %in% c(4, 12, 13, 24)))
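With 30 values, it may be easier to store them once in a vector and reuse it for both columns. The following is a small sketch of that idea; the name keep and the five-row data frame are made up for illustration, so substitute your own values:

```r
# Small hand-made data frame so the result is easy to check by eye.
data <- data.frame(data1 = c(4, 12, 100, 24, 7),
                   data2 = c(13, 99, 4, 24, 7))

# Hypothetical lookup vector; replace with your 30 values.
keep <- c(4, 12, 13, 24)

# subset() form: both columns must be in `keep`.
subdata <- subset(data, data1 %in% keep & data2 %in% keep)

# The same filter written with plain logical indexing.
subdata2 <- data[data$data1 %in% keep & data$data2 %in% keep, ]

print(subdata)  # rows 1 and 4 are the only ones where both columns match
```

Because %in% returns one TRUE or FALSE per element, no extra brackets are needed around each test here.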

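Below is a dplyr approach with filter_all, which applies the same test to every column. (In recent dplyr versions filter_all is superseded in favour of filter() with if_all(), but it still works.)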

library(dplyr)
data %>%
  filter_all(all_vars(. %in% c(4, 12, 13, 24)))
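If you would rather not depend on dplyr, the same "every column must match" idea can be sketched in base R. This is a different technique from filter_all, not what it does internally, and the names keep, in_keep and all_match as well as the small data frame are made up for illustration:

```r
# Small hand-made data frame for illustration.
data <- data.frame(data1 = c(4, 12, 100, 24, 7),
                   data2 = c(13, 99, 4, 24, 7))

# Hypothetical lookup vector.
keep <- c(4, 12, 13, 24)

# One logical vector per column: is each value in `keep`?
in_keep <- lapply(data, function(col) col %in% keep)

# AND the per-column vectors together, giving one TRUE/FALSE per row.
all_match <- Reduce(`&`, in_keep)

result <- data[all_match, ]
print(result)
```

This works for any number of columns, but like filter_all it tests all of them, so select the relevant columns first if your real data has others.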

Note:

Your sample() calls draw each value in 1:250 exactly once per column, so rows where both columns fall in the list are rare, and the solutions above would likely return zero rows on your data. I've therefore modified your sample dataset so that it contains rows that actually match and can be subset.

Data:

set.seed(1)
data1 = sample(c(4, 12, 13, 24, 100, 123), 500, replace = TRUE)
data2 = sample(c(4, 12, 13, 24, 100, 123), 500, replace = TRUE)
data <- data.frame(data1,data2)
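
As a quick sanity check, the bracketed == version and the %in% version can be run side by side on this data to confirm that they select the same rows. This is only a sketch; keep, by_equals and by_in are names introduced here for illustration:

```r
# Rebuild the modified sample data from above.
set.seed(1)
data1 <- sample(c(4, 12, 13, 24, 100, 123), 500, replace = TRUE)
data2 <- sample(c(4, 12, 13, 24, 100, 123), 500, replace = TRUE)
data <- data.frame(data1, data2)

# Hypothetical lookup vector holding the values of interest.
keep <- c(4, 12, 13, 24)

# Bracketed chain of == tests.
by_equals <- subset(data, (data1 == 4 | data1 == 12 | data1 == 13 | data1 == 24) &
                          (data2 == 4 | data2 == 12 | data2 == 13 | data2 == 24))

# Same filter with %in%.
by_in <- subset(data, data1 %in% keep & data2 %in% keep)

# Both approaches should pick exactly the same rows.
same_rows <- identical(rownames(by_equals), rownames(by_in))
print(same_rows)
print(nrow(by_in))
```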
