
R - Compare column values in data frames of differing lengths by unique ID

I'm sure I can figure out a straightforward solution to this problem, but I didn't see a comparable question so I thought I'd post a question.

I have a longitudinal dataset with thousands of respondents over several time intervals. Everything from the questions to the data types can differ between waves, and building indicators or dummy variables often requires long chains of boolean tests. Each respondent has a unique ID, though, and no respondents were added to the surveys after the first wave, so that part is easy enough.

The issue is that while the early waves consist of one (Stata) file each, the later waves contain lots of addendum files, structured differently. So, for example, in constructing indicators for the sex of previous partners, there were columns (for one wave) called partnerNum and sex, with up to 16 rows for each unique ID (respondent). It was easy enough to spread (or cast) that data into partnerDF, with a single row for each unique ID and columns partnerNum_1 ... partnerNum_16 holding the value from the sex column. Then it's easy to construct indicators like:

sexuality$newIndicator[mainDF$bioSex == "Male" & apply(partnerDF[1:16] == "Male", 1, any)] <- 1
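As a rough sketch of that cast-then-test step, the following uses base R's reshape() on made-up data. The data frames partnerLong and mainDF and their values are hypothetical stand-ins, not the real survey files, and the wide columns come out as sex_1, sex_2, ... rather than partnerNum_1, ... as in the description above. Respondents with fewer partners get NA in the unused columns, so na.rm = TRUE is passed to any() to keep those rows from turning into NA.

```r
# Hypothetical long-format partner data: one row per (ID, partnerNum).
partnerLong <- data.frame(
  ID         = c(1, 1, 2, 3, 3, 3),
  partnerNum = c(1, 2, 1, 1, 2, 3),
  sex        = c("Female", "Male", "Female", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Hypothetical main file: one row per respondent.
mainDF <- data.frame(ID = 1:3, bioSex = c("Male", "Male", "Female"),
                     stringsAsFactors = FALSE)

# Cast to one row per ID with columns sex_1, sex_2, sex_3.
partnerDF <- reshape(partnerLong, idvar = "ID", timevar = "partnerNum",
                     v.names = "sex", direction = "wide", sep = "_")

# Per-row test across the partner columns; NA cells are ignored.
anyMale <- apply(partnerDF[-1] == "Male", 1, any, na.rm = TRUE)

# Align the per-ID result with mainDF's row order before combining.
anyMale <- anyMale[match(mainDF$ID, partnerDF$ID)]

mainDF$newIndicator <- as.integer(mainDF$bioSex == "Male" & anyMale)
```

The match() step is there because nothing guarantees the cast data has the same row order as the main file.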

For other addendum files in the last two waves the data is structured long like the partner data, with multiple rows for each unique ID, but rather than just one variable like sex there are hundreds that I need to test against to construct indicators, all coded with different types, so it's impractical to spread (or cast) the data wide (never mind writing those bools). There are actually several of these files for each wave, and the way they are structured, some respondents (unique IDs) occupy just 1 row and some a few dozen. (I've left_join'ed the addendum files together for each wave.)

What I'd like to be able to do is test something like:

newDF$indicator[any(waveIIIAdds$var1 == 1) & any(waveIIIAdds$var2 == 1)] <- 1

or

newDF$indicator[mainDF$var1 == 1 & any(waveIIIAdds$var2 == 1)] <- 1

where newDF is the same length as mainDF (one row per unique ID).

So, for example, suppose I had two data frames:

df1 <- data.frame(ID = c(1:4), A = rep("a"), stringsAsFactors = FALSE)
df2 <- data.frame(ID = rep(1:4, each=2), B = rep(1:2, 2), stringsAsFactors = FALSE)
df1$A[1] <- "b"
df1$A[3] <- "b"
df2$B[8] <- 3

> df1   > df2
ID A    ID B
1  b    1  1
2  a    1  2
3  b    2  1
4  a    2  2
        3  1
        3  2
        4  1
        4  3

I'd like to run a test like this (assuming df3 has one column, just the unique IDs from df1):

df3$new <- 0
df3$new[df1$ID[df1$A == "a"] & df2$ID[df2$B == 2]] <- 1

So that df3 would have one unique ID per row, and since there is an "a" in df1$A only for IDs 2 and 4 (df1$A[1] and df1$A[3] are "b") and a 2 in at least one row of df2$B for all IDs except the last ID (df2$B[7:8]), the result would be:

> df3
ID new
1  0
2  1
3  0
4  0

and

df3$new <- 0
df3$new[df1$ID[df1$A == "a"] | df2$ID[df2$B == 2]] <- 1

> df3
ID new
1  1
2  1
3  1
4  1

This does it...

df3 <- data.frame(ID=unique(df1$ID),
                  new=sapply(unique(df1$ID),function(x) 
                     as.numeric(x %in% df1$ID[df1$A == "a"] & x %in% df2$ID[df2$B == 2])))

df3
  ID new
1  1   0
2  2   1
3  3   0
4  4   0
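A different route to the same result, sketched here as an alternative and not as part of the answer above, is to collapse the long data frame to one flag per ID with tapply() and then line those flags up with df1. It assumes df1 has exactly one row per ID; the intermediate names anyB2 and hasB2 are made up for the illustration.

```r
# Example data from the question.
df1 <- data.frame(ID = 1:4, A = c("b", "a", "b", "a"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = rep(1:4, each = 2), B = c(1, 2, 1, 2, 1, 2, 1, 3))

# One TRUE/FALSE per ID: does any row for that ID have B == 2?
anyB2 <- tapply(df2$B == 2, df2$ID, any, na.rm = TRUE)

# Line the per-ID flags up with df1's rows; IDs missing from df2 become FALSE.
hasB2 <- as.vector(anyB2)[match(as.character(df1$ID), names(anyB2))]
hasB2[is.na(hasB2)] <- FALSE

# Combine the row-level test on df1 with the per-ID flag from df2.
df3 <- data.frame(ID = df1$ID, new = as.numeric(df1$A == "a" & hasB2))
```

The per-ID flags can be reused across many indicators, which may matter when the long file is large and the same condition appears in several tests.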

I came up with a parsimonious solution after thinking about it for a few minutes on returning to the problem (rather than in the wee hours of the morning, when I posted).

I wanted something a graduate student who will likely construct thousands of indicators or dummy variables this way and may learn R first, or even only ever learn R, could use. The following provides a solution for the example and actual data using the same schema:

If the data frame has already been created with the IDs, and the dummy indicator column has already been initialized to zero, as assumed in the example:

df3 <- data.frame(ID = df1$ID)
df3$new <- 0

My solution was:

df3$new[df1$ID %in% df1$ID[df1$A == "a"] & df1$ID %in% df2$ID[df2$B == 2]] <- 1 

> df3
ID new
1   0
2   1
3   0
4   0

Using | (or) instead:

df3$new[df1$ID %in% df1$ID[df1$A == "a"] | df1$ID %in% df2$ID[df2$B == 2]] <- 1 

> df3
ID new
1   1
2   1
3   1
4   1
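Since the same pattern would be repeated for many indicators, one option is to wrap the ID lookup in a small helper. This is only a sketch: idsWhere() and the column names newAnd and newOr are invented here, and the helper assumes the ID column is called "ID" unless told otherwise. It uses which() so that rows where the condition is NA are dropped and never count as a match.

```r
# Example data from the question.
df1 <- data.frame(ID = 1:4, A = c("b", "a", "b", "a"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = rep(1:4, each = 2), B = c(1, 2, 1, 2, 1, 2, 1, 3))

# Hypothetical helper: IDs with at least one row where `cond` is TRUE.
# which() discards NA conditions.
idsWhere <- function(df, cond, idCol = "ID") {
  unique(df[[idCol]][which(cond)])
}

df3 <- data.frame(ID = unique(df1$ID))

# Test df3's own IDs, so the result does not depend on df1's row order.
inA <- df3$ID %in% idsWhere(df1, df1$A == "a")
inB <- df3$ID %in% idsWhere(df2, df2$B == 2)

df3$newAnd <- as.integer(inA & inB)
df3$newOr  <- as.integer(inA | inB)
```

Indexing with df3$ID rather than df1$ID avoids relying on the two data frames having their rows in the same order.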

