
Combine two similar columns in R

I'm trying to combine two columns that essentially contain the same information, but each column is missing some values that the other has. Column "wasiIQw1" holds the data for half of the group, while column "w1iq" holds the data for the other half of the group.

    select(gadd.us,nidaid,wasiIQw1,w1iq)[1:10,]
         nidaid wasiIQw1 w1iq
1  45-D11150341      104   NA
2  45-D11180321       82   NA
3  45-D11220022       93   93
4  45-D11240432      118   NA
5  45-D11270422       99   NA
6  45-D11290422       82   82
7  45-D11320321       99   99
8  45-D11500021       99   99
9  45-D11500311       95   95
10 45-D11520011      111  111

    select(gadd.us,nidaid,wasiIQw1,w1iq)[384:394,]
       nidaid wasiIQw1 w1iq
384 H1900442S       NA   62
385 H1930422S       NA   83
386 H1960012S       NA   89
387 H1960321S       NA   90
388 H2020011S       NA   96
389 H2020422S       NA  102
390 H2040011S       NA  102
391 H2040331S       NA   94
392 H2040422S       NA  103
393 H2050051S       NA   86
394 H2050341S       NA   98

With the following code I joined df.a (a data frame with the id and wasiIQw1) with df.b (a data frame with the id and w1iq) and got the following results.

df.join <- semi_join(df.a,
                     df.b,
                     by = "nidaid")

     nidaid w1iq
1  45-D11150341   NA
2  45-D11180321   NA
3  45-D11220022   93
4  45-D11240432   NA
5  45-D11270422   NA
6  45-D11290422   82
7  45-D11320321   99
8  45-D11500021   99
9  45-D11500311   95
10 45-D11520011  111

    nidaid w1iq
384 H1900442S   62
385 H1930422S   83
386 H1960012S   89
387 H1960321S   90
388 H2020011S   96
389 H2020422S  102
390 H2040011S  102
391 H2040331S   94
392 H2040422S  103
393 H2050051S   86
394 H2050341S   98

All of this works except for the first four "NA"s, which won't merge. Other "_join" functions from dplyr have not worked either. Do you have any tips for combining these two columns so that no data is lost and every "NA" is filled in whenever the other column has a value?

You can use coalesce() from dplyr here; it returns the first non-missing value at each position.

library(dplyr)
gadd.us %>% mutate(w1iq = coalesce(w1iq, wasiIQw1))

This takes the value from w1iq when it is present; when w1iq is NA, it takes the value from wasiIQw1 instead. Swap the order of w1iq and wasiIQw1 if you want to give priority to wasiIQw1.
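To make the behaviour concrete, here is a minimal sketch on a small made-up data frame (the object `toy` and the output column `iq` are names assumed for illustration, not from the question). It writes the merged values to a new column so that both originals stay available for checking.

```r
library(dplyr)

# Toy data (made up for illustration): each column covers part of the group
toy <- data.frame(
  nidaid   = c("A1", "A2", "A3", "A4"),
  wasiIQw1 = c(104, 93, NA, NA),
  w1iq     = c(NA, 93, 62, NA)
)

# Take w1iq where present, otherwise fall back to wasiIQw1
toy <- toy %>% mutate(iq = coalesce(w1iq, wasiIQw1))

toy$iq
```

A row where both columns are NA stays NA in the result, since there is nothing to fill it with.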

Here is a way to do it in base R (no packages needed).

Create reproducible data:

> dat <- data.frame(nidaid = paste0("H", 1:5), wasiIQw1 = c(NA, NA, NA, 75, 9), w1iq = c(44, 21, 46, 75, NA))
> dat
  nidaid wasiIQw1 w1iq
1     H1       NA   44
2     H2       NA   21
3     H3       NA   46
4     H4       75   75
5     H5        9   NA

Create a new column named new that combines the two. The ifelse statement says: if the value in the first column, wasiIQw1, is not (!) an NA (is.na()), take it; otherwise take the value from the second column. As in the coalesce answer, you can swap the column names to give one column preference over the other.

> dat$new <- ifelse(!is.na(dat$wasiIQw1), dat$wasiIQw1, dat$w1iq)
> dat
  nidaid wasiIQw1 w1iq new
1     H1       NA   44  44
2     H2       NA   21  21
3     H3       NA   46  46
4     H4       75   75  75
5     H5        9   NA   9
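One thing to be aware of is that ifelse() drops attributes, so dates or factors would come back as plain numbers. That does not matter for numeric IQ scores, but as a possible alternative, the same fill can be sketched with logical indexing, which leaves the column's type alone. The helper name `is_missing` is made up for this example.

```r
# Same toy data as in the answer above
dat <- data.frame(
  nidaid   = paste0("H", 1:5),
  wasiIQw1 = c(NA, NA, NA, 75, 9),
  w1iq     = c(44, 21, 46, 75, NA)
)

# Start from the preferred column, then fill only its missing positions
dat$new <- dat$wasiIQw1
is_missing <- is.na(dat$new)
dat$new[is_missing] <- dat$w1iq[is_missing]

dat$new
```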

Using base R, we can take the element-wise maximum of the two columns, ignoring NAs:

 gadd.us$w1iq <- with(gadd.us, pmax(w1iq, wasiIQw1, na.rm = TRUE))
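Note that pmax() gives the same result as the other answers only when the two columns agree wherever both are present, as they do in the rows shown in the question. A small sketch with made-up scores, where row 2 deliberately disagrees, shows the difference from a priority rule:

```r
# Made-up scores; row 2 disagrees between the two columns
wasiIQw1 <- c(104, 90, NA, NA)
w1iq     <- c(NA, 95, 62, NA)

# pmax keeps the larger value where both are present
by_max <- pmax(w1iq, wasiIQw1, na.rm = TRUE)

# A priority rule keeps wasiIQw1 whenever it is present
by_priority <- ifelse(!is.na(wasiIQw1), wasiIQw1, w1iq)

by_max
by_priority
```

If the columns might disagree, it is probably safer to pick an explicit priority than to rely on the maximum.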


 