简体   繁体   中英

Performing simple lookup using 2 data frames in R

In R, I have two data frames A & B as follows-

Data-Frame A:

Name      Age    City       Gender   Income    Company   ...
JXX       21     Chicago    M        20K       XYZ       ...
CXX       25     NewYork    M        30K       PQR       ...
CXX       26     Chicago    M        NA        ZZZ       ...

Data-Frame B:

Age    City       Gender    Avg Income  Avg Height  Avg Weight   ...
21     Chicago    M         30K         ...         ...          ...
25     NewYork    M         40K         ...         ...          ...
26     Chicago    M         50K         ...         ...          ...

I want to fill missing values in data frame A from data frame B.

For example, for third row in data frame AI can substitute avg income from data frame B instead of exact income. I don't want to merge these two data frames, instead want to perform look-up like operation using Age, City and Gender columns.

So I think this works for Income. If there are only those 3 columns, you could substitute the names of the other columns in:

df1<-read.table(header = T, stringsAsFactors = F, text = "
Name      Age    City       Gender   Income    Company   
JXX       21     Chicago    M        20K       XYZ       
CXX       25     NewYork    M        30K       PQR       
CXX       26     Chicago    M        NA        ZZZ")       

df2<-read.table(header = T, stringsAsFactors = F, text = "

Age    City       Gender    Avg_Income 
21     Chicago    M         30K        
25     NewYork    M         40K        
26     Chicago    M         50K        ")

df1[is.na(df1$Income),]$Income<-df2[is.na(df1$Income),]$Avg_Income

It wouldn't surprise me if one of the regulars has a better way that prevents you from having to re-type the names of the columns.

library(data.table);

## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
##    Age    City Gender Name Income
## 1:  50 NewYork      F  OOO     NA
## 2:  23 Chicago      M  SSS     NA
## 3:  62 NewYork      M  VVV     NA
## 4:  51 Chicago      F  FFF    90K
## 5:  31 Chicago      M  XXX     NA
Bt;
##    Age    City Gender Avg Income
## 1:  62 NewYork      M         NA
## 2:  51 Chicago      F        60K
## 3:  31 Chicago      M        50K
## 4:  27 NewYork      M         NA
## 5:  23 Chicago      M        60K

I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases:

  • one row in A that doesn't join with B (50/NewYork/F).
  • one row in B that doesn't join with A (27/NewYork/M).
  • two rows that join and should result in a replacement of NA in A with a non-NA value from B (23/Chicago/M and 31/Chicago/M).
  • one row that joins but has NA in B, so shouldn't affect the NA in A (62/NewYork/M).
  • one row that could join, but has non-NA in A, so shouldn't take the value from B (I assumed you would want this behavior) (51/Chicago/F). The value in A (90K) differs from the value in B (60K), so we can verify this behavior.

And I intentionally scrambled the rows of A and B to ensure we join them correctly, regardless of incoming row order.


## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
##    Age    City Gender Name Income
## 1:  50 NewYork      F  OOO     NA
## 2:  23 Chicago      M  SSS    60K
## 3:  62 NewYork      M  VVV     NA
## 4:  51 Chicago      F  FFF    90K
## 5:  31 Chicago      M  XXX    50K

In the above I filter for NA values in A first, then do a join in the j argument on the key columns and assign in-place the source column to the target column using the data.table := syntax.

Note that in the data.table world X[Y] does a right join , so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X , counter-intuitively). That's why I used Bt[.SD] instead of (the likely more natural expectation of) .SD[Bt] . We need a left join on .SD because the result of the join index expression will be assigned in-place to the target column, and so the RHS of the assignment must be a full vector correspondent to the target column.

You can repeat the in-place assignment line for each column you want to replace.


## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
##   Ai Bi
## 1  2  5
## 2  5  3
## 3  4  2
## 4  3  1
mi <- which(is.na(Af$Income[m$Ai])); Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];
Af;
##   Age    City Gender Name Income
## 2  50 NewYork      F  OOO   <NA>
## 5  23 Chicago      M  SSS    60K
## 3  62 NewYork      M  VVV   <NA>
## 6  51 Chicago      F  FFF    90K
## 4  31 Chicago      M  XXX    50K

I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column into the key-column subset of each of the A and B data.frames, then called merge() to join them (note that this is an inner join , since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.

For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, eg that its Income value is NA for the Income replacement. We can then subset the join pair table for those rows, and do a direct assignment from B to A to carry out the replacement.

As before, you can repeat the assignment line for every column you want to replace.

You can simply use the following to update the average income of the city from B to the income in A.

dataFrameA$Income = dataFrameB$`Avg Income`[match(dataFrameA$City, dataFrameB$City)]

you'll have to use "`" if the column name has a space

this is similar to using a lookup using index and match in excel. I'm assuming you're coming from excel. The code will be more compact if you use data.table

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM