Performing a simple lookup using two data frames in R

Question

In R, I have two data frames, A and B, as follows:

Data frame A:

Name      Age    City       Gender   Income    Company   ...
JXX       21     Chicago    M        20K       XYZ       ...
CXX       25     NewYork    M        30K       PQR       ...
CXX       26     Chicago    M        NA        ZZZ       ...

Data frame B:

Age    City       Gender    Avg Income  Avg Height  Avg Weight   ...
21     Chicago    M         30K         ...         ...          ...
25     NewYork    M         40K         ...         ...          ...
26     Chicago    M         50K         ...         ...          ...

I want to fill missing values in data frame A from data frame B.

For example, for the third row in data frame A, I can substitute the average income from data frame B for the missing exact income. I don't want to merge these two data frames; instead, I want to perform a lookup-like operation using the Age, City and Gender columns.

Answer 1

So I think this works for Income. If there are only those three columns, you could substitute the names of the other columns in:

df1<-read.table(header = T, stringsAsFactors = F, text = "
Name      Age    City       Gender   Income    Company   
JXX       21     Chicago    M        20K       XYZ       
CXX       25     NewYork    M        30K       PQR       
CXX       26     Chicago    M        NA        ZZZ")       

df2<-read.table(header = T, stringsAsFactors = F, text = "

Age    City       Gender    Avg_Income 
21     Chicago    M         30K        
25     NewYork    M         40K        
26     Chicago    M         50K        ")

# Note: this pairs rows by position, so it relies on df1 and df2 being in the same row order.
df1[is.na(df1$Income),]$Income<-df2[is.na(df1$Income),]$Avg_Income

It wouldn't surprise me if one of the regulars has a better way that saves you from having to retype the column names.
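The line above pairs rows by position. As a minimal sketch of a lookup on the Age, City and Gender keys instead, the following uses base R match() on pasted key strings. The data mirrors the example tables, with df2 reordered on purpose; make_key is a hypothetical helper name, and the "|" separator assumes no key value contains that character.

```r
df1 <- data.frame(
  Name   = c("JXX", "CXX", "CXX"),
  Age    = c(21L, 25L, 26L),
  City   = c("Chicago", "NewYork", "Chicago"),
  Gender = c("M", "M", "M"),
  Income = c("20K", "30K", NA),
  stringsAsFactors = FALSE
)
# Rows of df2 are deliberately in a different order from df1.
df2 <- data.frame(
  Age        = c(26L, 21L, 25L),
  City       = c("Chicago", "Chicago", "NewYork"),
  Gender     = c("M", "M", "M"),
  Avg_Income = c("50K", "30K", "40K"),
  stringsAsFactors = FALSE
)

keys <- c("Age", "City", "Gender")

# Hypothetical helper: collapse the key columns into one string per row.
make_key <- function(df) do.call(paste, c(df[keys], sep = "|"))

# For each row of df1, the index of the matching row in df2 (NA if none).
idx <- match(make_key(df1), make_key(df2))

# Fill only rows where Income is missing and a matching key exists.
fill <- is.na(df1$Income) & !is.na(idx)
df1$Income[fill] <- df2$Avg_Income[idx[fill]]
```

Rows of df1 that already have an Income, or that have no matching key in df2, are left unchanged.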

Answer 2

library(data.table);

## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
##    Age    City Gender Name Income
## 1:  50 NewYork      F  OOO     NA
## 2:  23 Chicago      M  SSS     NA
## 3:  62 NewYork      M  VVV     NA
## 4:  51 Chicago      F  FFF    90K
## 5:  31 Chicago      M  XXX     NA
Bt;
##    Age    City Gender Avg Income
## 1:  62 NewYork      M         NA
## 2:  51 Chicago      F        60K
## 3:  31 Chicago      M        50K
## 4:  27 NewYork      M         NA
## 5:  23 Chicago      M        60K

I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases:

one row in A that doesn't join with B (50/NewYork/F).
one row in B that doesn't join with A (27/NewYork/M).
two rows that join and should result in a replacement of NA in A with a non-NA value from B (23/Chicago/M and 31/Chicago/M).
one row that joins but has NA in B, so shouldn't affect the NA in A (62/NewYork/M).
one row that could join, but has non-NA in A, so shouldn't take the value from B (I assumed you would want this behavior) (51/Chicago/F). The value in A (90K) differs from the value in B (60K), so we can verify this behavior.

I also intentionally scrambled the rows of A and B to ensure we join them correctly, regardless of incoming row order.

## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
##    Age    City Gender Name Income
## 1:  50 NewYork      F  OOO     NA
## 2:  23 Chicago      M  SSS    60K
## 3:  62 NewYork      M  VVV     NA
## 4:  51 Chicago      F  FFF    90K
## 5:  31 Chicago      M  XXX    50K

In the above, I filter for NA values in A first, then do a join in the j argument on the key columns, and assign the source column to the target column in place using the data.table := syntax.

Note that in the data.table world, X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X, counter-intuitively). That's why I used Bt[.SD] instead of the likely more natural expectation, .SD[Bt]. We need a left join on .SD because the result of the join index expression will be assigned in place to the target column, so the RHS of the assignment must be a full vector corresponding to the target column.
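To make the join direction concrete, here is a small sketch with two made-up tables (X, Y and their columns are illustrative names only, not part of the question's data). X[Y] returns one row per row of Y, in Y's order, with NA where X has no match.

```r
library(data.table)

X <- data.table(k = c("a", "b", "c"), x = 1:3)
Y <- data.table(k = c("c", "a", "z"), y = c(10, 20, 30))

# One row per row of Y; "z" has no match in X, so x is NA there.
xy <- X[Y, on = "k"]

# One row per row of X; "b" has no match in Y, so y is NA there.
yx <- Y[X, on = "k"]
```

This is why the lookup table goes on the outside: the result must have exactly one element per row of the table being updated.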

You can repeat the in-place assignment line for each column you want to replace.
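If there are many columns, repeating the line by hand gets tedious. One possible sketch loops over a mapping from target column to source column and uses data.table's set() for the in-place update. The tables A and B, the Height values, and the name col_map are assumptions made up for illustration; B is assumed to have unique key combinations.

```r
library(data.table)

A <- data.table(Age = c(21L, 25L, 26L),
                City = c("Chicago", "NewYork", "Chicago"),
                Gender = c("M", "M", "M"),
                Income = c("20K", NA, NA),
                Height = c(NA, 170, NA))
B <- data.table(Age = c(26L, 21L, 25L),
                City = c("Chicago", "Chicago", "NewYork"),
                Gender = c("M", "M", "M"),
                `Avg Income` = c("50K", "30K", "40K"),
                `Avg Height` = c(175, 180, 172))

keys <- c("Age", "City", "Gender")
# Hypothetical mapping: names are target columns in A, values are source columns in B.
col_map <- c(Income = "Avg Income", Height = "Avg Height")

for (target in names(col_map)) {
  source_col <- col_map[[target]]
  rows <- which(is.na(A[[target]]))          # rows of A that need filling
  if (length(rows) == 0L) next
  # Join B onto the key columns of those rows: one result row per row to fill.
  looked_up <- B[A[rows, keys, with = FALSE], on = keys][[source_col]]
  set(A, i = rows, j = target, value = looked_up)
}
```

Rows with no match in B receive NA from the join, so they stay NA.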

## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
##   Ai Bi
## 1  2  5
## 2  5  3
## 3  4  2
## 4  3  1
mi <- which(is.na(Af$Income[m$Ai])); Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];
Af;
##   Age    City Gender Name Income
## 2  50 NewYork      F  OOO   <NA>
## 5  23 Chicago      M  SSS    60K
## 3  62 NewYork      M  VVV   <NA>
## 6  51 Chicago      F  FFF    90K
## 4  31 Chicago      M  XXX    50K

I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column onto the key-column subset of each of the A and B data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.

For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, e.g. that its Income value is NA for the Income replacement. We can then subset the join pair table for those rows, and do a direct assignment from B to A to carry out the replacement.

As before, you can repeat the assignment line for every column you want to replace.
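The same merge-on-row-indices idea can be wrapped in a function that loops over several columns. This is only a sketch: fill_from_lookup and col_map are hypothetical names, and the small A and B data frames are invented for illustration.

```r
# Hypothetical helper: fill NA values in A from B, matching rows on `keys`.
# col_map maps target column names in A to source column names in B.
fill_from_lookup <- function(A, B, keys, col_map) {
  # Inner join on the keys, keeping only the paired row indices.
  m <- merge(cbind(A[keys], Ai = seq_len(nrow(A))),
             cbind(B[keys], Bi = seq_len(nrow(B))))[c("Ai", "Bi")]
  for (target in names(col_map)) {
    src <- col_map[[target]]
    mi <- which(is.na(A[[target]][m$Ai]))    # joined pairs whose A value is NA
    A[[target]][m$Ai[mi]] <- B[[src]][m$Bi[mi]]
  }
  A
}

A <- data.frame(Age = c(21L, 25L, 40L),
                City = c("Chicago", "NewYork", "Chicago"),
                Gender = c("M", "M", "F"),
                Income = c(NA, "30K", NA),
                Height = c(180, NA, NA),
                stringsAsFactors = FALSE)
# check.names = FALSE keeps the space in the "Avg ..." column names.
B <- data.frame(Age = c(25L, 21L),
                City = c("NewYork", "Chicago"),
                Gender = c("M", "M"),
                `Avg Income` = c("40K", "30K"),
                `Avg Height` = c(172, 178),
                stringsAsFactors = FALSE, check.names = FALSE)

res <- fill_from_lookup(A, B, c("Age", "City", "Gender"),
                        c(Income = "Avg Income", Height = "Avg Height"))
```

The third row of A has no partner in B, so its NA values are left alone, and existing non-NA values in A are never overwritten.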

Answer 3

You can simply use the following to copy the city's average income from B into the Income column in A. Note that this matches on City only and overwrites every Income value, not just the missing ones.

dataFrameA$Income = dataFrameB$`Avg Income`[match(dataFrameA$City, dataFrameB$City)]

You'll have to use "`" if the column name has a space.

This is similar to doing a lookup with INDEX and MATCH in Excel. I'm assuming you're coming from Excel. The code will be more compact if you use data.table.

3 answers

Answer 1
1 2016-06-02 00:37:44

Answer 2
1 accepted 2016-06-02 02:36:14

Answer 3
0 2017-04-04 11:20:00
