Performing simple lookup using 2 data frames in R

In R, I have two data frames, A and B, as follows:

Data frame A:

Name      Age    City       Gender   Income    Company   ...
JXX       21     Chicago    M        20K       XYZ       ...
CXX       25     NewYork    M        30K       PQR       ...
CXX       26     Chicago    M        NA        ZZZ       ...

Data frame B:

Age    City       Gender    Avg Income  Avg Height  Avg Weight   ...
21     Chicago    M         30K         ...         ...          ...
25     NewYork    M         40K         ...         ...          ...
26     Chicago    M         50K         ...         ...          ...

I want to fill missing values in data frame A from data frame B.

For example, for the third row in data frame A, I can substitute the average income from data frame B for the missing exact income. I don't want to merge these two data frames; instead I want to perform a lookup-like operation using the Age, City and Gender columns.

So I think this works for Income. If there are only those 3 columns, you could substitute the names of the other columns in:

df1<-read.table(header = T, stringsAsFactors = F, text = "
Name      Age    City       Gender   Income    Company   
JXX       21     Chicago    M        20K       XYZ       
CXX       25     NewYork    M        30K       PQR       
CXX       26     Chicago    M        NA        ZZZ")       

df2<-read.table(header = T, stringsAsFactors = F, text = "

Age    City       Gender    Avg_Income 
21     Chicago    M         30K        
25     NewYork    M         40K        
26     Chicago    M         50K        ")

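# Note: this pairs rows by position, so it assumes df1 and df2 list the
# same Age/City/Gender combinations in the same row order.
df1[is.na(df1$Income),]$Income<-df2[is.na(df1$Income),]$Avg_Income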

It wouldn't surprise me if one of the regulars has a better way that saves you from having to re-type the names of the columns.
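One possible way to avoid both the re-typing and the dependence on row order is to match rows on a composite key and loop over a map of column names. The following is only a sketch under assumptions: the helper name fill_from_lookup, the column map, and the "\r" separator (assumed not to occur in the data) are all hypothetical, and df2 is deliberately reordered to show that the pairing is by key rather than by position.

```r
# Same toy data as above, with df2 deliberately reordered.
df1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name Age City    Gender Income Company
JXX  21  Chicago M      20K    XYZ
CXX  25  NewYork M      30K    PQR
CXX  26  Chicago M      NA     ZZZ")

df2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Age City    Gender Avg_Income
26  Chicago M      50K
21  Chicago M      30K
25  NewYork M      40K")

keys <- c("Age", "City", "Gender")

# Hypothetical helper: fill NAs in the target columns of `a` from the
# corresponding source columns of `b`, pairing rows on the key columns.
# `cols` is a named character vector: names are targets in `a`,
# values are sources in `b`.
fill_from_lookup <- function(a, b, keys, cols) {
  # One composite key per row; "\r" is an assumed-unused separator.
  key_a <- do.call(paste, c(a[keys], sep = "\r"))
  key_b <- do.call(paste, c(b[keys], sep = "\r"))
  idx <- match(key_a, key_b)  # row of b for each row of a, NA if no match
  for (target in names(cols)) {
    miss <- is.na(a[[target]]) & !is.na(idx)
    a[[target]][miss] <- b[[cols[[target]]]][idx[miss]]
  }
  a
}

df1 <- fill_from_lookup(df1, df2, keys, c(Income = "Avg_Income"))
df1$Income
```

Adding more entries to the map (for example a height or weight pair) would fill those columns in the same pass; existing non-NA values are left alone.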

library(data.table);

## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
##    Age    City Gender Name Income
## 1:  50 NewYork      F  OOO     NA
## 2:  23 Chicago      M  SSS     NA
## 3:  62 NewYork      M  VVV     NA
## 4:  51 Chicago      F  FFF    90K
## 5:  31 Chicago      M  XXX     NA
Bt;
##    Age    City Gender Avg Income
## 1:  62 NewYork      M         NA
## 2:  51 Chicago      F        60K
## 3:  31 Chicago      M        50K
## 4:  27 NewYork      M         NA
## 5:  23 Chicago      M        60K

I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases:

  • one row in A that doesn't join with B (50/NewYork/F).
  • one row in B that doesn't join with A (27/NewYork/M).
  • two rows that join and should result in a replacement of NA in A with a non-NA value from B (23/Chicago/M and 31/Chicago/M).
  • one row that joins but has NA in B, so shouldn't affect the NA in A (62/NewYork/M).
  • one row that could join, but has non-NA in A, so shouldn't take the value from B (I assumed you would want this behavior) (51/Chicago/F). The value in A (90K) differs from the value in B (60K), so we can verify this behavior.

I also intentionally scrambled the rows of A and B to ensure we join them correctly, regardless of incoming row order.


## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
##    Age    City Gender Name Income
## 1:  50 NewYork      F  OOO     NA
## 2:  23 Chicago      M  SSS    60K
## 3:  62 NewYork      M  VVV     NA
## 4:  51 Chicago      F  FFF    90K
## 5:  31 Chicago      M  XXX    50K

In the above I filter for NA values in A first, then do a join in the j argument on the key columns and assign the source column to the target column in place using the data.table := syntax.

Note that in the data.table world, X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now, counter-intuitively, referring to X). That's why I used Bt[.SD] instead of the perhaps more natural .SD[Bt]. We need a left join on .SD because the result of the join index expression will be assigned in place to the target column, so the RHS of the assignment must be a full vector corresponding to the target column.

You can repeat the in-place assignment line for each column you want to replace.
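To make the join direction concrete, here is a small sketch with two toy tables. The names X, Y, k, x and y are made up for illustration and are not part of the data above; it assumes the data.table package is installed.

```r
library(data.table)

# Toy tables sharing a key column k (illustrative names only).
X <- data.table(k = c("a", "b", "c"), x = 1:3)
Y <- data.table(k = c("b", "c", "d"), y = c(20, 30, 40))

X[Y, on = "k"]  # one row per row of Y (keys b, c, d); x is NA for "d"
Y[X, on = "k"]  # one row per row of X (keys a, b, c); y is NA for "a"

# Because Y[.SD] yields one row per row of X, in X's row order,
# its y column can be assigned straight into X.
X[, y := Y[.SD, on = "k", y]]
X
```

The result of Y[.SD] has exactly as many rows as X, which is what makes the in-place assignment line up.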


## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
##   Ai Bi
## 1  2  5
## 2  5  3
## 3  4  2
## 4  3  1
mi <- which(is.na(Af$Income[m$Ai])); Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];
Af;
##   Age    City Gender Name Income
## 2  50 NewYork      F  OOO   <NA>
## 5  23 Chicago      M  SSS    60K
## 3  62 NewYork      M  VVV   <NA>
## 6  51 Chicago      F  FFF    90K
## 4  31 Chicago      M  XXX    50K

I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column onto the key-column subset of each of the A and B data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.

For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, e.g. that its Income value is NA for the Income replacement. We can then subset the join pair table for those rows, and do a direct assignment from B to A to carry out the replacement.

As before, you can repeat the assignment line for every column you want to replace.
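If there are many columns, the repeated assignment could instead be driven by a loop over a column map. This is a sketch only: the small Af and Bf stand-ins, the Height and Avg Height columns, and the col_map vector are assumptions made up for illustration, not the generated data above.

```r
# Toy stand-ins for Af and Bf (values are illustrative assumptions).
Af <- data.frame(Age = c(50L, 23L, 51L),
                 City = c("NewYork", "Chicago", "Chicago"),
                 Gender = c("F", "M", "F"),
                 Income = c(NA, NA, "90K"),
                 Height = c(NA, NA, 170),
                 stringsAsFactors = FALSE)
Bf <- data.frame(Age = c(51L, 23L),
                 City = c("Chicago", "Chicago"),
                 Gender = c("F", "M"),
                 `Avg Income` = c("60K", "60K"),
                 `Avg Height` = c(165, 180),
                 check.names = FALSE, stringsAsFactors = FALSE)
keys <- c("Age", "City", "Gender")

# Precompute the joined row pairs once (inner join on the key columns).
m <- merge(cbind(Af[keys], Ai = seq_len(nrow(Af))),
           cbind(Bf[keys], Bi = seq_len(nrow(Bf))))[c("Ai", "Bi")]

# Hypothetical map: target column in A -> source column in B.
col_map <- c(Income = "Avg Income", Height = "Avg Height")

for (target in names(col_map)) {
  src <- col_map[[target]]
  # Joined pairs whose A-side value is NA for this target column.
  mi <- which(is.na(Af[[target]][m$Ai]))
  Af[[target]][m$Ai[mi]] <- Bf[[src]][m$Bi[mi]]
}
Af
```

The merge() call runs once; each pass of the loop only recomputes which joined pairs have an NA in the current target column.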

You can simply use the following to copy the city's average income from B into the Income column of A.

dataFrameA$Income = dataFrameB$`Avg Income`[match(dataFrameA$City, dataFrameB$City)]

You'll have to wrap the column name in backticks (`) if it contains a space.

This is similar to doing a lookup with INDEX and MATCH in Excel. I'm assuming you're coming from Excel. The code will be more compact if you use data.table.
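As a small worked sketch of how match() behaves in this lookup, the toy frames below reuse the question's values; the variables pos and na_rows are introduced here for illustration. match() returns, for each element of its first argument, the position of the first equal element in the second, so matching on City alone cannot tell the two Chicago rows apart. The sketch also restricts the assignment to NA rows, which is an assumption about the desired behavior rather than something the one-liner above does.

```r
dataFrameA <- data.frame(Age = c(21L, 25L, 26L),
                         City = c("Chicago", "NewYork", "Chicago"),
                         Income = c("20K", "30K", NA),
                         stringsAsFactors = FALSE)
dataFrameB <- data.frame(Age = c(21L, 25L, 26L),
                         City = c("Chicago", "NewYork", "Chicago"),
                         `Avg Income` = c("30K", "40K", "50K"),
                         check.names = FALSE, stringsAsFactors = FALSE)

# Position in B of the FIRST row whose City equals each City in A.
pos <- match(dataFrameA$City, dataFrameB$City)
pos

# Fill only the missing incomes, keeping the existing ones.
na_rows <- is.na(dataFrameA$Income)
dataFrameA$Income[na_rows] <- dataFrameB$`Avg Income`[pos[na_rows]]
dataFrameA$Income
```

Here the 26-year-old in Chicago receives the value from the first Chicago row in B (30K) rather than from the 26/Chicago/M row (50K). Distinguishing them would require matching on Age and Gender as well, for example through a combined key.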

 