Performing simple lookup using 2 data frames in R
In R, I have two data frames, A and B, as follows:
Data frame A:
Name  Age  City     Gender  Income  Company  ...
JXX   21   Chicago  M       20K     XYZ      ...
CXX   25   NewYork  M       30K     PQR      ...
CXX   26   Chicago  M       NA      ZZZ      ...

Data frame B:
Age  City     Gender  Avg Income  Avg Height  Avg Weight  ...
21   Chicago  M       30K         ...         ...         ...
25   NewYork  M       40K         ...         ...         ...
26   Chicago  M       50K         ...         ...         ...
I want to fill missing values in data frame A from data frame B.
For example, for the third row in data frame A, I can substitute the average income from data frame B for the exact income. I don't want to merge these two data frames; instead, I want to perform a lookup-like operation using the Age, City and Gender columns.
So I think this works for Income. If there are only those 3 columns, you could substitute the names of the other columns in:
df1<-read.table(header = T, stringsAsFactors = F, text = "
Name Age City Gender Income Company
JXX 21 Chicago M 20K XYZ
CXX 25 NewYork M 30K PQR
CXX 26 Chicago M NA ZZZ")
df2<-read.table(header = T, stringsAsFactors = F, text = "
Age City Gender Avg_Income
21 Chicago M 30K
25 NewYork M 40K
26 Chicago M 50K ")
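# Fill NA incomes from df2. Note: this indexes df2 by row position,
# so it assumes df1 and df2 have their rows in the same order.
na_rows <- is.na(df1$Income)
df1$Income[na_rows] <- df2$Avg_Income[na_rows]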
It wouldn't surprise me if one of the regulars has a better way that prevents you from having to re-type the names of the columns.
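One way to avoid both retyping column names and depending on row order is to wrap the lookup in a small helper. The sketch below is not part of the answer above: fill_na_from and its argument names are hypothetical, and it assumes each key combination appears at most once in the lookup table.

```r
# Hypothetical helper: fill NA values of a[[target]] from b[[source]],
# matching rows on the columns named in `keys` (not on row position).
fill_na_from <- function(a, b, keys, target, source) {
  # Build one composite key string per row; "\r" is an assumed separator
  # that does not occur in any key value.
  key_a <- do.call(paste, c(a[keys], sep = "\r"))
  key_b <- do.call(paste, c(b[keys], sep = "\r"))
  idx <- match(key_a, key_b)      # row of b for each row of a (NA if none)
  missing <- is.na(a[[target]])
  a[[target]][missing] <- b[[source]][idx[missing]]
  a
}

df1 <- data.frame(
  Name = c("JXX", "CXX", "CXX"), Age = c(21, 25, 26),
  City = c("Chicago", "NewYork", "Chicago"), Gender = c("M", "M", "M"),
  Income = c("20K", "30K", NA), stringsAsFactors = FALSE
)
# Lookup table deliberately in a different row order than df1.
df2 <- data.frame(
  Age = c(26, 21, 25), City = c("Chicago", "Chicago", "NewYork"),
  Gender = c("M", "M", "M"), Avg_Income = c("50K", "30K", "40K"),
  stringsAsFactors = FALSE
)

filled <- fill_na_from(df1, df2, c("Age", "City", "Gender"),
                       "Income", "Avg_Income")
print(filled$Income)
```

Calling the helper once per target column keeps the key columns in a single place; rows of A with no matching key in B simply stay NA.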
library(data.table);
## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS NA
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX NA
Bt;
## Age City Gender Avg Income
## 1: 62 NewYork M NA
## 2: 51 Chicago F 60K
## 3: 31 Chicago M 50K
## 4: 27 NewYork M NA
## 5: 23 Chicago M 60K
I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases.
I also intentionally scrambled the rows of A and B to ensure we join them correctly, regardless of incoming row order.
## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS 60K
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX 50K
In the above I filter for NA values in A first, then do a join in the j argument on the key columns and assign the source column in-place to the target column using the data.table := syntax.
Note that in the data.table world X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X, counter-intuitively). That's why I used Bt[.SD] instead of (the likely more natural expectation of) .SD[Bt]. We need a left join on .SD because the result of the join index expression will be assigned in-place to the target column, and so the RHS of the assignment must be a full vector corresponding to the target column.
You can repeat the in-place assignment line for each column you want to replace.
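For comparison, the same fill can also be written as a data.table update join, where A stays on the outside and B is passed as i. This is a sketch on made-up data rather than the answer's own code; the tables A and B and the column name Avg_Income (used instead of `Avg Income` to avoid backticks) are assumptions.

```r
library(data.table)

# Assumed toy data modelled on the question.
A <- data.table(
  Age = c(21L, 25L, 26L), City = c("Chicago", "NewYork", "Chicago"),
  Gender = c("M", "M", "M"), Income = c("20K", "30K", NA)
)
# B deliberately in a different row order than A.
B <- data.table(
  Age = c(26L, 21L, 25L), City = c("Chicago", "Chicago", "NewYork"),
  Gender = c("M", "M", "M"), Avg_Income = c("50K", "30K", "40K")
)
keys <- c("Age", "City", "Gender")

# Update join: for rows of A that match a row of B on `keys`, replace
# Income with B's value (referenced as i.Avg_Income) only where it is NA.
A[B, on = keys, Income := ifelse(is.na(Income), i.Avg_Income, Income)]
print(A)
```

Rows of A without a match in B are left untouched, so no separate NA filter on A is needed in this form.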
## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
## Ai Bi
## 1 2 5
## 2 5 3
## 3 4 2
## 4 3 1
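## rows of the join-pair table whose A-side Income is NA
mi <- which(is.na(Af$Income[m$Ai]));
## copy the matching B values into those A rows
Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];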
Af;
## Age City Gender Name Income
## 2 50 NewYork F OOO <NA>
## 5 23 Chicago M SSS 60K
## 3 62 NewYork M VVV <NA>
## 6 51 Chicago F FFF 90K
## 4 31 Chicago M XXX 50K
I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column into the key-column subset of each of the A and B data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.
For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, e.g. that its Income value is NA for the Income replacement. We can then subset the join pair table for those rows, and do a direct assignment from B to A to carry out the replacement.
As before, you can repeat the assignment line for every column you want to replace.
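Repeating the assignment for several columns can be sketched as a loop over a target-to-source mapping that reuses the precomputed join pairs. The data below is made up, and the Height / Avg_Height columns and the col_map vector are hypothetical names introduced only for illustration.

```r
# Assumed toy data; Height and Avg_Height are hypothetical extra columns.
Af <- data.frame(
  Age = c(50, 23, 62), City = c("NewYork", "Chicago", "NewYork"),
  Gender = c("F", "M", "M"), Income = c(NA, NA, "70K"),
  Height = c(170, NA, NA), stringsAsFactors = FALSE
)
Bf <- data.frame(
  Age = c(62, 23), City = c("NewYork", "Chicago"), Gender = c("M", "M"),
  Avg_Income = c("80K", "60K"), Avg_Height = c(180, 175),
  stringsAsFactors = FALSE
)
keys <- c("Age", "City", "Gender")

# Precompute the matched (Ai, Bi) row-index pairs once, as described above.
m <- merge(
  cbind(Af[keys], Ai = seq_len(nrow(Af))),
  cbind(Bf[keys], Bi = seq_len(nrow(Bf)))
)[c("Ai", "Bi")]

# Names are target columns in A, values are the source columns in B.
col_map <- c(Income = "Avg_Income", Height = "Avg_Height")
for (target in names(col_map)) {
  src <- col_map[[target]]
  mi <- which(is.na(Af[[target]][m$Ai]))   # join pairs whose A value is NA
  Af[[target]][m$Ai[mi]] <- Bf[[src]][m$Bi[mi]]
}
print(Af)
```

The first row of Af has no matching key in Bf, so its Income stays NA, while existing values such as "70K" are not overwritten.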
You can simply use the following to fill the income in A with the city's average income from B.
dataFrameA$Income = dataFrameB$`Avg Income`[match(dataFrameA$City, dataFrameB$City)]
You'll have to use backticks (`) if the column name has a space.
This is similar to doing a lookup with INDEX and MATCH in Excel. I'm assuming you're coming from Excel. The code will be more compact if you use data.table.
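Note that the one-liner above overwrites every value of Income, and match() returns only the first matching row. A minimal sketch that touches only the missing values, under the assumption that City alone identifies a row in the lookup table (the data is made up for illustration):

```r
dataFrameA <- data.frame(
  City = c("Chicago", "NewYork", "Chicago"),
  Income = c("20K", NA, NA), stringsAsFactors = FALSE
)
# check.names = FALSE keeps the space in the column name `Avg Income`.
dataFrameB <- data.frame(
  City = c("NewYork", "Chicago"),
  `Avg Income` = c("40K", "30K"),
  check.names = FALSE, stringsAsFactors = FALSE
)

idx <- match(dataFrameA$City, dataFrameB$City)  # row of B for each row of A
missing <- is.na(dataFrameA$Income)             # only these rows are filled
dataFrameA$Income[missing] <- dataFrameB$`Avg Income`[idx[missing]]
print(dataFrameA$Income)
```

If the lookup really depends on Age, City and Gender together, a single-column match like this is not enough and the keys need to be combined first.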