
Matching columns of two different data frames in R

I have two data frames with longitude and latitude values, and I would like to pull values from data frame 2 (say column df2$C, its third column) into data frame 1 wherever the coordinates match. Data frame 1 has two columns (lon, lat); data frame 2 has three (lon, lat, and some value C). I want to add a third column to data frame 1 that holds the df2$C value for every row where BOTH columns match exactly, that is, df1$lon == df2$lon AND df1$lat == df2$lat. For lat/lon pairs that don't match I would like an NA, so that the new column has length nrow(df1).

I tried the merge function, but I'm having trouble matching both columns of df1 to those of df2.

You could try data.table:

library(data.table)
setDT(df1)
setkey(setDT(df2), lat, lon)
df2[df1]
#   lat lon          C
#1:  58   1         NA
#2:  52  10         NA
#3:  54   7 -0.9094088
#4:  60   2         NA
#5:  50   3  1.4541841
#6:  56   9 -1.7771135
#7:  59   5         NA
#8:  55   8         NA
#9:  53   4         NA
#10: 57   6         NA

Data:

df1 <- structure(list(lat = c(58L, 52L, 54L, 60L, 50L, 56L, 59L, 55L, 
53L, 57L), lon = c(1L, 10L, 7L, 2L, 3L, 9L, 5L, 8L, 4L, 6L)), .Names = c("lat", 
"lon"), row.names = c(NA, -10L), class = "data.frame")

df2 <- structure(list(lat = c(51L, 55L, 50L, 58L, 56L, 57L, 60L, 54L, 
 52L, 54L), lon = c(13L, 10L, 3L, 6L, 9L, 8L, 9L, 16L, 4L, 7L), 
 C = c(1.48642005012902, 1.53314455225747, 1.45418413640182, 
-0.874122129771392, -1.77711353745745, 0.128866710402714, 
-2.41118134931725, -1.78305563078752, -0.0173287724390305, 
-0.909408846416724)), .Names = c("lat", "lon", "C"), row.names = c(NA, 
-10L), class = "data.frame")
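If the goal is literally to add C to df1 while keeping df1's own row order and length, a data.table update join is one option. The sketch below is an assumption about how that could look, not part of the answer above; the names dt1 and dt2 are hypothetical, and the values are the same as in the data above.

```r
library(data.table)

# Same lat/lon/C values as the data above, rebuilt as data.tables
# (dt1 and dt2 are illustrative names)
dt1 <- data.table(lat = c(58L, 52L, 54L, 60L, 50L, 56L, 59L, 55L, 53L, 57L),
                  lon = c(1L, 10L, 7L, 2L, 3L, 9L, 5L, 8L, 4L, 6L))
dt2 <- data.table(lat = c(51L, 55L, 50L, 58L, 56L, 57L, 60L, 54L, 52L, 54L),
                  lon = c(13L, 10L, 3L, 6L, 9L, 8L, 9L, 16L, 4L, 7L),
                  C = c(1.48642005012902, 1.53314455225747, 1.45418413640182,
                        -0.874122129771392, -1.77711353745745, 0.128866710402714,
                        -2.41118134931725, -1.78305563078752, -0.0173287724390305,
                        -0.909408846416724))

# Update join: rows of dt1 whose (lat, lon) pair occurs in dt2 receive dt2's C;
# all other rows of dt1 are left as NA. dt1 is modified by reference.
dt1[dt2, on = c("lat", "lon"), C := i.C]
```

Because the join is given through on=, no setkey call is needed, and dt1 keeps its original 10 rows in their original order.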

Since these are geocodes, one thing to watch out for is that the fields have to match exactly. For instance, if one dataset stores lon/lat to 6 significant figures and the other to 8, you will get no matches (or very few). I wonder if this is why merge(...) isn't working for you; with exactly matching values it should work, as the merge(...) call further down shows.
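If precision is the problem, one possible workaround is to round both sides to a common number of decimals before merging. The sketch below uses made-up coordinates (the data frames a and b and the choice of 4 decimals are assumptions for illustration only); points that sit right at a rounding boundary can still end up on different sides, so this is a mitigation rather than a guarantee.

```r
# Hypothetical data: the same two locations stored at different precision
a <- data.frame(lon = c(-122.419416, 2.352222),
                lat = c(37.774929, 48.856614))
b <- data.frame(lon = c(-122.41941553, 2.35222190),
                lat = c(37.77492950, 48.85661400),
                C   = c(10, 20))

# Exact matching: the lon values differ in the later digits, so nothing matches
exact <- merge(a, b, by = c("lon", "lat"), all.x = TRUE)

# Round both data frames to a shared precision (4 decimals, an arbitrary choice)
digits <- 4
a_r <- transform(a, lon = round(lon, digits), lat = round(lat, digits))
b_r <- transform(b, lon = round(lon, digits), lat = round(lat, digits))
rounded <- merge(a_r, b_r, by = c("lon", "lat"), all.x = TRUE)
```

Here exact$C contains only NA, while rounded$C is filled for both rows.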

merge(...) should work, especially if both data frames have the same column names. Using the datasets from @akrun's answer:

merge(df1,df2, by=c("lon","lat"),all.x=TRUE)
#    lon lat          C
# 1    1  58         NA
# 2    2  60         NA
# 3    3  50  1.4541841
# 4    4  53         NA
# 5    5  59         NA
# 6    6  57         NA
# 7    7  54 -0.9094088
# 8    8  55         NA
# 9    9  56 -1.7771135
# 10  10  52         NA

If you don't specify the by=... argument, merge(...) will use all common columns, so in this case you could just write:

merge(df1,df2,all.x=TRUE)

You could also use join(...) in the plyr package, which performs a left join by default:

library(plyr)
join(df1,df2)

All of these options return the same matched values; only the row order differs. merge(...) sorts the result by the key columns, while df2[df1] and join(...) keep the row order of df1.
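If you want to stay with merge(...) but need the result in df1's original order, one common pattern is to tag the rows before merging and sort on that tag afterwards. This is a sketch on a trimmed, illustrative subset of the data (C values shortened); row_id is a hypothetical helper column, not something merge(...) provides.

```r
# Illustrative subset of the data, with shortened C values
df1 <- data.frame(lat = c(58L, 52L, 54L, 60L, 50L),
                  lon = c(1L, 10L, 7L, 2L, 3L))
df2 <- data.frame(lat = c(50L, 54L, 52L),
                  lon = c(3L, 7L, 4L),
                  C   = c(1.45, -0.91, -0.02))

# Remember the original row order (row_id is a hypothetical helper column)
df1$row_id <- seq_len(nrow(df1))

# Left join on both coordinates; merge() returns rows sorted by lon, lat
m <- merge(df1, df2, by = c("lon", "lat"), all.x = TRUE)

# Restore df1's order and drop the helper column
m <- m[order(m$row_id), c("lat", "lon", "C")]
rownames(m) <- NULL
```

This assumes each (lat, lon) pair occurs at most once in df2; with duplicate pairs in df2, merge(...) returns one row per match and the result will be longer than nrow(df1).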

The data.table approach is likely to be the fastest, although unless the dataset is really large (>1e5 rows) you might not notice the difference.

You can use ifelse for this. For example, with the data:

df1 <- structure(list(lat = c(58L, 52L, 54L, 60L, 50L, 56L, 59L, 55L, 
53L, 57L), lon = c(1L, 10L, 7L, 2L, 3L, 9L, 5L, 8L, 4L, 6L)), .Names = c("lat", 
"lon"), row.names = c(NA, -10L), class = "data.frame")

df2 <- structure(list(lat = c(51L, 55L, 50L, 58L, 56L, 57L, 60L, 54L, 
 52L, 54L), lon = c(13L, 10L, 3L, 6L, 9L, 8L, 9L, 16L, 4L, 7L), 
 C = c(1.48642005012902, 1.53314455225747, 1.45418413640182, 
-0.874122129771392, -1.77711353745745, 0.128866710402714, 
-2.41118134931725, -1.78305563078752, -0.0173287724390305, 
-0.909408846416724)), .Names = c("lat", "lon", "C"), row.names = c(NA, 
-10L), class = "data.frame")

You can create column C for df1 with:

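# Locate each (lat, lon) pair of df1 in df2. Testing lat and lon separately
# with %in% would also flag rows whose lat and lon occur in df2 but not together,
# and would take C from the wrong row.
idx <- match(paste(df1$lat, df1$lon), paste(df2$lat, df2$lon))
df1$C <- ifelse(is.na(idx), NA, df2$C[idx])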
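It is worth checking that a test like this really compares pairs. df1$lat %in% df2$lat & df1$lon %in% df2$lon looks at the two columns independently, so a row such as (52, 10) is flagged because 52 and 10 each appear somewhere in df2, even though df2 has no (52, 10) row. The sketch below contrasts the two tests on a trimmed, illustrative subset of the data (C values shortened; the variable names are hypothetical).

```r
# Illustrative subset of the data, with shortened C values
df1 <- data.frame(lat = c(58L, 52L, 54L), lon = c(1L, 10L, 7L))
df2 <- data.frame(lat = c(55L, 52L, 54L), lon = c(10L, 4L, 7L),
                  C   = c(1.53, -0.02, -0.91))

# Column-wise test: TRUE whenever lat and lon each occur somewhere in df2
colwise <- df1$lat %in% df2$lat & df1$lon %in% df2$lon

# Pair-wise test: position of each (lat, lon) pair in df2, NA if the pair is absent
idx <- match(paste(df1$lat, df1$lon), paste(df2$lat, df2$lon))
pairwise <- !is.na(idx)

# C aligned to df1's rows, NA where the pair has no match
C_for_df1 <- df2$C[idx]
```

Here colwise is TRUE for the second row, while pairwise is TRUE only for the third row, the one pair that actually exists in df2.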
