In R, how can I add specific columns from one data frame to another when some values are equal in both data frames?
I have two datasets that both contain the same Country and Year row combinations, and I would like to add some columns from one dataset to the other so that the row combinations match.
Dataset 1:
+----------+------+---------+---------+-----+
| Country | Year | exports | imports | ... |
+----------+------+---------+---------+-----+
| Germany | 2000 | 0.70 | 0.40 | ... |
| Germany | 2001 | 0.68 | 0.41 | ... |
| Germany | 2002 | 0.71 | 0.48 | ... |
| Germany | 2003 | ... | ... | ... |
| Spain | 2000 | 0.51 | 0.56 | ... |
| Spain | 2001 | 0.48 | 0.50 | ... |
| Spain | 2002 | 0.50 | 0.53 | ... |
| Spain | 2003 | ... | ... | ... |
| ... | ... | ... | ... | ... |
+----------+------+---------+---------+-----+
Dataset 2:
+----------+-----+------+--------------+-------+-----+
| Country | CC | Year | unemployment | Pop | ... |
+----------+-----+------+--------------+-------+-----+
| Germany | GER | 2000 | 0.03 | 79.50 | ... |
| Germany | GER | 2001 | 0.05 | 79.53 | ... |
| Germany | GER | 2002 | 0.04 | 79.80 | ... |
| Germany | GER | 2003 | ... | ... | ... |
| Hungary | HUN | 2000 | ... | ... | ... |
| Hungary | HUN | 2001 | ... | ... | ... |
| Hungary | HUN | 2002 | ... | ... | ... |
| Hungary | HUN | 2003 | ... | ... | ... |
| Spain | ESP | 2000 | 0.08 | 40.2 | ... |
| Spain | ESP | 2001 | 0.11 | 40.5 | ... |
| Spain | ESP | 2002 | 0.10 | 40.55 | ... |
| Spain | ESP | 2003 | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
+----------+-----+------+--------------+-------+-----+
I want the merged data to look like this:
+----------+-----+------+---------+---------+--------------+-------+-----+
| Country | CC | Year | exports | imports | unemployment | Pop | ... |
+----------+-----+------+---------+---------+--------------+-------+-----+
| Germany | GER | 2000 | 0.70 | 0.40 | 0.03 | 79.50 | ... |
| Germany | GER | 2001 | 0.68 | 0.41 | 0.05 | 79.53 | ... |
| Germany | GER | 2002 | 0.71 | 0.48 | 0.04 | 79.80 | ... |
| Germany | GER | 2003 | ... | ... | ... | ... | ... |
| Spain | ESP | 2000 | 0.51 | 0.56 | 0.08 | 40.2 | ... |
| Spain | ESP | 2001 | 0.48 | 0.50 | 0.11 | 40.5 | ... |
| Spain | ESP | 2002 | 0.50 | 0.53 | 0.10 | 40.55 | ... |
| Spain | ESP | 2003 | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
+----------+-----+------+---------+---------+--------------+-------+-----+
So the countries that are not in dataset 1 (like Hungary in this case) are not in the merged dataset, and the country code is also in the new dataset. Could someone tell me how I can achieve this? I have 28 years for each of about 100 countries, so a function in which I have to specify every combination would not be practical.
I tried to merge it with merge() but did not succeed, since it just created hundreds of rows with the same country and year combination.
merge() absolutely should work for this. You should specify that you are merging on two columns:
merge( df1 , df2 , by=c( "Country", "Year") )
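As a minimal sketch of the call above, assuming small toy versions of the two datasets (the values are copied from the tables in the question; the Hungary figures are elided there, so they are left as NA here), the two-column merge could look like this:

```r
# Toy versions of the two datasets; values come from the tables in the question.
df1 <- data.frame(
  Country = c("Germany", "Germany", "Spain", "Spain"),
  Year    = c(2000, 2001, 2000, 2001),
  exports = c(0.70, 0.68, 0.51, 0.48),
  imports = c(0.40, 0.41, 0.56, 0.50),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Country      = c("Germany", "Germany", "Hungary", "Spain", "Spain"),
  CC           = c("GER", "GER", "HUN", "ESP", "ESP"),
  Year         = c(2000, 2001, 2000, 2000, 2001),
  unemployment = c(0.03, 0.05, NA, 0.08, 0.11),  # Hungary value elided in the question
  Pop          = c(79.50, 79.53, NA, 40.2, 40.5),
  stringsAsFactors = FALSE
)

# Inner join on both key columns: only Country/Year pairs present in both survive.
merged <- merge(df1, df2, by = c("Country", "Year"))

# Reorder the columns to match the layout requested in the question.
merged <- merged[, c("Country", "CC", "Year", "exports", "imports",
                     "unemployment", "Pop")]
print(merged)
```

Because merge() defaults to an inner join, Hungary drops out without being named anywhere, so the same call should scale to 100 countries and 28 years without listing combinations.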
Also confirm that the class of the merging variables is the same in both data frames:
sapply( df1[, c( "Country", "Year")] , class )
sapply( df2[, c( "Country", "Year")] , class )
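If the two checks above report different classes, one option is to coerce the keys to a common type before merging. The following is only a sketch, assuming a hypothetical case in which Year arrived as text and Country as a factor in df2:

```r
df1 <- data.frame(Country = c("Germany", "Spain"), Year = c(2000, 2000),
                  exports = c(0.70, 0.51), stringsAsFactors = FALSE)
# Hypothetical: Year is character and Country is a factor in the second data frame.
df2 <- data.frame(Country = factor(c("Germany", "Spain")),
                  Year = c("2000", "2000"),
                  unemployment = c(0.03, 0.08),
                  stringsAsFactors = FALSE)

keys <- c("Country", "Year")
print(sapply(df1[, keys], class))
print(sapply(df2[, keys], class))

# Coerce both keys in df2 to the classes used in df1.
df2$Country <- as.character(df2$Country)
df2$Year    <- as.numeric(df2$Year)

# The key classes now agree, so the merge compares like with like.
classes_match <- identical(sapply(df1[, keys], class), sapply(df2[, keys], class))
merged <- merge(df1, df2, by = keys)
print(merged)
```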
Confirm that the variables are spelled the same way in both data frames:
intersect( names( df1 ) , names( df2 ))
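A spelling mismatch is one possible cause of the "hundreds of rows" symptom, because merge() without by joins on whatever column names the two data frames share. A small sketch, assuming a hypothetical lower-case year column in df2:

```r
df1 <- data.frame(Country = rep("Germany", 3), Year = 2000:2002,
                  exports = c(0.70, 0.68, 0.71), stringsAsFactors = FALSE)
# Hypothetical mismatch: the year column is lower-case in the second data frame.
df2 <- data.frame(Country = rep("Germany", 3), year = 2000:2002,
                  unemployment = c(0.03, 0.05, 0.04), stringsAsFactors = FALSE)

shared <- intersect(names(df1), names(df2))
print(shared)  # only Country is shared

# With no `by`, merge() joins on the shared names, here Country alone,
# so every Germany row in df1 pairs with every Germany row in df2.
too_many <- merge(df1, df2)
print(nrow(too_many))

# Renaming the column restores the intended two-column key.
names(df2)[names(df2) == "year"] <- "Year"
fixed <- merge(df1, df2, by = c("Country", "Year"))
print(nrow(fixed))
```

With 28 years per country, the same mismatch would give 28 x 28 rows for each country.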
Finally, confirm that each Country and Year combination is unique in both data frames (both sums should be 0):
sum( duplicated( df1[ ,c( "Country", "Year") ] ))
sum( duplicated( df2[ ,c( "Country", "Year") ] ))
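If either sum is greater than zero, it may help to look at the offending rows before merging. A minimal sketch, assuming a hypothetical df2 in which Germany 2000 appears twice:

```r
keys <- c("Country", "Year")
df1 <- data.frame(Country = c("Germany", "Germany"), Year = c(2000, 2001),
                  exports = c(0.70, 0.68), stringsAsFactors = FALSE)
# Hypothetical: Germany 2000 appears twice in the second data frame.
df2 <- data.frame(Country = c("Germany", "Germany", "Germany"),
                  Year = c(2000, 2000, 2001),
                  unemployment = c(0.03, 0.03, 0.05),
                  stringsAsFactors = FALSE)

# Flag every row whose key occurs more than once (first occurrence included).
dup_rows <- duplicated(df2[, keys]) | duplicated(df2[, keys], fromLast = TRUE)
print(df2[dup_rows, ])

# Each duplicated key multiplies the matching rows in the merge result.
with_dups <- merge(df1, df2, by = keys)
print(nrow(with_dups))

# Dropping fully identical rows first removes this particular duplication.
deduped <- merge(df1, unique(df2), by = keys)
print(nrow(deduped))
```

unique() only helps when the duplicated rows are identical in every column; rows that share a key but differ in values need a decision about which one to keep.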
You can use inner_join() from the dplyr package to do this:
dplyr::inner_join(df1, df2, by=c("Country", "Year"))
The answer with merge() worked! Now I am facing the problem that, e.g., Spain does not have any unemployment data for the year 2000. However, I still want to keep all years for Spain and would like to have an NA in the unemployment column for Spain in 2000 in the merged dataset. How can I achieve this?
I tried to use merge(df1, df2, all.x = TRUE), but sometimes it just creates NAs for some reason...
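One way to approach this, sketched under the assumption that the Spain 2000 row is simply absent from df2: all.x = TRUE keeps every row of df1 and fills the df2 columns with NA wherever no matching key exists, so NAs are expected for unmatched rows. Listing the keys explicitly with by and then inspecting the rows that received NA can show whether they are genuine gaps or key mismatches. The toy values below are taken from the question's tables.

```r
keys <- c("Country", "Year")
df1 <- data.frame(Country = c("Germany", "Spain", "Spain"),
                  Year = c(2000, 2000, 2001),
                  exports = c(0.70, 0.51, 0.48),
                  stringsAsFactors = FALSE)
# Assumption: Spain 2000 has no row at all in the second data frame.
df2 <- data.frame(Country = c("Germany", "Spain"),
                  CC = c("GER", "ESP"),
                  Year = c(2000, 2001),
                  unemployment = c(0.03, 0.11),
                  stringsAsFactors = FALSE)

# Left join: keep every row of df1, fill unmatched df2 columns with NA.
merged <- merge(df1, df2, by = keys, all.x = TRUE)
print(merged)

# Keys from df1 that found no partner in df2 (these rows received NA).
# Note: this would also flag matched rows whose CC is itself NA in df2.
unmatched <- merged[is.na(merged$CC), keys]
print(unmatched)
```

In this sketch the country code is also NA for the unmatched row, since CC lives only in df2; if that matters, it could be filled from a separate Country-to-CC lookup.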