简体   繁体   中英

In R, how can I add some specific columns from a dataframe to another dataframe when some values are equal in both dataframes?

I have two datasets which have both the same row combinations Country & Year and I would like to add some columns from one dataset to the other one in a way that the row combinations match.

Dataset 1:

+----------+------+---------+---------+-----+
| Country  | Year | exports | imports | ... |
+----------+------+---------+---------+-----+
| Germany  | 2000 | 0.70    | 0.40    | ... |
| Germany  | 2001 | 0.68    | 0.41    | ... |
| Germany  | 2002 | 0.71    | 0.48    | ... |
| Germany  | 2003 | ...     | ...     | ... |
| Spain    | 2000 | 0.51    | 0.56    | ... |
| Spain    | 2001 | 0.48    | 0.50    | ... |
| Spain    | 2002 | 0.50    | 0.53    | ... |
| Spain    | 2003 | ...     | ...     | ... |
| ...      | ...  | ...     | ...     | ... |
+----------+------+---------+---------+-----+

Dataset 2:

+----------+-----+------+--------------+-------+-----+
| Country  | CC  | Year | unemployment | Pop   | ... |
+----------+-----+------+--------------+-------+-----+
| Germany  | GER | 2000 | 0.03         | 79.50 | ... |
| Germany  | GER | 2001 | 0.05         | 79.53 | ... |
| Germany  | GER | 2002 | 0.04         | 79.80 | ... |
| Germany  | GER | 2003 | ...          | ...   | ... |
| Hungary  | HUN | 2000 | ...          | ...   | ... |
| Hungary  | HUN | 2001 | ...          | ...   | ... |
| Hungary  | HUN | 2002 | ...          | ...   | ... |
| Hungary  | HUN | 2003 | ...          | ...   | ... |
| Spain    | ESP | 2000 | 0.08         | 40.2  | ... |
| Spain    | ESP | 2001 | 0.11         | 40.5  | ... |
| Spain    | ESP | 2002 | 0.10         | 40.55 | ... |
| Spain    | ESP | 2003 | ...          | ...   | ... |
| ...      | ... | ...  | ...          | ...   | ... |
+----------+-----+------+--------------+-------+-----+

I want the merged data to look like this:


+----------+-----+------+---------+---------+--------------+-------+-----+
| Country  | CC  | Year | exports | imports | unemployment | Pop   | ... |
+----------+-----+------+---------+---------+--------------+-------+-----+
| Germany  | GER | 2000 | 0.70    | 0.40    | 0.03         | 79.50 | ... |
| Germany  | GER | 2001 | 0.68    | 0.41    | 0.05         | 79.53 | ... |
| Germany  | GER | 2002 | 0.71    | 0.48    | 0.04         | 79.80 | ... |
| Germany  | GER | 2003 | ...     | ...     | ...          | ...   | ... |
| Spain    | ESP | 2000 | 0.51    | 0.56    | 0.08         | 40.2  | ... |
| Spain    | ESP | 2001 | 0.48    | 0.50    | 0.11         | 40.5  | ... |
| Spain    | ESP | 2002 | 0.50    | 0.53    | 0.10         | 40.55 | ... |
| Spain    | ESP | 2003 | ...     | ...     | ...          | ...   | ... |
| ...      | ... | ...  | ...     | ...     | ...          | ...   | ... |
+----------+-----+------+---------+---------+--------------+-------+-----+

So, the countries which are not in dataset 1 (like Hungary in this case) are not in the merged dataset and the country code is also in the new dataset. Could someone tell me how I can achieve this? I have 28 years for about 100 countries each. So using a function in which I have to specify every combination would not be handy...

I tried to merge it with merge() but did not succeed since it just created hundreds of rows with the same country and year combination.

merge absolutely should work for this. You should specify that you are merging on two columns.

merge( df1 , df2 , by=c( "Country", "Year") )

Also confirm that the class of the merging vars is the same

sapply( df1[, c( "Country", "Year")] , class )
sapply( df2[, c( "Country", "Year")] , class )

confirm that the variables are spelled the same way in both data frames

intersect( names( df1 ) , names( df2 ))

Finally confirm that year and country are unique in both data.frames

sum( duplicated( df1[ ,c( "Country", "Year") ] ))
sum( duplicated( df2[ ,c( "Country", "Year") ] ))

您可以使用dplyr包中的inner_join()执行此dplyr

dplyr::inner_join(df1, df2, by=c("Country", "Year"))

The answer with merge() worked! Now I am facing the problem that eg Spain does not have any unemployment data for the year 2000. However, I still want to add all years of Spain and would like to have a NA in the unemployment column for Spain in 2000 in the merged dataset. How can I achieve this?

I tried to use merge(df1, df2, all.x = TRUE) but sometimes it just creates NA's for some reason...

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM