简体   繁体   中英

Combining two columns in a data frame and creating a new column in an existing data frame in R

I am trying to combine information from multiple data frames. Suppose I have a master data frame called "master" regarding information about a bunch of cars. The Master data frame has ALL IDs possible. Then suppose I have 2 other data frames, each with SOME of the ids and the color of the cars but not all. I want to put all of the colors of the cars, corresponding to their IDs in the existing Master data frame.

Master df looks like this:

 id     year
 4D     2005
 5A     2003    
 7T     1999
 8Q     2001
 3G     1998
 6N     2009
 2B     2013

df1 looks like this:

id     color
5A     black
7T     green
8Q     gold
4D     white

df2 looks like this:

id     color
5A     NA
6N     purple
3G     NA
2B     orange

Here's the code to make the 4 data frames:

master <- data.frame(id = c("4D", "5A", "7T", "8Q", "3G", "6N", "2B"), year=c(2005,2003,1999,2001,1998,2009,2013))
df1 <- data.frame(id=c("5A", "7T", "8Q", "4D"), color=c("black", "green", "gold", "white"))
df2 <- data.frame(id=c("5A", "6N", "3G", "2B"), color=c(NA,"purple", NA, "orange"))

I want the final master data frame to look like this:

 id     year     color
 4D     2005     white
 5A     2003     black    
 7T     1999     green
 8Q     2001     gold
 3G     1998     NA
 6N     2009     purple
 2B     2013     orange

If there is value in one data frame and an NA in the other data frame for the same ID (ie: ID 5A has "black" in df1 but NA in df2), the value should be present in the spreadsheet.

Any thoughts? I've looked at the merge and stack functions but I don't think these are the ways to go about it?

df <- Master
df1
df2

dft <- rbind(df1, df2)

df3 <- dft[with(dft, order(id, color)), ]
df3 <- df3[!duplicated(df3$color), ]
merge(df, df3, by = 'id', all.x = T)

  id year  color
1 2B 2013 orange
2 3G 1998   <NA>
3 4D 2005  white
4 5A 2003  black
5 6N 2009 purple
6 7T 1999  green
7 8Q 2001   gold

Here is another option.
1, My approach is to combine df1 and df2 first then remove those color is NA using complete.cases
2, Then merge master with the new df3. ( all=TRUE , ensure those id not included in df3 will shown up in the combined data.frame)

df3 <- rbind(df1,df2)                      #row combine for df1 and df2
df3 <- df3[complete.cases(df3),]           #remove row with NA
merge(master, df3, by="id", all=TRUE)      #merge master and df3 based on id

Result is

  id year  color
1 2B 2013 orange
2 3G 1998   <NA>
3 4D 2005  white
4 5A 2003  black
5 6N 2009 purple
6 7T 1999  green
7 8Q 2001   gold

Cheers.

You could also use the sqldf package once you combine your df1 and df2 data frames by row:

*Assumes the same input from your example above (eg values for master , df1 , df2 )

> id_color <- rbind(df1, df2)

> library(sqldf)

> data <- sqldf("SELECT m.*, i.color FROM master m, id_color i WHERE m.id = i.id")

> data
  id year  color
1 4D 2005  white
2 5A 2003   <NA>
3 5A 2003  black
4 7T 1999  green
5 8Q 2001   gold
6 3G 1998   <NA>
7 6N 2009 purple
8 2B 2013 orange

> data <- sqldf("SELECT id, year, MIN(color) FROM data GROUP BY id, year") #removes duplicate with NULL

> data
  id year MIN(color)
1 2B 2013     orange
2 3G 1998       <NA>
3 4D 2005      white
4 5A 2003      black
5 6N 2009     purple
6 7T 1999      green
7 8Q 2001       gold

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM