
Merge data sets based on two column values

I have looked everywhere and I cannot seem to find a workable solution to this small problem I am having.

I have two large data sets, N=875 and N=922.

df.1 data set with 875 obs and 27 var
df.2 data set with 922 obs and 23 var

df.1 has columns FIRST and LAST which are the first and last names of the individuals, and 25 other columns.

df.2 has columns X1 and X2 which correspond to the first and last names of the individuals, and 21 other columns.

I need to merge df.1 with df.2 and throw away any observations that are not in both frames.

So I should now have a data frame with fewer than 875 observations and 48 columns.

Any suggestions?

Thanks

If the variable names were exactly the same in df1 and df2, you could try

merge(df1, df2, by = c('X1', 'X2'), all = FALSE)

In your case, you will need

merge(df.1, df.2, by.x = c('FIRST', 'LAST'), by.y = c('X1', 'X2'))
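As a minimal sketch of what that call does, here is the same merge on two tiny made-up data frames. The rows and the extra columns team and goals are invented for illustration; only FIRST, LAST, X1 and X2 come from the question. With the default all = FALSE, only people present in both frames survive, and the key columns keep the names from the first frame.

```r
# Toy stand-ins for df.1 and df.2 (hypothetical rows and extra columns)
df.1 <- data.frame(FIRST = c("Aaron", "Adam", "Zed"),
                   LAST  = c("Ward", "Foote", "Nobody"),
                   team  = c("ANA", "COL", "XXX"),
                   stringsAsFactors = FALSE)
df.2 <- data.frame(X1    = c("Adam", "Aaron", "Yan"),
                   X2    = c("Foote", "Ward", "Other"),
                   goals = c(1, 2, 3),
                   stringsAsFactors = FALSE)

# Inner join: match FIRST with X1 and LAST with X2
merged <- merge(df.1, df.2, by.x = c("FIRST", "LAST"), by.y = c("X1", "X2"))

# Rows of df.1 that found no partner in df.2 (handy for spotting
# mismatches caused by stray whitespace or different capitalisation)
dropped <- df.1[!paste(df.1$FIRST, df.1$LAST) %in% paste(df.2$X1, df.2$X2), ]

print(merged)
print(dropped)
```

The merged frame has 3 + 3 - 2 = 4 columns, which is the same arithmetic that gives 27 + 23 - 2 = 48 for the real data.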

For the data set provided:

library(XML)
url1 <- "http://stats.nhlnumbers.com/player_stats/year/2010"
df1 <- readHTMLTable(url1)
names <- data.frame(do.call(rbind, strsplit(as.character(df1[[1]][ ,1]),
                                            split = ", ")))
df1 <- cbind(df1, names)
#head(df1)

url2 <- "http://stats.nhlnumbers.com/player_stats/year/2009"
df2 <- readHTMLTable(url2)
names2 <- data.frame(do.call(rbind, strsplit(as.character(df2[[1]][ ,1]),
                                             split = ", ")))
df2 <- cbind(df2, names2)
#head(df2)

df1_2 <- merge(df1, df2, by = c('X1', 'X2'), all = FALSE)
head(subset(df1_2, select = c('X1', 'X2', 'skaters-data.Name.x',
                              'NULL.PTS/$MM.x', 'NULL.PTS/$MM.y')))
df1_2$Player <- paste(df1_2$X2, df1_2$X1)
url3 <- "http://hockey-reference.com/leagues/NHL_2010_skaters.html"
df3 <- readHTMLTable(url3)
df3 <- as.data.frame(df3)

df4 <- merge(df1_2, df3, by.x = 'Player', by.y = 'stats.Player', all = FALSE)
names(df4) <- gsub("[[:punct:]]", "_", names(df4))
head(subset(df4, select = c(X2, X1, Player, NULL_PTS__MM_x,
                            NULL_PTS__MM_y, stats_Rk)))

     X2      X1        Player NULL_PTS__MM_x NULL_PTS__MM_y stats_Rk
1 Aaron Johnson Aaron Johnson         18.519         15.573      344
2 Aaron    Rome    Aaron Rome          7.619          6.698      662
3 Aaron   Voros   Aaron Voros          7.000         16.000      825
4 Aaron    Ward    Aaron Ward          5.200          4.000      834
5  Adam  Burish   Adam Burish          5.614         12.632       95
6  Adam   Foote    Adam Foote          3.000          2.333      228

And maybe plot it

df5 <- subset(df4, select = c(X2, X1, Player, NULL_PTS__MM_x,
                              NULL_PTS__MM_y, stats_Rk))[1:10, ]

library(ggplot2)
ggplot(aes(x = as.numeric(NULL_PTS__MM_x), y = as.numeric(stats_Rk),
       colour = Player), data = df5) +
  geom_point()

[Figure: scatter plot produced by the code above]

Besides base::merge, one alternative is to use SQL.

You can do that in R with sqldf (but rename your data frames to df1 and df2, without dots).

library(sqldf)
sqldf("SELECT *
       FROM df1, df2
       WHERE df1.FIRST = df2.X1
         AND df1.LAST = df2.X2")
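Note that SELECT * over two tables returns the key columns from both frames (FIRST, LAST, X1 and X2). A hedged sketch of a variant that uses an explicit INNER JOIN and keeps the keys only once is shown below. The toy rows and the columns team and goals are invented for illustration, and with the real data you would list the df2 columns you want to keep.

```r
library(sqldf)

# Toy stand-ins (hypothetical rows and extra columns)
df1 <- data.frame(FIRST = c("Aaron", "Adam", "Zed"),
                  LAST  = c("Ward", "Foote", "Nobody"),
                  team  = c("ANA", "COL", "XXX"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X1    = c("Adam", "Aaron", "Yan"),
                  X2    = c("Foote", "Ward", "Other"),
                  goals = c(1, 2, 3),
                  stringsAsFactors = FALSE)

# All columns of df1, plus only the non-key column of df2
joined <- sqldf("SELECT df1.*, df2.goals
                 FROM df1
                 INNER JOIN df2
                   ON df1.FIRST = df2.X1
                  AND df1.LAST  = df2.X2")
print(joined)
```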

Another alternative is data.table, which you should consider if you have big data sets:

library(data.table)
dt1 <- data.table(df1, key = c("FIRST", "LAST")) # set keys
dt2 <- data.table(df2, key = c("X1", "X2"))      # set keys
dt1[dt2, nomatch = 0] # inner join: drop rows of dt2 with no match in dt1

Starting with data.table version 1.9, there is a function setDT that converts a data.frame (or a list) to a data.table by reference. That makes things much faster and more memory-efficient (especially in cases where your data is 5GB and you have 8GB of RAM). So it can be done as:

require(data.table) # >= 1.9
setDT(df1) # df1 will be a data.table
setDT(df2) # df2 will be a data.table
setkey(df1, FIRST, LAST)
setkey(df2, X1, X2)
df1[df2, nomatch = 0] # inner join: drop rows of df2 with no match in df1
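One thing worth checking with the keyed join syntax: by default X[Y] returns one row for every row of Y and fills the columns from X with NA where there is no match, so it is not an inner join unless nomatch = 0 is given. The sketch below shows the difference on invented toy data (the columns pts_2010 and pts_2009 and all rows are hypothetical).

```r
library(data.table)

# Toy stand-ins (hypothetical rows and extra columns)
df1 <- data.frame(FIRST    = c("Aaron", "Adam", "Zed"),
                  LAST     = c("Ward", "Foote", "Nobody"),
                  pts_2010 = c(5.2, 3.0, 1.0),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X1       = c("Adam", "Aaron", "Yan"),
                  X2       = c("Foote", "Ward", "Other"),
                  pts_2009 = c(2.333, 4.0, 9.9),
                  stringsAsFactors = FALSE)

setDT(df1)                 # convert by reference, no copy
setDT(df2)
setkey(df1, FIRST, LAST)
setkey(df2, X1, X2)

all_of_df2 <- df1[df2]               # keeps "Yan Other" with pts_2010 = NA
inner      <- df1[df2, nomatch = 0]  # keeps only names present in both

print(all_of_df2)
print(inner)
```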
