I have looked everywhere and I cannot seem to find a workable solution to this small problem I am having.
I have two large data sets, N=875 and N=922:
df.1: data set with 875 obs and 27 var
df.2: data set with 922 obs and 23 var
df.1 has columns FIRST and LAST, which are the first and last names of the individuals, and 25 other columns.
df.2 has columns X1 and X2, which correspond to the first and last names of the individuals, and 21 other columns.
I need to merge df.1 with df.2 and throw away any observations that are not in both frames.
So I should end up with a data frame with fewer than 875 observations and 48 columns.
Any suggestions?
Thanks
If the name columns had exactly the same names in both data frames, you could try
merge(df1, df2, by = c('X1', 'X2'), all = FALSE)
In your case the key columns are named differently, so you will need
merge(df.1, df.2, by.x = c('FIRST', 'LAST'), by.y = c('X1', 'X2'))
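To see what that call returns, here is a minimal sketch on two tiny stand-in frames. The rows and the score and rank columns are made up for illustration; only the key column names come from the question.

```r
# Toy stand-ins for df.1 and df.2 (made-up rows, not the real data).
df.1 <- data.frame(FIRST = c("Ann", "Bob", "Cy"),
                   LAST  = c("Lee", "Ray", "Fox"),
                   score = c(10, 20, 30),
                   stringsAsFactors = FALSE)
df.2 <- data.frame(X1   = c("Ann", "Cy", "Dee"),
                   X2   = c("Lee", "Fox", "Kim"),
                   rank = c(1, 2, 3),
                   stringsAsFactors = FALSE)

# Inner join: all = FALSE is the default, so unmatched rows are dropped.
both <- merge(df.1, df.2, by.x = c("FIRST", "LAST"), by.y = c("X1", "X2"))

nrow(both)   # rows present in both frames
ncol(both)   # ncol(df.1) + ncol(df.2) - 2: the key columns appear once
names(both)  # key columns keep the names given in by.x
```

The same column arithmetic gives 27 + 23 - 2 = 48 columns for the real data.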
For the data set provided:
library(XML)
url1 <- "http://stats.nhlnumbers.com/player_stats/year/2010"
df1 <- readHTMLTable(url1)
names <- data.frame(do.call(rbind, strsplit(as.character(df1[[1]][ ,1]),
split = ", ")))
df1 <- cbind(df1, names)
#head(df1)
url2 <- "http://stats.nhlnumbers.com/player_stats/year/2009"
df2 <- readHTMLTable(url2)
names2 <- data.frame(do.call(rbind, strsplit(as.character(df2[[1]][ ,1]),
split = ", ")))
df2 <- cbind(df2, names2)
#head(df2)
df1_2 <- merge(df1, df2, by = c('X1', 'X2'), all = F)
head(subset(df1_2, select = c('X1', 'X2', 'skaters-data.Name.x',
'NULL.PTS/$MM.x', 'NULL.PTS/$MM.y')))
df1_2$Player <- paste(df1_2$X2, df1_2$X1)
url3 <- "http://hockey-reference.com/leagues/NHL_2010_skaters.html"
df3 <- readHTMLTable(url3)
df3 <- as.data.frame(df3)
df4 <- merge(df1_2, df3, by.x = 'Player', by.y = 'stats.Player', all = F)
names(df4) <- gsub("[[:punct:]]", "_", names(df4))
head(subset(df4, select = c(X2, X1, Player, NULL_PTS__MM_x,
NULL_PTS__MM_y, stats_Rk)))
     X2      X1        Player NULL_PTS__MM_x NULL_PTS__MM_y stats_Rk
1 Aaron Johnson Aaron Johnson         18.519         15.573      344
2 Aaron    Rome    Aaron Rome          7.619          6.698      662
3 Aaron   Voros   Aaron Voros          7.000         16.000      825
4 Aaron    Ward    Aaron Ward          5.200          4.000      834
5  Adam  Burish   Adam Burish          5.614         12.632       95
6  Adam   Foote    Adam Foote          3.000          2.333      228
And maybe plot it
df5 <- subset(df4, select = c(X2, X1, Player, NULL_PTS__MM_x,
NULL_PTS__MM_y, stats_Rk))[1:10, ]
library(ggplot2)
ggplot(aes(x = as.numeric(NULL_PTS__MM_x), y = as.numeric(stats_Rk),
colour = Player), data = df5) +
geom_point()
Besides base::merge, one alternative is SQL.
You can use it in R with sqldf (but rename your data to df1 and df2, without the dots).
library(sqldf)
sqldf("SELECT *
FROM df1, df2
WHERE df1.FIRST = df2.X1
AND df1.LAST = df2.X2")
Another alternative is data.table; if you have big data sets, you should consider it:
library(data.table)
dt1 <- data.table(df1, key=c("FIRST", "LAST")) #set keys
dt2 <- data.table(df2, key=c("X1", "X2")) #set keys
dt1[dt2, nomatch = 0] #inner join; without nomatch = 0, unmatched rows of dt2 are kept and filled with NA
Starting from data.table versions >= 1.9, there's a function setDT that converts a data.frame (and also a list) to a data.table by reference. That makes things much faster and more memory efficient (especially in cases where your data is 5GB and you have 8GB of RAM), because no copy is made. So, it can be done as:
require(data.table) # >= 1.9
setDT(df1) # df1 will be a data.table
setDT(df2) # df2 will be a data.table
setkey(df1, FIRST, LAST)
setkey(df2, X1, X2)
df1[df2, nomatch = 0] # inner join; without nomatch = 0, unmatched rows of df2 are kept and filled with NA
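If setting keys is not desirable, the join can also be written with the on= argument. The sketch below assumes data.table >= 1.9.6 (where on= is available) and uses made-up toy rows; the score and rank columns are hypothetical.

```r
library(data.table)  # the on= argument assumes data.table >= 1.9.6

# Toy stand-ins for the two frames (made-up rows, not the real data).
dt1 <- data.table(FIRST = c("Ann", "Bob", "Cy"),
                  LAST  = c("Lee", "Ray", "Fox"),
                  score = c(10, 20, 30))
dt2 <- data.table(X1   = c("Ann", "Cy", "Dee"),
                  X2   = c("Lee", "Fox", "Kim"),
                  rank = c(1, 2, 3))

# Ad hoc join: FIRST is matched to X1 and LAST to X2, no keys needed.
# nomatch = 0 drops rows of dt2 that have no partner in dt1 (inner join).
inner <- dt1[dt2, on = c(FIRST = "X1", LAST = "X2"), nomatch = 0]

nrow(inner)  # number of individuals present in both tables
```

Because no key is set, neither table is reordered as a side effect of the join.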