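How to join two data frames by group?
I have a data frame (DF) in which, for each CompanyID, I have the directors who worked there in 2006 and 2007, plus two pieces of information about each of them (gender and age).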
DF <-
CompanyID Name Country ISIN Director_2006 Gender_2006 Yearold_2006 Director_2007 Gender_2007 Yearold_2007
25830 BANKxxx Austria AT000504 11734844255 M 54 11734844255 M 55
25830 BANKxxx Austria AT000504 187836811559 F 45 5524344997 F NA
25830 BANKxxx Austria AT000504 5524344997 F NA 5524354997 M 39
25830 BANKxxx Austria AT000504 5524354997 M 38 5742347684 M 38
25830 BANKxxx Austria AT000504 6613115791 M 41 40160443378 M 30
12339 BANKyyy Belgium AT034003 9855321789 M 44 9855321789 M 45
12339 BANKyyy Belgium AT034003 277520199 M NA 23779351 F 34
I have a second data frame (DF2) that gives, for each DirectorID (first column), the years of experience (third column) in different years (second column).
DF2 <-
DirectorID Year YearsExperience
11734844255 2006 0.4
11734844255 2007 1.4
187836811559 2006 1.5
5524344997 2006 2.4
5524344997 2007 3.4
5524354997 2006 1.8
5524354997 2007 2.8
5742347684 2007 3.5
40160443378 2007 4.3
9855321789 2005 2.6
9855321789 2006 3.6
9855321789 2007 4.6
277520199 2006 1.6
23779351 2007 3.2
55443322 2005 2.5
55443322 2006 3.5
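I want to join the information from these two data frames to create one new column for each of the two years (2006 and 2007), containing the years of experience of each director at each company, i.e. the columns Experience_2006 and Experience_2007.
So my expected output looks like: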
DF_final <-
CompanyID Name Country ISIN Director_2006 Gender_2006 YearBirth_2006 Experience_2006 Director_2007 Gender_2007 YearBirth_2007 Experience_2007
25830 BANKxxx Austria AT000504 11734844255 M 54 0.4 11734844255 M 55 1.4
25830 BANKxxx Austria AT000504 187836811559 F 45 1.5 5524344997 F NA 3.4
25830 BANKxxx Austria AT000504 5524344997 F NA 2.4 5524354997 M 39 2.8
25830 BANKxxx Austria AT000504 5524354997 M 38 1.8 5742347684 M 38 3.5
25830 BANKxxx Austria AT000504 6613115791 M 41 NA 40160443378 M 30 4.3
12339 BANKyyy Belgium AT034003 9855321789 M 44 3.6 9855321789 M 45 4.6
12339 BANKyyy Belgium AT034003 277520199 M NA 1.6 23779351 F 34 3.2
Could someone please tell me how to do this? Thanks.
Data
DF <- read.table(text =
"CompanyID Name Country ISIN Director_2006 Gender_2006 YearBirth_2006 Director_2007 Gender_2007 YearBirth_2007
25830 BANKxxx Austria AT000504 11734844255 M 54 11734844255 M 55
25830 BANKxxx Austria AT000504 187836811559 F 45 5524344997 F NA
25830 BANKxxx Austria AT000504 5524344997 F NA 5524354997 M 39
25830 BANKxxx Austria AT000504 5524354997 M 38 5742347684 M 38
25830 BANKxxx Austria AT000504 6613115791 M 41 40160443378 M 30
12339 BANKyyy Belgium AT034003 9855321789 M 44 9855321789 M 45
12339 BANKyyy Belgium AT034003 277520199 M NA 23779351 F 34",
header = TRUE, stringsAsFactors = FALSE)
DF2 <- read.table(text =
"DirectorID Year YearsExperience
11734844255 2006 0.4
11734844255 2007 1.4
187836811559 2006 1.5
5524344997 2006 2.4
5524344997 2007 3.4
5524354997 2006 1.8
5524354997 2007 2.8
5742347684 2007 3.5
40160443378 2007 4.3
9855321789 2005 2.6
9855321789 2006 3.6
9855321789 2007 4.6
277520199 2006 1.6
23779351 2007 3.2
55443322 2005 2.5
55443322 2006 3.5",
header = TRUE, stringsAsFactors = FALSE)
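For completeness, I used dplyr and tidyr and benchmarked my version against the other functions.
Update: I wrote another version of @Jimbou's answer, myfun4(), that does not use the filter and select functions. It is the fastest in my benchmark. Ralf's answer now comes second. My initial version (myfun3()) comes third.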
microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3(),myfun4())
Unit: milliseconds
expr min lq mean median uq max neval
myfun1() 23.1527 28.36865 31.322275 31.53225 33.69430 52.7319 100
myfun2() 5.2549 5.78445 8.241408 8.25995 9.63870 14.4018 100
myfun3() 7.9534 10.15115 11.976498 11.40415 13.66255 20.9362 100
myfun4() 2.9676 3.40105 5.032863 4.87115 5.56065 19.0217 100
Function code:
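library(dplyr) # provides left_join() and %>%
myfun4<-function(){
  # Join each year's slice of DF2 (Year column dropped) on that year's director column
  DF_final<-left_join(DF,DF2[DF2$Year==2006,-2],by=c('Director_2006'='DirectorID')) %>%
    left_join(DF2[DF2$Year==2007,-2],by=c('Director_2007'='DirectorID'))
  # The two joined columns arrive as YearsExperience.x/.y; rename them by year
  n=dim(DF_final)[2]
  colnames(DF_final)[(n-1):n]=paste0('YearsExperience_',2006:2007)
  DF_final
}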
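myfun3<-function(){
  # Widen DF2 to one row per director; [,-2] drops the 2005 column
  DF2_spread<-tidyr::spread(DF2,Year,YearsExperience)[,-2]
  colnames(DF2_spread)=c('DirectorID',paste0('Experience_',colnames(DF2_spread)[2:3]))
  # Join each year's experience on that year's director column
  DF_final<-dplyr::left_join(DF,DF2_spread[,c(1,2)],by=c('Director_2006'='DirectorID'))
  DF_final<-dplyr::left_join(DF_final,DF2_spread[,c(1,3)],by=c('Director_2007'='DirectorID'))
  DF_final
}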
myfun2<-function() {
DF1 <- reshape(DF, direction = "long", varying = names(DF)[5:10], sep = "_", timevar = "Year")
DF3 <- merge(DF1, DF2, all.x = TRUE, by.x = c("Director" , "Year"), by.y = c("DirectorID", "Year"))
DF_final<-reshape(DF3, direction = "wide", v.names = names(DF3)[c(1,7,8,10)], timevar = "Year", sep = "_")
}
myfun1<-function(){
DF %>%
left_join(DF2 %>%
filter(Year == 2006) %>%
select(DirectorID,YearsExperience_2006=YearsExperience),
by=c("Director_2006" = "DirectorID")) %>%
left_join(DF2 %>%
filter(Year == 2007) %>%
select(DirectorID,YearsExperience_2007=YearsExperience),
by=c("Director_2007" = "DirectorID"))
}
You can try
library(tidyverse)
DF %>%
left_join(DF2 %>%
filter(Year == 2006) %>%
select(DirectorID,YearsExperience_2006=YearsExperience),
by=c("Director_2006" = "DirectorID")) %>%
left_join(DF2 %>%
filter(Year == 2007) %>%
select(DirectorID,YearsExperience_2007=YearsExperience),
by=c("Director_2007" = "DirectorID"))
CompanyID Name Country ISIN Director_2006 Gender_2006 YearBirth_2006 Director_2007 Gender_2007
1 25830 BANKxxx Austria AT000504 11734844255 M 54 11734844255 M
2 25830 BANKxxx Austria AT000504 187836811559 F 45 5524344997 F
3 25830 BANKxxx Austria AT000504 5524344997 F NA 5524354997 M
4 25830 BANKxxx Austria AT000504 5524354997 M 38 5742347684 M
5 25830 BANKxxx Austria AT000504 6613115791 M 41 40160443378 M
6 12339 BANKyyy Belgium AT034003 9855321789 M 44 9855321789 M
7 12339 BANKyyy Belgium AT034003 277520199 M NA 23779351 F
YearBirth_2007 YearsExperience_2006 YearsExperience_2007
1 55 0.4 1.4
2 NA 1.5 3.4
3 39 2.4 2.8
4 38 1.8 3.5
5 30 NA 4.3
6 45 3.6 4.6
7 34 1.6 3.2
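The two joins above follow the same pattern, so they can be generalised to any set of years. The following is a minimal sketch in base R and not part of the original answer: the helper name add_experience and the toy data frames DF_toy and DF2_toy are hypothetical, and it uses match() as a lookup in place of left_join(), which assumes the experience table has at most one row per DirectorID and Year.

```r
# Toy data with the same layout as DF and DF2 (values are illustrative only)
DF_toy <- data.frame(
  CompanyID     = c(1, 1, 2),
  Director_2006 = c(11, 12, 13),
  Director_2007 = c(11, 14, 13)
)
DF2_toy <- data.frame(
  DirectorID      = c(11, 11, 12, 13, 14),
  Year            = c(2006L, 2007L, 2006L, 2007L, 2007L),
  YearsExperience = c(0.4, 1.4, 1.5, 4.6, 3.5)
)

# Hypothetical helper: adds one Experience_<year> column per requested year.
# match() gives the position of the first matching DirectorID, or NA if absent,
# so directors without a record for that year get NA, as a left join would give.
add_experience <- function(df, exp, years) {
  for (yr in years) {
    slice <- exp[exp$Year == yr, ]
    idx <- match(df[[paste0("Director_", yr)]], slice$DirectorID)
    df[[paste0("Experience_", yr)]] <- slice$YearsExperience[idx]
  }
  df
}

DF_final_toy <- add_experience(DF_toy, DF2_toy, 2006:2007)
```

Unlike the chained joins, this appends the new columns at the end of the data frame and never duplicates rows, so it silently takes the first match if the experience table contains duplicate DirectorID/Year pairs.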
Using base R functions:
DF1 <- reshape(DF, direction = "long", varying = names(DF)[5:10], sep = "_", timevar = "Year")
DF3 <- merge(DF1, DF2, all.x = TRUE, by.x = c("Director" , "Year"), by.y = c("DirectorID", "Year"))
reshape(DF3, direction = "wide", v.names = names(DF3)[c(1,7,8,10)], timevar = "Year", sep = "_")
#> CompanyID Name Country ISIN id Director_2007 Gender_2007
#> 1 12339 BANKyyy Belgium AT034003 7 23779351 F
#> 3 25830 BANKxxx Austria AT000504 3 5524354997 M
#> 4 25830 BANKxxx Austria AT000504 2 5524344997 F
#> 5 25830 BANKxxx Austria AT000504 4 5742347684 M
#> 8 25830 BANKxxx Austria AT000504 5 40160443378 M
#> 9 12339 BANKyyy Belgium AT034003 6 9855321789 M
#> 11 25830 BANKxxx Austria AT000504 1 11734844255 M
#> YearBirth_2007 YearsExperience_2007 Director_2006 Gender_2006
#> 1 34 3.2 277520199 M
#> 3 39 2.8 5524344997 F
#> 4 NA 3.4 187836811559 F
#> 5 38 3.5 5524354997 M
#> 8 30 4.3 6613115791 M
#> 9 45 4.6 9855321789 M
#> 11 55 1.4 11734844255 M
#> YearBirth_2006 YearsExperience_2006
#> 1 NA 1.6
#> 3 NA 2.4
#> 4 45 1.5
#> 5 38 1.8
#> 8 41 NA
#> 9 44 3.6
#> 11 54 0.4
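The reshaped result above comes back in merge order and carries the helper id column created by the first reshape() call. If the original row order matters, one possible tidy-up is sketched below on hypothetical toy data (DF_toy and DF2_toy are not from the question); idvar is passed explicitly here, which the original call leaves at its default.

```r
# Toy data with the same layout as DF and DF2 (values are illustrative only)
DF_toy <- data.frame(
  CompanyID      = c(1, 1, 2),
  Director_2006  = c(11, 12, 13),
  YearBirth_2006 = c(54, 45, NA),
  Director_2007  = c(11, 14, 13),
  YearBirth_2007 = c(55, 39, 45)
)
DF2_toy <- data.frame(
  DirectorID      = c(11, 11, 12, 13, 14),
  Year            = c(2006L, 2007L, 2006L, 2007L, 2007L),
  YearsExperience = c(0.4, 1.4, 1.5, 4.6, 3.5)
)

# Same three steps as above: wide -> long, merge on (Director, Year), long -> wide
long <- reshape(DF_toy, direction = "long", varying = names(DF_toy)[2:5],
                sep = "_", timevar = "Year")
merged <- merge(long, DF2_toy, all.x = TRUE,
                by.x = c("Director", "Year"), by.y = c("DirectorID", "Year"))
wide <- reshape(merged, direction = "wide", idvar = "id", timevar = "Year",
                v.names = c("Director", "YearBirth", "YearsExperience"), sep = "_")

# Restore the original row order, then drop the helper column and stale row names
wide <- wide[order(wide$id), ]
wide$id <- NULL
rownames(wide) <- NULL
```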