Adding new column in dataframe counting rows from another dataframe

I have searched the forums for a solution but couldn't find one.

I have data on companies' financials in one dataframe (df1) and data on acquisitions in another dataframe (df2). The data is in the same format as below.

df1 <- data.frame(ID=c('111111','111111', '222222', '333333', '444444'),
                  year=c(2010, 2011, 2010, 2011, 2011))
df2 <- data.frame(ID=c('111111', '111111', '111111', '111111', '333333'),
                  year=c(2010,2010,2010,2011,2011))

My goal is to create a new column in df1 that counts the observations in df2 matching both the ID and the year of each row in df1. In other words, I need a variable that counts the number of acquisitions made by each company each year. Below is the desired output.

#output should look like following in df1
# ID      year  count of observations in df2 per year
# 111111  2010  3
# 111111  2011  1
# 222222  2010  0
# 333333  2011  1
# 444444  2011  0

I have really tried to come up with a solution but haven't gotten close enough. I hope somebody has a solution for this problem.

Thank you in advance!

Probably the best way is to use left_join; you then only need to replace the NA values with 0:

df1 <- data.frame(ID=c('111111','111111', '222222', '333333', '444444'),
                  year=c(2010, 2011, 2010, 2011, 2011))
df2 <- data.frame(ID=c('111111', '111111', '111111', '111111', '333333'),
                  year=c(2010,2010,2010,2011,2011))

library(tidyverse)

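# Count the acquisitions in df2 per ID and year
summ_df2 <- df2 %>% count(ID, year)

# Attach the counts to df1; rows with no match in df2 get NA
df1 %>% left_join(summ_df2)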
#> Joining, by = c("ID", "year")
#> Warning: Column `ID` joining factors with different levels, coercing to
#> character vector
#>       ID year  n
#> 1 111111 2010  3
#> 2 111111 2011  1
#> 3 222222 2010 NA
#> 4 333333 2011  1
#> 5 444444 2011 NA

Created on 2019-01-29 by the reprex package (v0.2.1)


The same thing as a single chain, added following a comment by @Ronak Shah:

df1 <- data.frame(ID=c('111111','111111', '222222', '333333', '444444'),
                  year=c(2010, 2011, 2010, 2011, 2011))
df2 <- data.frame(ID=c('111111', '111111', '111111', '111111', '333333'),
                  year=c(2010,2010,2010,2011,2011))

library(tidyverse)

df2 %>% 
 count(ID, year) %>% 
 right_join(df1) %>% 
 replace_na(list(n = 0))

#> Joining, by = c("ID", "year")
#> Warning: Column `ID` joining factors with different levels, coercing to
#> character vector
#> # A tibble: 5 x 3
#>   ID      year     n
#>   <chr>  <dbl> <dbl>
#> 1 111111  2010     3
#> 2 111111  2011     1
#> 3 222222  2010     0
#> 4 333333  2011     1
#> 5 444444  2011     0

Created on 2019-01-29 by the reprex package (v0.2.1)
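The warning in the output above comes from ID being a factor with different levels in the two data frames, and depending on the dplyr version, right_join() may not return the rows in df1's order. A minimal sketch of a variant that sidesteps both, assuming it is acceptable to build the IDs as character vectors (stringsAsFactors = FALSE) and to start the chain from df1; the name `result` is only illustrative:

```r
library(dplyr)
library(tidyr)

# Same data as above, but with ID kept as character (assumption)
df1 <- data.frame(ID = c("111111", "111111", "222222", "333333", "444444"),
                  year = c(2010, 2011, 2010, 2011, 2011),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("111111", "111111", "111111", "111111", "333333"),
                  year = c(2010, 2010, 2010, 2011, 2011),
                  stringsAsFactors = FALSE)

result <- df1 %>%
  # Join the per-ID, per-year counts onto df1; df1's row order is kept
  left_join(count(df2, ID, year), by = c("ID", "year")) %>%
  # Company-years with no acquisitions come back as NA; turn them into 0
  replace_na(list(n = 0L))

result
```

Naming the join columns in `by` also silences the "Joining, by" message, which makes the matching rule explicit for anyone reading the code later.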

A non-tidyverse solution. I understand this looks more complex than the tidyverse one; I'm just sharing it so there is a variety of options.

df1 <- data.frame(ID=c('111111','111111', '222222', '333333', '444444'),
                  year=c(2010, 2011, 2010, 2011, 2011))

df2 <- data.frame(ID=c('111111', '111111', '111111', '111111', '333333'),
                  year=c(2010,2010,2010,2011,2011))


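# Build a composite ID_year key in both data frames
df1$key <- paste(df1$ID, df1$year, sep = "_")
df2$key <- paste(df2$ID, df2$year, sep = "_")

# For each key in df1, count how many df2 rows carry the same key
df1$count_of_year <- unlist(lapply(df1$key, function(x) sum(df2$key %in% x)))

# Drop the helper key column (column 3)
df1 <- df1[, c(1, 2, 4)]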

df1
#>       ID year count_of_year
#> 1 111111 2010             3
#> 2 111111 2011             1
#> 3 222222 2010             0
#> 4 333333 2011             1
#> 5 444444 2011             0

Created on 2019-01-29 by the reprex package (v0.2.1)
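The same key idea can be written without the per-row lapply by tabulating the df2 keys once. This is only a sketch of an alternative, not the answer's original method; `key1`, `key2` and `counts` are illustrative names, and the helper keys are kept in separate vectors so no column has to be dropped afterwards:

```r
df1 <- data.frame(ID = c("111111", "111111", "222222", "333333", "444444"),
                  year = c(2010, 2011, 2010, 2011, 2011))
df2 <- data.frame(ID = c("111111", "111111", "111111", "111111", "333333"),
                  year = c(2010, 2010, 2010, 2011, 2011))

# Composite ID_year keys, held outside the data frames
key1 <- paste(df1$ID, df1$year, sep = "_")
key2 <- paste(df2$ID, df2$year, sep = "_")

# Tabulate df2's keys over the keys that occur in df1;
# keys that never occur in df2 are counted as 0
counts <- table(factor(key2, levels = unique(key1)))

# Look up each df1 row's count by its key
df1$count_of_year <- as.integer(counts[key1])

df1
```

Because the table is built once, the work no longer grows with the number of df1 rows times the number of df2 rows, which may matter for larger data.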
