
Rolling sums over different size windows with changing groups in R

I have read just about every Q&A on rolling sums on this website, but I can't make sense of most of the complex code, so my ability to adapt it is limited. I tried implementing several of the proposed solutions (here, here, and here, among others), but either I get errors or my computer crashes, even when I use only 1,000 rows and 3 columns. Clearly, I am messing up the code.

My data looks like this (first 50 rows via dput). The full dataset is about 100,000 rows.

df <- structure(list(pnum = c("4778744", "4778744", "4778744", "4832724", 
"4840655", "4854957", "4952026", "4832724", "4832724", "4840655", 
"4952026", "4854957", "4952026", "4979975", "5062877", "5062877", 
"4979975", "4979975", "4979975", "5093287", "5148510", "5093287", 
"5148510", "5093287", "5148510", "5093287", "5148510", "5093287", 
"5148510", "5093287", "5148510", "5093287", "5148510", "5212120", 
"5375012", "5168079", "5375012", "5212120", "5212120", "5168079", 
"4811345", "4851990", "4947366", "5142672", "5317715", "4878166", 
"4851990", "5142672", "5317715", "4878166", "5142672", "5317715", 
"4878166", "5142672", "5317715", "4878166", "5142672", "5317715", 
"4878166", "5185878", "4926323", "4926323", "4926323", "4926323", 
"5185878", "4926323", "4926323", "4926323", "4926323", "4926323", 
"4926323", "5129067", "5136697", "5210841", "5237700", "5237700", 
"5237700", "5247644", "5805912", "5828869", "5357626", "5247644", 
"5805912", "5828869", "5357626"), ID = c("03859643-1", "04488864-4", 
"04560399-1", "03859643-1", "03859643-1", "03859643-1", "03859643-1", 
"03901719-2", "04086089-2", "04086089-2", "04407934-2", "04488864-4", 
"04952026-3", "03859643-1", "03859643-1", "03901719-2", "03912481-3", 
"03940277-1", "04979975-2", "03859643-1", "03859643-1", "03864113-1", 
"03864113-1", "04877300-1", "04877300-1", "04877300-3", "04877300-3", 
"05040862-3", "05040862-3", "05093287-4", "05093287-4", "05093287-6", 
"05093287-6", "03859643-1", "03859643-1", "03859643-1", "03870399-2", 
"03901719-2", "03923529-1", "04784976-1", "03860454-2", "03860454-2", 
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04761567-2", 
"04870622-2", "04870622-2", "04870622-2", "04878166-2", "04878166-2", 
"04878166-2", "04878166-3", "04878166-3", "04878166-3", "04878166-5", 
"04878166-5", "04878166-5", "03860454-2", "03860454-2", "04610004-1", 
"04734852-2", "04734852-3", "04761567-2", "04761567-2", "04777587-1", 
"04835414-1", "04878166-2", "04926323-10", "04926323-5", "03860454-2", 
"03860454-2", "03860454-2", "03860454-2", "05237700-2", "05237700-3", 
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04731737-1", 
"04731737-1", "04731737-1", "04731737-1"), Time = c(1986L, 1986L, 
1986L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
1988L, 1988L, 1989L, 1989L, 1989L, 1989L, 1989L, 1989L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 
1991L, 1991L, 1986L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1989L, 1989L, 1990L, 
1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L, 
1991L, 1991L)), .Names = c("pnum", "inventor", "pryear"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 325L, 
326L, 327L, 328L, 329L, 330L, 331L, 332L, 333L, 334L, 335L, 336L, 
337L, 338L, 339L, 340L, 341L, 342L, 343L, 344L, 345L, 346L, 347L, 
348L, 349L, 350L, 351L, 352L, 353L, 354L, 355L, 356L, 357L, 358L, 
359L, 360L, 361L, 362L, 363L, 364L, 365L, 366L, 367L, 368L, 369L
), class = "data.frame")

Multiple inventors collaborate on a project pnum in a specific year pryear. (After comments from @Thierry, I changed the data sample to make sure the problem he pointed out is dealt with.)

I am looking for three things:

  1. The number of projects conducted by each individual inventor in an x-year (say 3-year) window before the current pryear. For example, if the current project's year is 1977, I want the number of projects conducted from 1974 through 1976 inclusive. If there are no earlier occurrences, the result should ideally be 0. The answer provided by @Alex here can be used to achieve this first goal, but as discussed in the comments, it is not very efficient (especially since my time range runs from 1952 to 2010, with over 50,000 inventors).
  2. The total number of different inventors with whom each inventor has worked in that same time window.
  3. If a project has multiple inventors, the number of times each inventor has collaborated, during the same past time window, with the other inventors working on the current project.

Here is a solution for your first question. You can solve the other ones as an exercise.

The first solution uses only dplyr. You will probably run into problems with large datasets.

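library(dplyr)

df %>%
  # One row per inventor-year, so repeated rows do not inflate the count
  distinct(inventor, pryear) %>%
  # Pair each inventor-year with every project year of the same inventor
  inner_join(
    df %>%
      select(inventor, oldyear = pryear),
    by = "inventor") %>%
  # Keep projects from the 3 years before pryear (pryear itself excluded)
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>%
  group_by(inventor, pryear) %>%
  # Inventor-years with no earlier project drop out instead of showing 0
  summarise(projects = n())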
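
The inner join and filter above drop every inventor-year that has no project inside the window, whereas the question asks for a 0 in that case. Below is a minimal sketch of one way to keep those rows: join first, then count matches inside summarise() instead of filtering. The `toy` data frame and the `lookback` variable are invented for illustration and are not part of the original post. dplyr 1.1 or later will warn about a many-to-many join here, which is expected.

```r
library(dplyr)

# Toy data in the same shape as the question's sample (values are invented).
toy <- data.frame(
  pnum     = c("p1", "p2", "p3", "p4", "p5"),
  inventor = c("a",  "a",  "a",  "b",  "a"),
  pryear   = c(1986L, 1988L, 1988L, 1988L, 1989L),
  stringsAsFactors = FALSE
)

lookback <- 3L  # hypothetical name for the window length x

# One row per inventor-year, so repeated rows cannot inflate the counts.
targets <- distinct(toy, inventor, pryear)

# Every project year of every inventor, renamed to avoid a name clash.
past <- select(toy, inventor, oldyear = pryear)

counts <- targets %>%
  left_join(past, by = "inventor") %>%
  group_by(inventor, pryear) %>%
  summarise(
    # TRUE for each project that falls in [pryear - lookback, pryear - 1]
    projects = sum(oldyear >= pryear - lookback & oldyear < pryear),
    .groups = "drop"
  )

print(counts)
```

Because nothing is filtered out before the summary, inventor "a" in 1986 and inventor "b" in 1988 stay in the result with a count of 0. The memory cost is the same as for the inner join, so this variant does not address the scaling concern.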

The second solution uses dplyr with a database back-end, which should be able to cope with larger datasets. Note that the code is very similar.

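library(DBI)
library(RSQLite)
library(dplyr)  # the database back-end also needs the dbplyr package installed

# Copy the data into an SQLite database file named "test"
conn <- dbConnect(SQLite(), "test")
dbWriteTable(conn, "project", df)

project <- tbl(conn, "project")  # lazy reference; the query runs in SQLite
project %>%
  distinct(inventor, pryear) %>%
  inner_join(
    project %>%
      select(inventor, oldyear = pryear),
    by = "inventor") %>%
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>%
  group_by(inventor, pryear) %>%
  summarise(projects = n()) %>%
  collect()  # pull the result back into R

dbDisconnect(conn)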
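
For the second goal in the question (the number of different inventors each inventor has worked with in the same window), the same join pattern can be extended. This is only a sketch under assumptions: `toy`, `lookback`, `pairs`, and `partner` are names invented here, and two inventors are treated as collaborators whenever they share a pnum. It runs in memory with dplyr and has not been tuned for 100,000 rows.

```r
library(dplyr)

# Toy data (invented): a works with b in 1986, with c in 1987, with b in 1989.
toy <- data.frame(
  pnum     = c("p1", "p1", "p2", "p2", "p3", "p3"),
  inventor = c("a",  "b",  "a",  "c",  "a",  "b"),
  pryear   = c(1986L, 1986L, 1987L, 1987L, 1989L, 1989L),
  stringsAsFactors = FALSE
)

lookback <- 3L  # hypothetical name for the window length x

# All ordered inventor/partner pairs that share a project, with its year.
pairs <- toy %>%
  inner_join(select(toy, pnum, partner = inventor), by = "pnum") %>%
  filter(inventor != partner) %>%
  select(inventor, partner, oldyear = pryear)

partners <- distinct(toy, inventor, pryear) %>%
  left_join(pairs, by = "inventor") %>%
  group_by(inventor, pryear) %>%
  summarise(
    # Distinct partners seen in [pryear - lookback, pryear - 1];
    # the is.na() check covers inventors who never had a co-inventor.
    n_partners = n_distinct(
      partner[!is.na(oldyear) & oldyear >= pryear - lookback & oldyear < pryear]
    ),
    .groups = "drop"
  )

print(partners)
```

In the toy data, inventor "a" has two distinct past partners by 1989 ("b" from 1986 and "c" from 1987), while every inventor's first year shows 0.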
