I have read about all Q&A on rolling sums on this website but I can't make sense of most of the complex code so my tweaking skills are limited. I tried implementing a few solutions proposed, here , here , and here amongst others but either I get errors or my computer crashes, even when I only use 1,000 rows and 3 columns. Hence clearly, I mess up the code.
My data looks like this (first 50 rows via dput). Total dataset is about 100,000 rows
structure(list(pnum = c("4778744", "4778744", "4778744", "4832724",
"4840655", "4854957", "4952026", "4832724", "4832724", "4840655",
"4952026", "4854957", "4952026", "4979975", "5062877", "5062877",
"4979975", "4979975", "4979975", "5093287", "5148510", "5093287",
"5148510", "5093287", "5148510", "5093287", "5148510", "5093287",
"5148510", "5093287", "5148510", "5093287", "5148510", "5212120",
"5375012", "5168079", "5375012", "5212120", "5212120", "5168079",
"4811345", "4851990", "4947366", "5142672", "5317715", "4878166",
"4851990", "5142672", "5317715", "4878166", "5142672", "5317715",
"4878166", "5142672", "5317715", "4878166", "5142672", "5317715",
"4878166", "5185878", "4926323", "4926323", "4926323", "4926323",
"5185878", "4926323", "4926323", "4926323", "4926323", "4926323",
"4926323", "5129067", "5136697", "5210841", "5237700", "5237700",
"5237700", "5247644", "5805912", "5828869", "5357626", "5247644",
"5805912", "5828869", "5357626"), ID = c("03859643-1", "04488864-4",
"04560399-1", "03859643-1", "03859643-1", "03859643-1", "03859643-1",
"03901719-2", "04086089-2", "04086089-2", "04407934-2", "04488864-4",
"04952026-3", "03859643-1", "03859643-1", "03901719-2", "03912481-3",
"03940277-1", "04979975-2", "03859643-1", "03859643-1", "03864113-1",
"03864113-1", "04877300-1", "04877300-1", "04877300-3", "04877300-3",
"05040862-3", "05040862-3", "05093287-4", "05093287-4", "05093287-6",
"05093287-6", "03859643-1", "03859643-1", "03859643-1", "03870399-2",
"03901719-2", "03923529-1", "04784976-1", "03860454-2", "03860454-2",
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04761567-2",
"04870622-2", "04870622-2", "04870622-2", "04878166-2", "04878166-2",
"04878166-2", "04878166-3", "04878166-3", "04878166-3", "04878166-5",
"04878166-5", "04878166-5", "03860454-2", "03860454-2", "04610004-1",
"04734852-2", "04734852-3", "04761567-2", "04761567-2", "04777587-1",
"04835414-1", "04878166-2", "04926323-10", "04926323-5", "03860454-2",
"03860454-2", "03860454-2", "03860454-2", "05237700-2", "05237700-3",
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04731737-1",
"04731737-1", "04731737-1", "04731737-1"), Time = c(1986L, 1986L,
1986L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L,
1988L, 1988L, 1989L, 1989L, 1989L, 1989L, 1989L, 1989L, 1990L,
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L,
1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L,
1991L, 1991L, 1986L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L,
1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L,
1987L, 1987L, 1987L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L,
1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1989L, 1989L, 1990L,
1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L,
1991L, 1991L)), .Names = c("pnum", "inventor", "pryear"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 325L,
326L, 327L, 328L, 329L, 330L, 331L, 332L, 333L, 334L, 335L, 336L,
337L, 338L, 339L, 340L, 341L, 342L, 343L, 344L, 345L, 346L, 347L,
348L, 349L, 350L, 351L, 352L, 353L, 354L, 355L, 356L, 357L, 358L,
359L, 360L, 361L, 362L, 363L, 364L, 365L, 366L, 367L, 368L, 369L
), class = "data.frame")
Multiple inventors
collaborate on a project pnum
in a specific year called pryear
. I am looking for three things:
After comments from @Thierry I changed the data sample to ensure that the problem he pointed out was dealt with.
pryear
, thus if year of current project is 1977, I want the number of projects conducted from 1974 until 1976 included. If there are no occurrences before, ideally the result would be '0'. the answer provided by @Alex here can be used to achieve this first goal. But as discussed in the comments, it is not highly efficient (especially as my time range is from 1952 to 2010 with over 50,000 inventors). Here is a solution for you first question. You can solve the other ones as an exercise.
The first solutions uses only dplyr
. You will probably run into problems with large datasets.
library(dplyr)
df %>%
inner_join(
df %>%
select(inventor, oldyear = pryear),
by = "inventor") %>%
filter(pryear - 3 <= oldyear, oldyear < pryear) %>%
group_by(inventor, pryear) %>%
summarise(projects = n())
The second solutions use dplyr
with a database back-end. That should be able to cope with larger datasets. Note that the code is very similar.
library(RSQLite)
library(dplyr)
conn <- dbConnect(SQLite(), "test")
dbWriteTable(conn, "project", df)
src <- src_sqlite("test")
tbl(src, "project") %>%
inner_join(
tbl(src, "project") %>%
select(inventor, oldyear = pryear),
by = "inventor") %>%
filter(pryear - 3 <= oldyear, oldyear < pryear) %>%
group_by(inventor, pryear) %>%
summarise(projects = n()) %>%
collect()
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.