
Rolling sums over different size windows with changing groups R

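I have read all the Q&As on this site about rolling sums, but I cannot follow most of the more complex code, so my ability to adapt it is limited. I tried to implement some of the solutions proposed here, here, and here, but even when I use only 1,000 rows and 3 columns, I either get errors or my computer crashes. So clearly I am messing up the code.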

My data looks like this (the first 50 rows via dput). The full data set has about 100,000 rows.

df <- structure(list(pnum = c("4778744", "4778744", "4778744", "4832724", 
"4840655", "4854957", "4952026", "4832724", "4832724", "4840655", 
"4952026", "4854957", "4952026", "4979975", "5062877", "5062877", 
"4979975", "4979975", "4979975", "5093287", "5148510", "5093287", 
"5148510", "5093287", "5148510", "5093287", "5148510", "5093287", 
"5148510", "5093287", "5148510", "5093287", "5148510", "5212120", 
"5375012", "5168079", "5375012", "5212120", "5212120", "5168079", 
"4811345", "4851990", "4947366", "5142672", "5317715", "4878166", 
"4851990", "5142672", "5317715", "4878166", "5142672", "5317715", 
"4878166", "5142672", "5317715", "4878166", "5142672", "5317715", 
"4878166", "5185878", "4926323", "4926323", "4926323", "4926323", 
"5185878", "4926323", "4926323", "4926323", "4926323", "4926323", 
"4926323", "5129067", "5136697", "5210841", "5237700", "5237700", 
"5237700", "5247644", "5805912", "5828869", "5357626", "5247644", 
"5805912", "5828869", "5357626"), ID = c("03859643-1", "04488864-4", 
"04560399-1", "03859643-1", "03859643-1", "03859643-1", "03859643-1", 
"03901719-2", "04086089-2", "04086089-2", "04407934-2", "04488864-4", 
"04952026-3", "03859643-1", "03859643-1", "03901719-2", "03912481-3", 
"03940277-1", "04979975-2", "03859643-1", "03859643-1", "03864113-1", 
"03864113-1", "04877300-1", "04877300-1", "04877300-3", "04877300-3", 
"05040862-3", "05040862-3", "05093287-4", "05093287-4", "05093287-6", 
"05093287-6", "03859643-1", "03859643-1", "03859643-1", "03870399-2", 
"03901719-2", "03923529-1", "04784976-1", "03860454-2", "03860454-2", 
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04761567-2", 
"04870622-2", "04870622-2", "04870622-2", "04878166-2", "04878166-2", 
"04878166-2", "04878166-3", "04878166-3", "04878166-3", "04878166-5", 
"04878166-5", "04878166-5", "03860454-2", "03860454-2", "04610004-1", 
"04734852-2", "04734852-3", "04761567-2", "04761567-2", "04777587-1", 
"04835414-1", "04878166-2", "04926323-10", "04926323-5", "03860454-2", 
"03860454-2", "03860454-2", "03860454-2", "05237700-2", "05237700-3", 
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04731737-1", 
"04731737-1", "04731737-1", "04731737-1"), Time = c(1986L, 1986L, 
1986L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
1988L, 1988L, 1989L, 1989L, 1989L, 1989L, 1989L, 1989L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 
1991L, 1991L, 1986L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1989L, 1989L, 1990L, 
1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L, 
1991L, 1991L)), .Names = c("pnum", "inventor", "pryear"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 325L, 
326L, 327L, 328L, 329L, 330L, 331L, 332L, 333L, 334L, 335L, 336L, 
337L, 338L, 339L, 340L, 341L, 342L, 343L, 344L, 345L, 346L, 347L, 
348L, 349L, 350L, 351L, 352L, 353L, 354L, 355L, 356L, 357L, 358L, 
359L, 360L, 361L, 362L, 363L, 364L, 365L, 366L, 367L, 368L, 369L
), class = "data.frame")

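Following @Thierry's comment, I changed the data sample to make sure the issue he pointed out is addressed.

Several inventors collaborate on a project (pnum) in a given year (pryear). I am looking for three things: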

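  1. The number of projects each inventor worked on in a window of x (e.g. 3) years before the current pryear. So if the current project's year is 1977, I want the number of projects carried out from 1974 through 1976. If there were none, the ideal result is "0". The answer @Alex provided here can be used to achieve this first goal. However, as discussed in the comments, it is not very efficient (especially since my time range is 1952 to 2010, with more than 50,000 inventors).
  2. The total number of distinct inventors each inventor collaborated with in the same time window.
  3. If a project has several inventors, the number of times each inventor collaborated, within the same past time window, with the other inventors working on the current project.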

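Here is a solution to your first question. You can solve the other two as an exercise.

The first solution uses only dplyr. You may run into problems with large data sets.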

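library(dplyr)
df %>% 
  # self-join on inventor: pair each row with every project year of the same inventor
  inner_join(
    df %>% 
      select(inventor, oldyear = pryear), 
    by = "inventor") %>% 
  # keep only earlier projects inside the 3-year window before pryear
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>% 
  # count the matched rows per inventor and year
  group_by(inventor, pryear) %>% 
  summarise(projects = n())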
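Two things are worth checking in the output of the join above. Inventor-years with no earlier project are filtered out rather than reported as 0, and an inventor with several projects in the same pryear has each earlier project counted once per current row. The following is only a sketch of one way to handle both, not the answerer's method. It assumes the question's column names; the data frame `toy` and the variable `win` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the question's data
toy <- data.frame(
  pnum     = c("p1", "p2", "p3", "p4", "p5"),
  inventor = c("A",  "A",  "A",  "A",  "B"),
  pryear   = c(1986L, 1988L, 1988L, 1992L, 1988L),
  stringsAsFactors = FALSE
)

win <- 3L  # look-back window in years (assumed value)

# One row per inventor-year, so repeated current rows cannot inflate the count
current <- toy %>%
  distinct(inventor, pryear)

# One row per inventor-project, with the year renamed for the comparison
earlier <- toy %>%
  select(inventor, pnum, oldyear = pryear) %>%
  distinct()

# dplyr >= 1.1 may warn about a many-to-many join here; that is expected
prior_counts <- current %>%
  inner_join(earlier, by = "inventor") %>%
  group_by(inventor, pryear) %>%
  # summing a logical keeps groups with no match and gives them 0
  summarise(
    projects = sum(oldyear >= pryear - win & oldyear < pryear),
    .groups = "drop"
  )

print(prior_counts)
# A/1988 gets 1 (the 1986 project); every other inventor-year gets 0
```

Counting with `sum()` over a logical condition instead of `filter()` followed by `n()` is what keeps the zero rows. The result can be joined back onto the full data by inventor and pryear if one row per project is needed.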

The second solution uses dplyr with a database backend. That should be able to cope with larger data sets. Note that the code is very similar.

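library(DBI)
library(RSQLite)
library(dplyr)  # database backends also need the dbplyr package installed
# write the data to an SQLite file named "test"
conn <- dbConnect(SQLite(), "test")
dbWriteTable(conn, "project", df)
# tbl() on the DBI connection replaces src_sqlite(), which newer dplyr versions deprecate
project <- tbl(conn, "project")
project %>% 
  inner_join(
    project %>% 
      select(inventor, oldyear = pryear), 
    by = "inventor") %>% 
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>% 
  group_by(inventor, pryear) %>% 
  summarise(projects = n()) %>% 
  collect()  # run the query in SQLite and pull the result into R
dbDisconnect(conn)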
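For the second and third questions, the same self-join idea can be applied to pairs of inventors rather than to single inventors. What follows is a rough sketch under stated assumptions, not a tested solution for the full 100,000-row data. It uses the question's column names; `toy`, `win`, `links`, and `past` are hypothetical names introduced here, and each (pnum, inventor) pair is assumed to appear only once.

```r
library(dplyr)

# Hypothetical toy data: three projects with overlapping teams
toy <- data.frame(
  pnum     = c("p1", "p1", "p2", "p2", "p3", "p3", "p3"),
  inventor = c("A",  "B",  "A",  "C",  "A",  "B",  "D"),
  pryear   = c(1986L, 1986L, 1987L, 1987L, 1988L, 1988L, 1988L),
  stringsAsFactors = FALSE
)
win <- 3L  # look-back window in years (assumed value)

# Every ordered pair of different inventors sharing a project
# (dplyr >= 1.1 may warn about many-to-many joins; expected here)
links <- toy %>%
  inner_join(select(toy, pnum, partner = inventor), by = "pnum") %>%
  filter(inventor != partner)

# The same pairs, viewed as collaboration history
past <- links %>%
  select(inventor, partner, oldyear = pryear)

# Question 2: distinct partners in the window before each inventor-year
partners <- toy %>%
  distinct(inventor, pryear) %>%
  left_join(past, by = "inventor") %>%
  group_by(inventor, pryear) %>%
  summarise(
    n_partners = n_distinct(
      partner[!is.na(oldyear) & oldyear >= pryear - win & oldyear < pryear]
    ),
    .groups = "drop"
  )

# Question 3: past collaborations, within the window, with the other
# inventors on the current project (solo projects do not appear in links)
repeats <- links %>%
  inner_join(past, by = c("inventor", "partner")) %>%
  group_by(pnum, inventor, pryear) %>%
  summarise(
    n_repeat = sum(oldyear >= pryear - win & oldyear < pryear),
    .groups = "drop"
  )

print(partners)  # e.g. A in 1988 has 2 earlier partners (B and C)
print(repeats)   # e.g. A on p3 has 1 earlier collaboration (with B in 1986)
```

The pair table grows with the square of team size, so on large data it may be worth building it inside the database backend shown above, or restricting it to the years that can fall inside a window.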
