
R - How to vectorize with apply family function and avoid while/for loops in this case?

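For this case (more details can be found in this question: Calculate how many observations in the rest of the data meet multiple conditions? (R))

This is a dataset called event that contains thousands of events (observations); I selected a few rows to show you the data structure. It contains "STATEid", "date", and geographic coordinates in two variables, "LON" and "LAT". I am trying to compute a new variable (column) for each row. This new variable should be: "given any specific event, look through the rest of the dataset and count the number of events that occurred in the same state, within the next 30/60 days, and within a 50/100 km radius."

I wrote a user-defined function with a while loop. To keep things simple, I included only 2 conditions, within 30 days and in the same state: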
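The radius condition described above could be expressed with a great-circle distance. The following is only a minimal sketch: the helper name `haversine_km` and the mean Earth radius of 6371 km are assumptions, not part of the original post, and the coordinates are two of the points from the dput() output in this question.

```r
# Hypothetical helper (not from the original post): great-circle distance in km
# from one reference point to a vector of points, using the haversine formula.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding slightly above 1 before asin()
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Distance from the first Ohio point to itself and to the Michigan point
d <- haversine_km(-80.6495194, 41.0997803,
                  c(-80.6495194, -83.6129939),
                  c(41.0997803, 42.2411499))

# Logical vector that could be combined with the date and state conditions
within_100km <- d <= 100
```

A logical vector like `within_100km` could then be added with `&` to the date and state conditions used when summing.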

n = 1

f = function(i) {
  a = i[n,]
  b = a$date
  # c = a$LON
  # d = a$LAT
  e = a$STATEid
  f = a$RID
  g1 = sum(i$CASE  [i$date<= b+30 & i$date>b & i$STATEid==e], na.rm=T)
  # g2 = sum(i$viold [i$date<= b+30 & i$date>b], na.rm=T)
  # g3 = sum(i$CASE  [i$date<= b+60 & i$date>b], na.rm=T)
  # g4 = sum(i$viold [i$date<= b+60 & i$date>b], na.rm=T)
  # h = cbind(g1, g2, g3, g4)
  g1 = data.frame(g1)
  n = n+1
  assign(as.character(f), g1, envir = .GlobalEnv)
}

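for (n in 1:20) (f(event2))

With 23,000 cases, this took far too long. My PC with 16GB of RAM could not handle it, even when the loop only needed to run twice! So I think it is best to avoid the loop. Could you suggest a way to vectorize my code and avoid loops?

My main problem is that when I need to refer to each row and multiple conditions are involved, I do not know how to write the user-defined function. That is why I create objects such as "a", "b", "c", "d", "e" inside the loop function to refer to them properly... inefficient, I know...

My dput result looks like this: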

     > dput(tail(event2[,c("RID", "STATEid", "date", "LON", "LAT")]))
structure(list(RID = c("023610", "023611", "023613", "023614", 
"023615", "023616"), STATEid = structure(c(36L, 36L, 23L, 23L, 
5L, 14L), .Label = c("alabama", "alaska", "arizona", "arkansas", 
"california", "colorado", "connecticut", "delaware", "district of columbia", 
"florida", "georgia", "hawaii", "idaho", "illinois", "indiana", 
"iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", 
"massachusetts", "michigan", "minnesota", "mississippi", "missouri", 
"montana", "nebraska", "nevada", "new hampshire", "new jersey", 
"new mexico", "new york", "north carolina", "north dakota", "ohio", 
"oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", 
"south dakota", "tennessee", "texas", "utah", "vermont", "virginia", 
"washington", "west virginia", "wisconsin", "wyoming"), class = "factor"), 
    date = structure(c(3620, -633, 131, -315, 5421, 3558), class = "Date"), 
    LON = c(-80.6495194, -80.6495194, -83.6129939, -83.6129939, 
    -121.6169108, -87.8328505), LAT = c(41.0997803, 41.0997803, 
    42.2411499, 42.2411499, 39.1404477, 42.4461322)), .Names = c("RID", 
"STATEid", "date", "LON", "LAT"), row.names = c(23610L, 23611L, 
23613L, 23614L, 23615L, 23616L), class = "data.frame")
> 

Thank you very much. I appreciate your help.

Best,

---------- Update, January 20, 2018 ---------

I created a loop that works and correctly reflects what I expect:

g = event2[FALSE,]

USERFUN = function(i) {
  a = i[n,] # retrieve each row from the object, make it a data object
  b = a$date # get date
  # c = a$LON # for now I dropped the idea of calculating radius
  # d = a$LAT # for now I dropped the idea of calculating radius
  e = a$STATEid # get STATE
  f = a$RID # get case ID to name the objects generated!

  PostAct30 = sum(i$CASE [i$date<= b+30 & i$date>b & i$STATEid == e], na.rm=T) # multiple conditions defined here - i is the entire dataset 
  PostAct60 = sum(i$CASE [i$date<= b+60 & i$date>b & i$STATEid == e], na.rm=T) # multiple conditions defined here - b, e are dynamic, retrieving from each line!!!
  PreAct30 = sum(i$CASE [i$date<= b & i$date>b-30 & i$STATEid == e], na.rm=T)
  PreAct60 = sum(i$CASE [i$date<= b & i$date>b-60 & i$STATEid == e], na.rm=T) # 60-day window, to match the variable name
  PostVio30 = sum(i$viold [i$date<= b+30 & i$date>b & i$STATEid == e], na.rm=T)
  PostVio60 = sum(i$viold [i$date<= b+60 & i$date>b & i$STATEid == e], na.rm=T)
  PreVio30 = sum(i$viold [i$date<= b & i$date>b-30 & i$STATEid == e], na.rm=T)
  PreVio60 = sum(i$viold [i$date<= b & i$date>b-60 & i$STATEid == e], na.rm=T) # 60-day window, to match the variable name
  g1 = data.frame(f, PostAct30, PostAct60, PreAct30, PreAct60, PostVio30, PostVio60, PreVio30, PreVio60)
  n = n+1
  return(g1)
  }
# sum(event2$ca)
n = 1
for (n in 1:19446) {
  g2 = USERFUN(event2)
  g = rbind(g, g2)        
}

And the output looks like this:

> tail(event3 [c("date","STATEid", "PostAct30", "PostAct60", "PostVio30", "PostVio60")])
            date    STATEid PostAct30 PostAct60 PostVio30 PostVio60
23611 1968-04-08       ohio         3         4         0         0
23612       <NA>    arizona        NA        NA        NA        NA
23613 1970-05-12   michigan         4         6         2         4
23614 1969-02-20   michigan         2         3         1         1
23615 1984-11-04 california         4         5         0         0
23616 1979-09-29   illinois         0         2         0         1

Consider mapply to add the new columns in place, iterating elementwise through date and STATEid and passing each pair into a defined function. Specifically, mapply returns a matrix with one row per computed count (eight here) and one column per event, so transpose it with t() before assigning it to the eight new columns of event2:

dates_calc_fct <- function(b, e) 
  c(sum(event2$CASE [event2$date<= b+30 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b+60 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b & event2$date>b-30 & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b & event2$date>b-60 & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b+30 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b+60 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b & event2$date>b-30 & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b & event2$date>b-60 & event2$STATEid == e], na.rm=T)
   )

# t() turns the 8 x n result of mapply into n rows x 8 columns
event2[c("PostAct30", "PostAct60", 
         "PreAct30", "PreAct60",
         "PostVio30", "PostVio60", 
         "PreVio30", "PreVio60")] <- t(mapply(dates_calc_fct, event2$date, event2$STATEid))
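To see the shape of what mapply returns, here is a small self-contained sketch on toy data. The data frame `events`, its values, and the helper `count_after` are hypothetical and only illustrate the pattern; only two of the eight counts are computed.

```r
# Toy data (hypothetical values, not from the original post)
events <- data.frame(
  STATEid = c("ohio", "ohio", "ohio", "michigan", "michigan"),
  date    = as.Date(c("1970-01-01", "1970-01-15", "1970-02-20",
                      "1970-01-05", "1970-01-10")),
  CASE    = c(1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# For one event (date d, state s): sum CASE over same-state events
# dated in (d, d + 30] and in (d, d + 60]
count_after <- function(d, s) {
  same_state <- events$STATEid == s
  after <- events$date > d
  c(PostAct30 = sum(events$CASE[same_state & after & events$date <= d + 30],
                    na.rm = TRUE),
    PostAct60 = sum(events$CASE[same_state & after & events$date <= d + 60],
                    na.rm = TRUE))
}

# mapply() yields a 2 x 5 matrix (one column per event);
# t() makes it 5 x 2 (one row per event)
counts <- t(mapply(count_after, events$date, events$STATEid))

# Add the two columns, named after the elements returned by count_after()
events[colnames(counts)] <- counts
```

For the first Ohio event (1970-01-01), this gives PostAct30 = 1 and PostAct60 = 2. Note that this still scans the whole data frame once per row, so it removes the explicit loop and the repeated rbind but is not vectorized in the strict sense.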

