
R - How to vectorize with apply family function and avoid while/for loops in this case?

For this case (more details can be found in this question: "Count how many observations in the rest of the data fit multiple conditions? (R)"):

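I have a dataset called event that contains thousands of events (observations); I have selected a few rows to show you the data structure. It contains "STATEid", "date", and geographic coordinates in two variables, "LON" and "LAT". I am writing code to compute a new variable (column) for each row. This new variable should be: "given any specific event, look at the rest of the dataset and count the number of events that occurred in the same state, within the next 30/60 days, and within a 50/100 km radius."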

I wrote a user-defined function driven by a loop. To keep things simple, I included only two conditions, within 30 days and in the same state:

n = 1

f = function(i) {
  a = i[n,]
  b = a$date
  # c = a$LON
  # d = a$LAT
  e = a$STATEid
  f = a$RID
  g1 = sum(i$CASE  [i$date<= b+30 & i$date>b & i$STATEid==e], na.rm=T)
  # g2 = sum(i$viold [i$date<= b+30 & i$date>b], na.rm=T)
  # g3 = sum(i$CASE  [i$date<= b+60 & i$date>b], na.rm=T)
  # g4 = sum(i$viold [i$date<= b+60 & i$date>b], na.rm=T)
  # h = cbind(g1, g2, g3, g4)
  g1 = data.frame(g1)
  n = n+1
  assign(as.character(f), g1, envir = .GlobalEnv)
}

for (n in 1:20) (f(event2))

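With 23,000 cases, this takes far too long. My PC with 16 GB of RAM could not handle it even when the loop only needed to run twice! So I think it is best to avoid the loop. Could you suggest a way to vectorize my code and avoid loops?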

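My main problem is that I do not know how to write a user-defined function when I need to refer to each row and apply multiple conditions. That is why I create objects such as "a", "b", "c", "d", "e" inside the loop function, so that I can refer to them properly. It is inefficient, I know.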

My dput output looks like this:

     > dput(tail(event2[,c("RID", "STATEid", "date", "LON", "LAT")]))
structure(list(RID = c("023610", "023611", "023613", "023614", 
"023615", "023616"), STATEid = structure(c(36L, 36L, 23L, 23L, 
5L, 14L), .Label = c("alabama", "alaska", "arizona", "arkansas", 
"california", "colorado", "connecticut", "delaware", "district of columbia", 
"florida", "georgia", "hawaii", "idaho", "illinois", "indiana", 
"iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", 
"massachusetts", "michigan", "minnesota", "mississippi", "missouri", 
"montana", "nebraska", "nevada", "new hampshire", "new jersey", 
"new mexico", "new york", "north carolina", "north dakota", "ohio", 
"oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", 
"south dakota", "tennessee", "texas", "utah", "vermont", "virginia", 
"washington", "west virginia", "wisconsin", "wyoming"), class = "factor"), 
    date = structure(c(3620, -633, 131, -315, 5421, 3558), class = "Date"), 
    LON = c(-80.6495194, -80.6495194, -83.6129939, -83.6129939, 
    -121.6169108, -87.8328505), LAT = c(41.0997803, 41.0997803, 
    42.2411499, 42.2411499, 39.1404477, 42.4461322)), .Names = c("RID", 
"STATEid", "date", "LON", "LAT"), row.names = c(23610L, 23611L, 
23613L, 23614L, 23615L, 23616L), class = "data.frame")
> 

Thank you very much. I appreciate your help.

Best,

---------- Update, January 20, 2018 ---------

I created a loop that works and correctly reflects what I want:

g = event2[FALSE,]

USERFUN = function(i) {
  a = i[n,] # retrieve each row from the object, make it a data object
  b = a$date # get date
  # c = a$LON # for now I dropped the idea of calculating radius
  # d = a$LAT # for now I dropped the idea of calculating radius
  e = a$STATEid # get STATE
  f = a$RID # get case ID to name the objects generated!

  PostAct30 = sum(i$CASE [i$date<= b+30 & i$date>b & i$STATEid == e], na.rm=T) # multiple conditions defined here - i is the entire dataset 
  PostAct60 = sum(i$CASE [i$date<= b+60 & i$date>b & i$STATEid == e], na.rm=T) # multiple conditions defined here - b, e are dynamic, retrieving from each line!!!
  PreAct30 = sum(i$CASE [i$date<= b & i$date>b-30 & i$STATEid == e], na.rm=T)
  PreAct60 = sum(i$CASE [i$date<= b & i$date>b-60 & i$STATEid == e], na.rm=T) # 60-day window, to match the variable name
  PostVio30 = sum(i$viold [i$date<= b+30 & i$date>b & i$STATEid == e], na.rm=T)
  PostVio60 = sum(i$viold [i$date<= b+60 & i$date>b & i$STATEid == e], na.rm=T)
  PreVio30 = sum(i$viold [i$date<= b & i$date>b-30 & i$STATEid == e], na.rm=T)
  PreVio60 = sum(i$viold [i$date<= b & i$date>b-60 & i$STATEid == e], na.rm=T) # 60-day window, to match the variable name
  g1 = data.frame(f, PostAct30, PostAct60, PreAct30, PreAct60, PostVio30, PostVio60, PreVio30, PreVio60)
  n = n+1
  return(g1)
  }
# sum(event2$ca)
n = 1
for (n in 1:19446) {
  g2 = USERFUN(event2)
  g = rbind(g, g2)        
}

And the output looks like this:

> tail(event3 [c("date","STATEid", "PostAct30", "PostAct60", "PostVio30", "PostVio60")])
            date    STATEid PostAct30 PostAct60 PostVio30 PostVio60
23611 1968-04-08       ohio         3         4         0         0
23612       <NA>    arizona        NA        NA        NA        NA
23613 1970-05-12   michigan         4         6         2         4
23614 1969-02-20   michigan         2         3         1         1
23615 1984-11-04 california         4         5         0         0
23616 1979-09-29   illinois         0         2         0         1

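Consider mapply, which iterates element-wise over date and STATEid and passes each pair to a defined function, so the new columns can be added in place. Specifically, mapply returns a matrix holding the eight computed values for each event, which you then assign to event2.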
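To make the shape of that result concrete, here is a minimal sketch on made-up data. The `toy` data frame, the `count_after` helper, and the `Post30`/`Post60` names are illustrative assumptions, not part of the original dataset. It suggests that when the function returns a vector, mapply places each event's results in a column, so a transpose gives one row per event.

```r
# Toy stand-in for event2; the values are invented for illustration only.
toy <- data.frame(
  date    = as.Date(c("2018-01-01", "2018-01-10", "2018-02-20", "2018-01-05")),
  STATEid = c("ohio", "ohio", "ohio", "michigan"),
  CASE    = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count same-state events in the 30 and 60 days after b.
count_after <- function(b, e) {
  same_state <- toy$STATEid == e
  c(Post30 = sum(toy$CASE[same_state & toy$date > b & toy$date <= b + 30], na.rm = TRUE),
    Post60 = sum(toy$CASE[same_state & toy$date > b & toy$date <= b + 60], na.rm = TRUE))
}

# One call per (date, state) pair; the results are simplified to a matrix.
res <- mapply(count_after, toy$date, toy$STATEid)
dim(res)              # statistics in rows, events in columns

# Transpose so that each row lines up with a row of the data frame.
per_event <- t(res)
toy[c("Post30", "Post60")] <- per_event
```

The transpose matters when assigning back into the data frame: without it, the values would be refilled column by column and would no longer line up with the events they belong to.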

dates_calc_fct <- function(b, e) 
  c(sum(event2$CASE [event2$date<= b+30 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b+60 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b & event2$date>b-30 & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b & event2$date>b-60 & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b+30 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b+60 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b & event2$date>b-30 & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b & event2$date>b-60 & event2$STATEid == e], na.rm=T)
   )

# mapply returns one column per event (8 x n), so transpose to get one row per event
event2[c("PostAct30", "PostAct60", 
         "PreAct30", "PreAct60",
         "PostVio30", "PostVio60", 
         "PreVio30", "PreVio60")] <- t(mapply(dates_calc_fct, event2$date, event2$STATEid))

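The question also mentions a 50/100 km radius condition, which the code above leaves out. One possible way to add it, as a rough sketch only: compute a great-circle distance from LON/LAT with the haversine formula and include it as one more condition. The `haversine_km` and `count_nearby` helpers, the 6371 km mean Earth radius, and the `toy` rows are all assumptions for illustration; the first two coordinate pairs are taken from the dput output, and the third point is made up.

```r
# Haversine great-circle distance in km from one point to a vector of points.
# radius = 6371 is an assumed mean Earth radius in km.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Made-up events: two at the same Michigan location, one roughly 300 km north.
toy <- data.frame(
  date    = as.Date(c("2018-01-01", "2018-01-10", "2018-01-20")),
  STATEid = c("michigan", "michigan", "michigan"),
  LON     = c(-83.6129939, -83.6129939, -85.0),
  LAT     = c(42.2411499, 42.2411499, 45.0),
  CASE    = 1,
  stringsAsFactors = FALSE
)

# Hypothetical helper: same state, within `days` after b, within `km` of (lon, lat).
count_nearby <- function(b, e, lon, lat, days = 30, km = 50) {
  hit <- toy$STATEid == e &
    toy$date > b & toy$date <= b + days &
    haversine_km(lon, lat, toy$LON, toy$LAT) <= km
  sum(toy$CASE[hit], na.rm = TRUE)
}

toy$Post30_50km <- mapply(count_nearby, toy$date, toy$STATEid, toy$LON, toy$LAT)
```

Each call still scans the whole data frame, so this keeps the same overall cost as the mapply approach above; it only shows where a distance condition would slot in.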
