Is there an R loop function (data.table) that can run 100+ `gam` fits without exceeding the memory limit?

Question

Spatial Interpolation using `gam`

Statement

I would like to produce many spatial interpolation outputs using generalised additive models (GAM). Predicting a single pollution map is no problem, but I need more than 100 maps. If possible, I would like to automate the process and get the results without exceeding the memory limit.

Spatial Interpolation process with GAM (mgcv package)

For context, here are the essential steps to get an interpolated map.

1. Get the X, Y coordinates of the pollution monitoring stations.
2. Get the pollution data for each station.
3. Add the pollution data to the data frame that contains the X, Y coordinates.
4. Run `gam(pollution ~ s(X,Y, k=20))` for each pollution column.
5. Create an empty data frame spanning the min and max X, Y coordinates as the spatial extent.
6. Predict over the spatial extent using `predict` and the `gam` result.
7. Repeat the same job for all pollution fields.

I will show a hands-on example of how I approached it.

Sample data

As an example, I created the dataset shown below. As `df` shows, I have X, Y, and 3 pollution variables.

library(data.table)
library(mgcv)

X <- c(197745.8,200443.8,200427.6,208213.4,203691.1,208303.0,202546.4,202407.9,202564.8,194095.5,194508.0,195183.8,185432.5,
       190249.0,190927.0,197490.1,193551.5,204204.4,199508.4,210201.4,212088.3,191886.5,201045.2,187321.7,205987.0)
Y <- c(451633.1,452496.8,448949.5,449753.3,449282.2,453928.5,452923.2,456347.9,461614.8,456729.3,453019.7,450039.7,449472.0,
       444348.1,447274.4,442390.0,443101.2,446446.5,445008.5,446765.2,449508.5,439225.3,460915.6,447392.0,461985.3)
poll1 <- c(34,29,29,33,33,38,35,30,41,43,35,34,41,41,40,36,35,27,53,40,37,32,28,36,33)
poll2 <- c(27,27,34,30,38,36,36,35,37,39,35,33,41,42,40,34,38,31,43,46,38,32,29,33,34)
poll3 <- c(26,30,27,30,37,41,36,36,35,35,35,33,41,36,38,35,34,24,40,43,36,33,30,32,36)

df <- data.table(X, Y, poll1, poll2, poll3)

How I worked on it

1. Hard code

If you look at the code below, you will see that I copied and pasted the same job for every variable. This would be extremely hard to do for a large number of variables.

# Run gam
gam1 <- gam(poll1 ~ s(X,Y, k=20), data = df)
gam2 <- gam(poll2 ~ s(X,Y, k=20), data = df)
gam3 <- gam(poll3 ~ s(X,Y, k=20), data = df)
         # ... there are over 5000 variables that need looping


# Create an empty surface for prediction
GAM_poll <- data.frame(expand.grid(X = seq(min(df$X), max(df$X), length=200),
                                   Y = seq(min(df$Y), max(df$Y), length=200)))


# Predict gam results to the empty surface
GAM_poll$gam1 <- predict(gam1, GAM_poll, type = "response")
GAM_poll$gam2 <- predict(gam2, GAM_poll, type = "response")
GAM_poll$gam3 <- predict(gam3, GAM_poll, type = "response")

2. Using a for loop

Instead, I made a list and attempted to loop over all the variables to get the results. The code itself works, but iterating over many variables takes up all the memory (this is what I experienced).

# Run gam using list and for loop
myList <- list()

for(i in 3:length(df)){
  myList[[i-2]] <- gam(df[[i]] ~ s(X,Y, k=20), data = df)
}


# Create an empty surface for prediction
GAM_poll <- data.frame(expand.grid(X = seq(min(df$X), max(df$X), length=200),
                                   Y = seq(min(df$Y), max(df$Y), length=200)))


# Predict gam results to the empty surface
myResult <- list()

for(j in seq_along(myList)){
  myResult[[j]] <- predict(myList[[j]], GAM_poll, type = "response")
}
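
One likely contributor to the memory growth is that `myList` keeps every fitted model alive while `myResult` also accumulates every prediction vector. A possible variation fits, predicts, and then drops each model inside a single loop, so that only the numeric predictions are retained. The following is a minimal sketch under stated assumptions: the station data are synthetic, the grid is a coarser 50 x 50, and the names `stations`, `grid`, `pollutants`, and `pred` are illustrative rather than taken from the code above.

```r
library(mgcv)

# Synthetic stand-in for the monitoring data (hypothetical values, not the
# dataset from the question): 25 stations, 3 pollutant columns.
set.seed(1)
stations <- data.frame(X = runif(25, 185000, 212000),
                       Y = runif(25, 439000, 462000))
for (p in paste0("poll", 1:3)) stations[[p]] <- rnorm(25, mean = 35, sd = 5)

# Prediction grid, built once and reused for every pollutant.
grid <- expand.grid(X = seq(min(stations$X), max(stations$X), length.out = 50),
                    Y = seq(min(stations$Y), max(stations$Y), length.out = 50))

pollutants <- setdiff(names(stations), c("X", "Y"))

# Preallocate one column per pollutant; only numbers are stored, no models.
pred <- matrix(NA_real_, nrow = nrow(grid), ncol = length(pollutants),
               dimnames = list(NULL, pollutants))

for (p in pollutants) {
  # reformulate() builds e.g. poll1 ~ s(X, Y, k = 20) from the column name.
  fit <- gam(reformulate("s(X, Y, k = 20)", response = p), data = stations)
  pred[, p] <- as.numeric(predict(fit, newdata = grid, type = "response"))
  rm(fit)  # drop the model before the next iteration
}
```

Even without the models, the predictions themselves can be large: a 200 x 200 grid has 40,000 cells, so if there really are about 5000 variables, a numeric matrix of that shape needs roughly 40,000 x 5000 x 8 bytes, or about 1.6 GB. In that case, writing each column to disk as it is produced may still be necessary.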

Asking for help

Is there a better way to get the `gam` results over multiple variables?
Is there a way to avoid exceeding the memory limit during the run?

Can you help me, data.table and purrr users?

Answer 1

The solution I created keeps only the latest prediction in memory and saves each one to disk before overwriting it with the next. The files are named after the model's column name and placed in a folder called results. I also melted the data.table, mostly because I think the code is a little clearer that way.

library(data.table)
library(mgcv)

X <- c(197745.8,200443.8,200427.6,208213.4,203691.1,208303.0,202546.4,202407.9,202564.8,194095.5,194508.0,195183.8,185432.5,
       190249.0,190927.0,197490.1,193551.5,204204.4,199508.4,210201.4,212088.3,191886.5,201045.2,187321.7,205987.0)
Y <- c(451633.1,452496.8,448949.5,449753.3,449282.2,453928.5,452923.2,456347.9,461614.8,456729.3,453019.7,450039.7,449472.0,
       444348.1,447274.4,442390.0,443101.2,446446.5,445008.5,446765.2,449508.5,439225.3,460915.6,447392.0,461985.3)
poll1 <- c(34,29,29,33,33,38,35,30,41,43,35,34,41,41,40,36,35,27,53,40,37,32,28,36,33)
poll2 <- c(27,27,34,30,38,36,36,35,37,39,35,33,41,42,40,34,38,31,43,46,38,32,29,33,34)
poll3 <- c(26,30,27,30,37,41,36,36,35,35,35,33,41,36,38,35,34,24,40,43,36,33,30,32,36)

df <- data.table(X, Y, poll1, poll2, poll3)


# melt the data.table
df <- melt.data.table(df, id.vars = c('X', 'Y'))

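dir.create('results', showWarnings = FALSE)

# Build the prediction grid once; it is the same for every variable
GAM_poll <- data.table(expand.grid(X = seq(min(df$X), max(df$X), length=200),
                                   Y = seq(min(df$Y), max(df$Y), length=200)))

for(i in unique(df$variable)){

  # Fit one model at a time so earlier fits are not kept in memory
  fit <- gam(value ~ s(X,Y, k=20), data = df[variable == i])

  GAM_poll[, prediction := as.numeric(predict(fit, GAM_poll, type = "response"))]

  # paste0() avoids the spaces that paste() would insert into the file name
  write.csv(GAM_poll$prediction, paste0('results/model_', i, '.csv'), row.names = FALSE)

}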
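
Once the predictions are on disk, they can be processed one file at a time, so that no more than one map is loaded at once. The sketch below is only an illustration of that idea: it writes two tiny hypothetical files to a temporary folder (the folder name `results_demo`, the file names, and the values are all made up for the example) and then reads them back to compute a per-map mean.

```r
# Hypothetical stand-in for the results folder: two small prediction files
# written to a temporary directory so the sketch runs on its own.
out_dir <- file.path(tempdir(), "results_demo")
dir.create(out_dir, showWarnings = FALSE)
write.csv(c(30, 35, 40), file.path(out_dir, "model_poll1.csv"), row.names = FALSE)
write.csv(c(20, 25, 30), file.path(out_dir, "model_poll2.csv"), row.names = FALSE)

files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)

# Read one file at a time and keep only a small summary per map.
map_means <- vapply(files, function(f) mean(read.csv(f)[[1]]), numeric(1))
names(map_means) <- tools::file_path_sans_ext(basename(files))
```

Because `write.csv` stores a bare vector as a single column, `read.csv(f)[[1]]` recovers the prediction vector regardless of the column header.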

Answer 1: 0 votes, accepted, 2020-02-20 19:01:03
