
Save non-SparkDataFrame from Azure Databricks to local computer as .RData

In Databricks (SparkR), I run the batch algorithm of the self-organizing map in parallel from the kohonen package, as it gives me considerable reductions in computation time compared to my local machine. However, after fitting the model I would like to download/export the trained model (a list) to my local machine to continue working with the results (creating plots etc.) in a way that is not available in Databricks. I know how to save & download a SparkDataFrame to CSV:

sdftest # a SparkDataFrame
write.df(sdftest, path = "dbfs:/FileStore/test.csv", source = "csv", mode = "overwrite")

However, I am not sure how to do this for a 'regular' R list object.

Is there any way to save the output created in Databricks to my local machine in .RData format? If not, is there a workaround that would still allow me to continue working with the model results locally?

EDIT:

library(kohonen)

# Load data
sdf.cluster <- read.df("abfss://cluster.csv", source = "csv", header="true", inferSchema = "true")

# Collect SDF to a local R data.frame as kohonen::som is not available for SparkDataFrames
rdf.cluster <- SparkR::collect(sdf.cluster)

# Change rdf to matrix as is required by kohonen::som
rdf.som <- as.matrix(rdf.cluster)
  
# Parallel Batch SOM from Kohonen
som.grid <- somgrid(xdim = 5, ydim = 5, topo="hexagonal", 
                neighbourhood.fct="gaussian") 
set.seed(1)
som.model <- som(rdf.som, grid = som.grid, rlen = 10, alpha = c(0.05, 0.01),
                 keep.data = TRUE, dist.fcts = "euclidean", mode = "online")

Any help is very much appreciated!

If all your models can fit into the driver's memory, you can use spark.lapply. It is a distributed version of base R's lapply which requires a function and a list. Spark will apply the function to each element of the list (like a map) and collect the returned objects.

Here is an example of fitting kohonen models, one for each iris species:

library(SparkR)
library(kohonen)

fit_model <- function(df) {
  # Load kohonen inside the function so the package is also attached on the Spark workers
  library(kohonen)
  # Heuristic grid size derived from the number of observations
  grid_size <- ceiling(nrow(df) ^ (1/2.5))
  som_grid <- somgrid(xdim = grid_size, ydim = grid_size, topo = 'hexagonal', toroidal = TRUE)
  som_model <- som(data.matrix(df), grid = som_grid)
  som_model
}

models <- spark.lapply(split(iris[-5], iris$Species), fit_model)
models 

The models variable contains a list of kohonen models fitted in parallel:

$setosa
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.

$versicolor
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.

$virginica
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.

Then you can save/serialise the R object as usual:

saveRDS(models, file="/dbfs/kohonen_models.rds")
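If you specifically want the .RData format mentioned in the question, base R's save() works the same way; a minimal sketch (the file name here is just illustrative):

# Alternative: write an .RData file instead of .rds (illustrative file name)
save(models, file = "/dbfs/kohonen_models.RData")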

Note that any file stored into the /dbfs/ path will be available through Databricks' DBFS, accessible with the CLI or API.
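As a minimal sketch, assuming the Databricks CLI is configured on your machine, you could copy the file down with databricks fs cp dbfs:/kohonen_models.rds ./kohonen_models.rds and then continue in a local R session:

# Back on the local machine, after downloading the file via the CLI
models <- readRDS("kohonen_models.rds")
plot(models$setosa)  # e.g. inspect one of the fitted SOMs with kohonen's plot method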
