Save non-SparkDataFrame from Azure Databricks to local computer as .RData
In Databricks (SparkR), I run the batch algorithm of the self-organizing map from the kohonen package in parallel, as it gives me considerable reductions in computation time compared to my local machine. However, after fitting the model I would like to download/export the trained model (a list) to my local machine to continue working with the results (create plots etc.) in a way that is not available in Databricks. I know how to save & download a SparkDataFrame to csv:
sdftest # a SparkDataFrame
write.df(sdftest, path = "dbfs:/FileStore/test.csv", source = "csv", mode = "overwrite")
However, I am not sure how to do this for a 'regular' R list object.

Is there any way to save the output created in Databricks to my local machine in .RData format? If not, is there a workaround that would still allow me to continue working with the model results locally?
EDIT:
library(kohonen)
# Load data
sdf.cluster <- read.df("abfss://cluster.csv", source = "csv", header="true", inferSchema = "true")
# Collect SDF to RDF as kohonen::som is not available for SparkDataFrames
rdf.cluster <- SparkR::collect(sdf.cluster)
# Change rdf to matrix as is required by kohonen::som
rdf.som <- as.matrix(rdf.cluster)
# Parallel Batch SOM from Kohonen
som.grid <- somgrid(xdim = 5, ydim = 5, topo="hexagonal",
neighbourhood.fct="gaussian")
set.seed(1)
som.model <- som(rdf.som, grid=som.grid, rlen=10, alpha=c(0.05,0.01), keep.data = TRUE, dist.fcts = "euclidean", mode = "online")
Any help is very much appreciated!
If all your models can fit into the driver's memory, you can use spark.lapply. It is a distributed version of base lapply which requires a function and a list. Spark will apply the function to each element of the list (like a map) and collect the returned objects.
Here is an example of fitting kohonen models, one for each iris species:
library(SparkR)
library(kohonen)
fit_model <- function(df) {
library(kohonen)
grid_size <- ceiling(nrow(df) ^ (1/2.5))
  som_grid <- somgrid(xdim = grid_size, ydim = grid_size, topo = 'hexagonal', toroidal = TRUE)
som_model <- som(data.matrix(df), grid = som_grid)
som_model
}
models <- spark.lapply(split(iris[-5], iris$Species), fit_model)
models
The models variable contains a list of kohonen models fitted in parallel:
$setosa
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
$versicolor
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
$virginica
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.
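Each element of that list is an ordinary kohonen object, so you can inspect it on the driver just as you would locally before exporting anything (a sketch; summary() and plot() are the standard kohonen methods):

```r
library(kohonen)

# Inspect one of the fitted models on the driver
summary(models$setosa)

# Visualise the codebook vectors of the setosa model, for example
plot(models$setosa, type = "codes")
```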
Then you can save/serialise the R object as usual:
saveRDS(models, file="/dbfs/kohonen_models.rds")
Note that any file stored under the /dbfs/ path will be available through Databricks' DBFS, accessible with the CLI or API.
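Once the file is on DBFS, one way to pull it down is the Databricks CLI (assuming it is installed and configured; the local path is just an example), after which the model list can be loaded into a local R session with readRDS():

```r
# On the local machine, after downloading the file, e.g. with:
#   databricks fs cp dbfs:/kohonen_models.rds ./kohonen_models.rds

models <- readRDS("kohonen_models.rds")

# Continue working with the results locally, e.g. plotting
library(kohonen)
plot(models$versicolor, type = "changes")
```

The models deserialise exactly as they were on the cluster, so all kohonen plotting and analysis functions work locally without Spark.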