
Save non-SparkDataFrame from Azure Databricks to local computer as .RData

In Databricks (SparkR), I run the batch algorithm of the self-organizing map in parallel using the kohonen package, as it gives me considerable reductions in computation time compared to my local machine. However, after fitting the model I would like to download/export the trained model (a list) to my local machine so I can continue working with the results (create plots, etc.) in ways that are not available in Databricks. I know how to save and download a SparkDataFrame to CSV:

sdftest # a SparkDataFrame
write.df(sdftest, path = "dbfs:/FileStore/test.csv", source = "csv", mode = "overwrite")

However, I am not sure how to do this for a 'regular' R list object.

Is there any way to save the output created in Databricks to my local machine in .RData format? If not, is there a workaround that would still allow me to continue working with the model results locally?

EDIT:

library(kohonen)

# Load data
sdf.cluster <- read.df("abfss://cluster.csv", source = "csv",
                       header = "true", inferSchema = "true")

# Collect the SparkDataFrame into a regular R data frame, as kohonen::som
# does not work on SparkDataFrames
rdf.cluster <- SparkR::collect(sdf.cluster)

# Convert the data frame to a matrix, as required by kohonen::som
rdf.som <- as.matrix(rdf.cluster)

# Parallel batch SOM from kohonen: mode = "pbatch" runs the batch
# algorithm in parallel (the alpha learning rate applies only to the
# online algorithm, so it is dropped here)
som.grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal",
                    neighbourhood.fct = "gaussian")
set.seed(1)
som.model <- som(rdf.som, grid = som.grid, rlen = 10, keep.data = TRUE,
                 dist.fcts = "euclidean", mode = "pbatch")

Any help is very much appreciated!

If all your models can fit into the driver's memory, you can use spark.lapply. It is a distributed version of base R's lapply: given a list and a function, Spark applies the function to each element of the list (like a map) and collects the returned objects on the driver.

Here is an example of fitting kohonen models, one for each iris species:

library(SparkR)
library(kohonen)

fit_model <- function(df) {
  library(kohonen)  # the package must be loaded on each worker
  # Heuristic: grid dimension grows with the number of observations
  grid_size <- ceiling(nrow(df) ^ (1 / 2.5))
  som_grid <- somgrid(xdim = grid_size, ydim = grid_size,
                      topo = "hexagonal", toroidal = TRUE)
  som_model <- som(data.matrix(df), grid = som_grid)
  som_model
}

# Fit one SOM per species, in parallel across the cluster
models <- spark.lapply(split(iris[-5], iris$Species), fit_model)
models

The models variable contains a list of kohonen models fitted in parallel:

$setosa
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.

$versicolor
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.

$virginica
SOM of size 5x5 with a hexagonal toroidal topology.
Training data included.

Then you can save/serialise the R object as usual:

saveRDS(models, file="/dbfs/kohonen_models.rds")
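
Since the question asks about the .RData format specifically, base R's save()/load() works the same way with the /dbfs/ path. A minimal sketch (the file name is illustrative):

# .RData alternative: save() stores one or more named objects, which
# load() later restores under their original names
save(models, file = "/dbfs/kohonen_models.RData")
# later, e.g. on your local machine: load("kohonen_models.RData")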

Note that any file stored under the /dbfs/ path will be available through Databricks' DBFS (Databricks File System), accessible with the CLI or the API.
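
For example, to pull the file down to your local machine you can use the Databricks CLI's fs cp command, after which the object can be read back into a local R session. A sketch, assuming a configured CLI and placeholder local paths:

# In a local shell, copy the file out of DBFS with the Databricks CLI:
#   databricks fs cp dbfs:/kohonen_models.rds ./kohonen_models.rds

# Then, in a local R session:
library(kohonen)  # load the package so the restored models keep their methods
models <- readRDS("kohonen_models.rds")
plot(models$setosa, type = "codes")  # continue working locally, e.g. plotting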
