
Create a Spark DataFrame from a Pandas DataFrame with Nested Python Dictionaries and NumPy Arrays

I have a pandas DataFrame whose cells contain both numpy arrays and dictionaries:

results_df.head(1)

best_params                                    cv_results                                
{'max_depth': 3, 'min_impurity_decrease': 0.2} {'mean_fit_time': [0.6320801575978597, 1.08473]} 

I would like to create a Spark DataFrame containing similar nested structures (they can be Spark objects if needed). I tried:

spark.createDataFrame(results_df)
TypeError: not supported type: <class 'numpy.ndarray'>

One solution is to use Koalas, a Databricks-backed package that implements the pandas API on top of Spark; its performance is also quite good. (In Spark 3.2+ the project has been merged into PySpark itself as pyspark.pandas.) For more info on Koalas: https://koalas.readthedocs.io/en/latest/

import databricks.koalas as ks

koalas_df = ks.from_pandas(results_df)  # pandas-like DataFrame backed by Spark
spark_df = koalas_df.to_spark()         # plain Spark DataFrame, if you need one

It's as simple as that with Koalas!
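
If you would rather stay in plain PySpark, a minimal workaround sketch for the numpy.ndarray TypeError is to serialize each nested cell to a JSON string before calling createDataFrame (the NumpyEncoder helper below is illustrative, not part of any library):

import json

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class NumpyEncoder(json.JSONEncoder):
    # Serialize numpy arrays and scalars as plain JSON values.
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.generic):  # np.int64, np.float64, ...
            return obj.item()
        return super().default(obj)

# Turn every nested cell into a JSON string; Spark ingests plain
# string columns without any type-inference trouble.
flat_df = results_df.applymap(lambda v: json.dumps(v, cls=NumpyEncoder))
spark_df = spark.createDataFrame(flat_df)

If you need genuinely nested columns on the Spark side, the JSON strings can then be parsed back with pyspark.sql.functions.from_json and an explicit schema.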
