
How can I convert an empty pandas dataframe to a Pyspark dataframe?

I'd like a safe way to convert a pandas dataframe to a pyspark dataframe which can handle cases where the pandas dataframe is empty (let's say after some filter has been applied). For example, the following will fail:

Assuming you have an active Spark session:

import pandas as pd
raw_data = []
cols = ['col_1', 'col_2', 'col_3']
types_dict = {
    'col_1': str,
    'col_2': float,
    'col_3': bool
}
pandas_df = pd.DataFrame(raw_data, columns=cols).astype(types_dict)
spark_df = spark.createDataFrame(pandas_df)

Resulting error: ValueError: can not infer schema from empty dataset

One option is to build a function which could iterate through the pandas dtypes and construct a Pyspark dataframe schema, but that could get a little complicated with structs and whatnot. Is there a simpler solution?
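For concreteness, here is a minimal sketch of that dtype-mapping idea. The helper name build_schema and its dtype table are made up for illustration, and it only covers the flat scalar dtypes in this example; structs, integers, and datetimes would need extra cases.

from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, BooleanType
)

# Hypothetical helper: translate simple pandas dtypes into a Spark schema.
def build_schema(df):
    dtype_map = {
        'object': StringType(),
        'float64': DoubleType(),
        'bool': BooleanType(),
    }
    return StructType([
        StructField(name, dtype_map[str(dtype)], nullable=True)
        for name, dtype in df.dtypes.items()
    ])

# With an explicit schema, Spark no longer has to infer types from zero rows:
spark_df = spark.createDataFrame(pandas_df, schema=build_schema(pandas_df))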

How can I convert an empty pandas dataframe to a Pyspark dataframe and maintain the column datatypes?

If I understand your problem correctly, try something with a try-except block:

def test(df):
    try:
        # Whatever operations you want on your df,
        # e.g. a filter that may leave it empty.
        ...
    except Exception:
        # Fall back to an empty frame with the intended dtypes.
        df = pd.DataFrame({
            'col_1': pd.Series(dtype='str'),
            'col_2': pd.Series(dtype='float'),
            'col_3': pd.Series(dtype='bool'),
        })
    return df
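Note that this fallback only preserves the pandas dtypes; to get the empty frame into Spark you still need an explicit schema, since inference fails on zero rows. A minimal usage sketch, assuming an active spark session and the test() function above:

from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, BooleanType
)

# Schema matching the fallback frame's dtypes; without it,
# createDataFrame would again raise on an empty dataset.
schema = StructType([
    StructField('col_1', StringType(), True),
    StructField('col_2', DoubleType(), True),
    StructField('col_3', BooleanType(), True),
])

spark_df = spark.createDataFrame(test(pandas_df), schema=schema)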
