
How to change the data type of the columns of a PySpark dataframe?

I have a dataframe where all the columns are of type string and I need them to be of type double. I have code that does this:

from pyspark.sql.types import DoubleType

df_Double = df.select([df[c].cast(DoubleType()).alias(c) for c in df.columns])

The problem is that when I save this new dataframe as a CSV

df_Double.drop("_c0").toPandas().to_csv("all_Double.csv", header=True)

and I read it again

df_Double = spark.read \
    .format("csv") \
    .option("inferSchema",True) \
    .option("header", True) \
    .load("all_Double.csv")

and show its schema

df_Double.printSchema()

all the columns are of type string, just like the original dataframe. How can I make the change persist when the dataframe is saved, so that I don't have to convert the data types every time I read it back?

You can pass df_Double.schema when you read the CSV file back:

df_Double_load = spark.read \
    .format("csv") \
    .schema(df_Double.schema) \
    .option("header", True) \
    .load("all_Double.csv")

You shouldn't use the 'inferSchema' option here: it tells Spark to guess the schema from the data instead of using the one you supply, so leave it out when you pass an explicit schema.
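
If the original df_Double is no longer in scope when you read the file back, you can also build the schema explicitly. Below is a minimal sketch; the column names col1, col2 and col3 are hypothetical placeholders for the actual columns in all_Double.csv:

from pyspark.sql.types import StructType, StructField, DoubleType

# Hypothetical column names; replace them with the real columns of all_Double.csv
double_schema = StructType([
    StructField(name, DoubleType(), True) for name in ["col1", "col2", "col3"]
])

df_Double_load = spark.read \
    .format("csv") \
    .schema(double_schema) \
    .option("header", True) \
    .load("all_Double.csv")

df_Double_load.printSchema()  # every column should now be reported as double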
