
How to change the data type of the columns of a PySpark dataframe?

I have a dataframe where all the columns are of type string and I need them to be of type double. I have code that does this:

from pyspark.sql.types import DoubleType

df_Double = df.select([df[c].cast(DoubleType()).alias(c) for c in df.columns])

The problem is that when I save this new dataframe as a CSV

df_Double.drop("_c0").toPandas().to_csv("all_Double.csv", header=True)

and I read it again

df_Double = spark.read \
    .format("csv") \
    .option("inferSchema",True) \
    .option("header", True) \
    .load("all_Double.csv")

and show its schema

df_Double.printSchema()

all the columns are of type string, just like the original dataframe. How can I make the change persist when the dataframe is saved, so that I don't have to convert the data types every time I read it back?

You can pass df_Double.schema when you read the CSV file back:

df_Double_load = spark.read \
    .format("csv") \
    .schema(df_Double.schema) \
    .option("header", True) \
    .load("all_Double.csv")

You shouldn't use the 'inferSchema' option here: it tells Spark to guess the schema from the data instead of using the one you supply, so leave it out when you pass an explicit schema.
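
If the original df_Double is no longer in scope when you read the file back, you can also build the schema explicitly. Below is a minimal sketch; the column names col1, col2 and col3 are hypothetical placeholders for the actual columns in all_Double.csv:

from pyspark.sql.types import StructType, StructField, DoubleType

# Hypothetical column names; replace them with the real columns of all_Double.csv
double_schema = StructType([
    StructField(name, DoubleType(), True) for name in ["col1", "col2", "col3"]
])

df_Double_load = spark.read \
    .format("csv") \
    .schema(double_schema) \
    .option("header", True) \
    .load("all_Double.csv")

df_Double_load.printSchema()  # every column should now be reported as double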
