
How to convert all columns of a DataFrame to numeric in Spark Scala?

I loaded a CSV as a DataFrame. I would like to cast all columns to float, but the file has too many columns to write out every column name:

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")

Given this DataFrame as example:

val df = spark.createDataFrame(Seq(("0", 0), ("1", 1), ("2", 0))).toDF("id", "c0")

with schema:

StructType(
    StructField(id,StringType,true), 
    StructField(c0,IntegerType,false))

You can loop over the DataFrame's columns with the .columns method:

import org.apache.spark.sql.functions.col

val castedDF = df.columns.foldLeft(df)((current, c) => current.withColumn(c, col(c).cast("float")))

So the new DF schema looks like:

StructType(
    StructField(id,FloatType,true), 
    StructField(c0,FloatType,false))
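
As an aside, a behavior-equivalent sketch that casts every column in a single select, rather than chaining withColumn calls, could look like the following (the name castedDF2 is just illustrative). Column.cast preserves the original column name, so no renaming is needed:

import org.apache.spark.sql.functions.col

// Cast all columns in one select; this keeps the logical plan flatter
// than repeated withColumn calls on wide DataFrames.
val castedDF2 = df.select(df.columns.map(c => col(c).cast("float")): _*)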

EDIT:

If you want to exclude some columns from casting, you can do something like the following (suppose we want to exclude the column id):

val exclude = Array("id")

val someCastedDF = df.columns
  .filterNot(exclude.contains)
  .foldLeft(df)((current, c) => current.withColumn(c, col(c).cast("float")))

where exclude is an Array of all the columns we want to exclude from casting.

So the schema of this new DF is:

StructType(
    StructField(id,StringType,true), 
    StructField(c0,FloatType,false))

Note that this may not be the best way to do it, but it can be a starting point.
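
Putting the pieces together, here is a minimal end-to-end sketch. The CSV path, header option, and the id column are taken from the question above; everything else (object and variable names) is illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CastAllColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("cast-all-columns").getOrCreate()

    // Read the CSV as in the question, letting Spark infer the initial schema.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("C:/Users/mhattabi/Desktop/dataTest2.csv")

    // Columns to keep as-is ("id" here is just an assumed example).
    val exclude = Array("id")

    // Cast every remaining column to float.
    val casted = df.columns
      .filterNot(exclude.contains)
      .foldLeft(df)((current, c) => current.withColumn(c, col(c).cast("float")))

    casted.printSchema()
    spark.stop()
  }
}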
