
Cannot cast values in a Spark Scala DataFrame

I am trying to parse the data into numbers.

Environment: Databricks, Scala 2.12, Spark 3.1

I have identified columns that were incorrectly parsed as strings; the reason is that the numbers were written sometimes with a comma and sometimes with a dot as the decimal separator.

My approach is to first replace all commas with dots, parse the values as floats, create a schema with floating-point column types, and recreate the DataFrame, but it does not work.

import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, FloatType}
import org.apache.spark.sql.{Row, SparkSession}
import sqlContext.implicits._

// temp is a DataFrame with the data included below
val jj = temp.collect().map(row => Row(row.toSeq.map(it => if (it == null) null else it.asInstanceOf[String].replace(",", ".").toFloat)))
val schemaa = temp.columns.map(colN => StructField(colN, FloatType, true))
val newDatFrame = spark.createDataFrame(jj, schemaa)
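
Note that three separate issues keep this snippet from running even before the casting logic is exercised: Row(...) called with a single Seq argument builds a one-column Row whose value is the whole sequence (Row.fromSeq is needed instead); schemaa is an Array[StructField] while createDataFrame expects a StructType; and createDataFrame accepts an RDD[Row] or java.util.List[Row], not the Array[Row] returned by collect(). A corrected collect-based version, shown only for comparison (the answer below avoids collect entirely), would look roughly like:

// Corrected collect-based variant; it still pulls all rows to the driver,
// so prefer the regexp_replace approach in the answer below.
val rows = temp.collect().map { row =>
  Row.fromSeq(row.toSeq.map {
    case null      => null
    case s: String => s.replace(",", ".").toFloat
  })
}
val schema = StructType(temp.columns.map(c => StructField(c, FloatType, nullable = true)))
val newDatFrame = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)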

[Screenshot of the data omitted; the CSV contents are reproduced below.]

CSV

Podana aktywność,CRP(6 mcy),WBC(6 mcy),SUV (max) w miejscu zapalenia,SUV (max) tła,tumor to background ratio
218,72,"15,2",16,"1,8","8,888888889"
"199,7",200,"16,5","21,5","1,4","15,35714286"
270,42,"11,17","7,6","2,4","3,166666667"
200,226,"29,6",9,"2,8","3,214285714"
200,45,"13,85",17,"2,1","8,095238095"
300,null,"37,8","6,19","2,5","2,476"
290,175,"7,35",9,"2,4","3,75"
279,160,"8,36",13,2,"6,5"
202,24,10,"6,7","2,6","2,576923077"
334,"22,9","8,01",12,"2,4",5
"200,4",null,"25,56",7,"2,4","2,916666667"
198,102,"8,36","7,4","1,8","4,111111111"
"211,6","26,7","10,8","4,2","1,6","2,625"
205,null,null,"9,7","2,07","4,685990338"
326,300,18,14,"2,4","5,833333333"
270,null,null,15,"2,5",6
258,null,null,6,"2,5","2,4"
300,197,"13,5","12,5","2,6","4,807692308"
200,89,"20,9","4,8","1,7","2,823529412"
"201,7",28,null,11,"1,8","6,111111111"
198,9,13,9,2,"4,5"
264,null,"20,3",12,"2,5","4,8"
230,31,"13,3","4,8","1,8","2,666666667"
284,107,"9,92","5,8","1,49","3,89261745"
252,270,null,8,"1,56","5,128205128"
266,null,null,"10,4","1,95","5,333333333"
242,null,null,"14,7",2,"7,35"
259,null,null,"10,01","1,65","6,066666667"
224,null,null,"4,2","1,86","2,258064516"
306,148,10.3,11,1.9,"0,0002488406289"
294,null,5.54,"9,88","1,93","5,119170984"
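
For anyone reproducing this, a minimal way to load the CSV above into the temp DataFrame (the file path is a placeholder). Note that the literal text null in the file has to be declared via the nullValue option, otherwise it is read as the string "null":

val temp = spark.read
  .option("header", "true")    // first line holds the column names
  .option("nullValue", "null") // treat the literal text "null" as SQL NULL
  .csv("/path/to/data.csv")    // hypothetical path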

You can map over the columns using Spark SQL's regexp_replace. collect is not needed and will hurt performance. You might also want to use double instead of float, because some entries have many decimal places.

import org.apache.spark.sql.functions.{col, regexp_replace}

// Replace the decimal comma with a dot in every column, cast to double,
// and keep the original column names.
val new_df = df.select(
    df.columns.map(
        c => regexp_replace(col(c), ",", ".").cast("double").as(c)
    ): _*
)
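
A minimal self-contained sketch of the same idea, using a few of the rows above (the short column names here are placeholders for the Polish headers):

import org.apache.spark.sql.functions.{col, regexp_replace}
import spark.implicits._

// Placeholder column names standing in for the headers in the CSV.
val df = Seq(
  ("218",   "72",  "15,2"),
  ("199,7", "200", "16,5"),
  ("300",   null,  "37,8")
).toDF("activity", "crp", "wbc")

val new_df = df.select(
  df.columns.map(c => regexp_replace(col(c), ",", ".").cast("double").as(c)): _*
)

new_df.printSchema() // every column is now DoubleType; nulls are preserved
new_df.show()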
