I have a dataframe with some attributes, and it looks like this:
+-------+-------+
| Atr1 | Atr2 |
+-------+-------+
| 3,06 | 4,08 |
| 3,03 | 4,08 |
| 3,06 | 4,08 |
| 3,06 | 4,08 |
| 3,06 | 4,08 |
| ... | ... |
+-------+-------+
As you can see, the values of Atr1 and Atr2 are numbers that contain a ',' character. This is because I loaded the data from a CSV file in which the decimal separator of the DoubleType numbers is ','.
When I load the data into a dataframe, the values are read as String, so I cast those attributes from String to DoubleType like this:
df = df.withColumn("Atr1", df["Atr1"].cast(DoubleType()))
df = df.withColumn("Atr2", df["Atr2"].cast(DoubleType()))
But when I do that, the values are converted to null:
+-------+-------+
| Atr1 | Atr2 |
+-------+-------+
| null | null |
| null | null |
| null | null |
| null | null |
| null | null |
| ... | ... |
+-------+-------+
I guess the reason is that DoubleType decimals must be separated by '.' rather than ','. I cannot edit the CSV file, so I want to replace the ',' characters in the dataframe with '.' and then apply the cast to DoubleType.
How can I do that?
You can solve this problem with a user-defined function.
from pyspark.sql import Row
from pyspark.sql.functions import UserDefinedFunction, col
from pyspark.sql.types import DoubleType, StringType
data = [Row(Atr1="3,06", Atr2="4,08"),
Row(Atr1="3,06", Atr2="4,08"),
Row(Atr1="3,06", Atr2="4,08")]
df = sqlContext.createDataFrame(data)
# Create a user-defined function that replaces ',' with '.'
udf = UserDefinedFunction(lambda x: x.replace(",", "."), StringType())
# The parentheses let the chained calls span several lines
out = (df
    .withColumn("Atr1", udf(col("Atr1")).cast(DoubleType()))
    .withColumn("Atr2", udf(col("Atr2")).cast(DoubleType())))
##############################################################
out.show()
+----+----+
|Atr1|Atr2|
+----+----+
|3.06|4.08|
|3.06|4.08|
|3.06|4.08|
+----+----+
##############################################################
out.printSchema()
root
|-- Atr1: double (nullable = true)
|-- Atr2: double (nullable = true)
EDIT: A more compact solution, following a suggestion from the comments.
from pyspark.sql.functions import UserDefinedFunction, col
from pyspark.sql.types import DoubleType
# Replace ',' with '.' and convert to float in a single step
udf = UserDefinedFunction(lambda x: float(x.replace(",", ".")), DoubleType())
out = (df
    .withColumn("Atr1", udf(col("Atr1")))
    .withColumn("Atr2", udf(col("Atr2"))))
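One caveat with the lambda-based UDFs above: calling x.replace on a null value raises an error, and float() raises on text that is not a number. The following is a minimal sketch of a more defensive variant, not part of the original answer. The names comma_decimal_to_float and make_str_to_double_udf are hypothetical, and the choice to map unparseable input to null is an assumption about the desired behaviour.

```python
def comma_decimal_to_float(value):
    """Parse a string such as '3,06' into 3.06.

    Returns None for None input or for text that is not a number,
    which Spark will then store as null.
    """
    if value is None:
        return None
    try:
        # strip() removes padding; replace() swaps the decimal comma for a dot
        return float(value.strip().replace(",", "."))
    except ValueError:
        return None


def make_str_to_double_udf():
    """Wrap the parser in a Spark UDF that returns DoubleType."""
    # Imported lazily so the parser itself can be used without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType
    return udf(comma_decimal_to_float, DoubleType())


# Usage, assuming an existing DataFrame `df` with string columns Atr1 and Atr2:
# str_to_double = make_str_to_double_udf()
# out = (df
#     .withColumn("Atr1", str_to_double(df["Atr1"]))
#     .withColumn("Atr2", str_to_double(df["Atr2"])))
```

Keeping the parsing logic in a plain Python function makes it easy to check on its own before it is applied to a whole column.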
Let's assume you have:
sdf.show()
+-------+-------+
| Atr1| Atr2|
+-------+-------+
| 3,06 | 4,08 |
| 3,03 | 4,08 |
| 3,06 | 4,08 |
| 3,06 | 4,08 |
| 3,06 | 4,08 |
+-------+-------+
Then the following code will produce the desired result:
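from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
# Replace the decimal comma and parse the result as a float
strToDouble = udf(lambda x: float(x.replace(",", ".")), DoubleType())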
sdf = sdf.withColumn("Atr1", strToDouble(sdf['Atr1']))
sdf = sdf.withColumn("Atr2", strToDouble(sdf['Atr2']))
sdf.show()
+----+----+
|Atr1|Atr2|
+----+----+
|3.06|4.08|
|3.03|4.08|
|3.06|4.08|
|3.06|4.08|
|3.06|4.08|
+----+----+
You can also do it with just SQL (the example below uses the Scala API).
val df = sc.parallelize(Array(
("3,06", "4,08"),
("3,06", "4,08"),
("3,06", "4,08"),
("3,06", "4,08"),
("3,06", "4,08"),
("3,06", "4,08"),
("3,06", "4,08"),
("3,06", "4,08")
)).toDF("a", "b")
df.registerTempTable("test")
val doubleDF = sqlContext.sql("select cast(trim(regexp_replace( a , ',' , '.')) as double) as a from test ")
doubleDF.show
+----+
| a|
+----+
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
+----+
doubleDF.printSchema
root
|-- a: double (nullable = true)
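For readers working in PySpark, the same SQL expression can probably be reused through DataFrame.selectExpr, which avoids a Python UDF altogether. This is only a sketch under assumptions: the helper name double_cast_exprs is hypothetical, and the column names are taken from the question rather than from this answer.

```python
def double_cast_exprs(col_names):
    """Build one Spark SQL expression per column that replaces ',' with '.',
    trims whitespace, and casts the result to double, keeping the column name."""
    return [
        "cast(trim(regexp_replace(`{0}`, ',', '.')) as double) as `{0}`".format(name)
        for name in col_names
    ]


# Usage, assuming an existing DataFrame `df` with string columns Atr1 and Atr2:
# out = df.selectExpr(*double_cast_exprs(["Atr1", "Atr2"]))
# out.printSchema()
```

Note that selectExpr returns only the columns it is given, so any other columns that should be kept need to be listed as well.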
Is it possible to pass the column name as a parameter to the col() function in your sample code? Something like this:
# Create a user-defined function that replaces ',' with '.'
udf = UserDefinedFunction(lambda x: x.replace(",", "."), StringType())
col_name1 = "Atr1"
col_name2 = "Atr2"
out = (df
    .withColumn(col_name1, udf(col(col_name1)).cast(DoubleType()))
    .withColumn(col_name2, udf(col(col_name2)).cast(DoubleType())))
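Since col() and withColumn() take the column name as an ordinary string, passing it in a variable should work. With more than two columns, the repetition can be folded into a loop. The sketch below is one possible way to do that and is not from the original thread: convert_columns is a hypothetical helper, and the conversion function is passed in by the caller (for example the strToDouble UDF defined earlier).

```python
from functools import reduce


def convert_columns(df, col_names, convert):
    """Replace each named column with convert(column), one withColumn call per name.

    `df` is expected to be a Spark DataFrame and `convert` a function that maps
    a Column to a Column, such as a UDF.
    """
    return reduce(
        lambda acc, name: acc.withColumn(name, convert(acc[name])),
        col_names,
        df,
    )


# Usage, assuming `df` and the strToDouble UDF already exist:
# out = convert_columns(df, ["Atr1", "Atr2"], strToDouble)
```

Columns that are not listed in col_names are left unchanged.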