
PySpark dataframe isin function datatype conversion

I am using the isin function to filter a PySpark dataframe. Surprisingly, although the column data type (double) does not match the data type of the elements in the list (Decimal), there was a match. Can someone help me understand why this is the case?

Example

(Pdb) df.show(3)
+-----------+---------+----+
|employee_id|threshold|wage|
+-----------+---------+----+
|        AAA|      0.9| 0.5|
|        BBB|      0.8| 0.5|
|        CCC|      0.9| 0.5|
+-----------+---------+----+

(Pdb) df.printSchema()
root
 |-- employee_id: string (nullable = true)
 |-- threshold: double (nullable = true)
 |-- wage: double (nullable = true)

(Pdb) include_thresholds
[Decimal('0.8')]

(Pdb) df.count()
3267                                                                           
(Pdb) df.filter(fn.col("threshold").isin(include_thresholds)).count()
1633

However, if I use the normal "in" operator to test whether 0.8 belongs to include_thresholds, it is obviously false:

(Pdb) 0.8 in include_thresholds
False

Do the functions col or isin implicitly perform a datatype conversion?

When you bring external input to Spark for comparison, it is just taken as a string and up-cast based on the context.

So what you observe with numpy or native Python data types may not hold in Spark.

import decimal

# Decimal built from the float 0.8 captures the float's exact binary value;
# Decimal built from the string '0.8' stays exactly 0.8.
include_thresholds = [decimal.Decimal(0.8)]
include_thresholds2 = [decimal.Decimal('0.8')]

0.8 in include_thresholds   # True
0.8 in include_thresholds2  # False

And, note the values

include_thresholds

[Decimal('0.8000000000000000444089209850062616169452667236328125')]

include_thresholds2

[Decimal('0.8')]
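
The long value above is not arbitrary: decimal.Decimal(0.8) records the exact decimal expansion of the IEEE 754 double closest to 0.8, which you can verify in plain Python (no Spark needed):

import decimal

# Decimal(0.8) preserves the float's exact binary value.
print(decimal.Decimal(0.8))
# 0.8000000000000000444089209850062616169452667236328125

# Formatting the float with enough digits shows the same expansion.
print(format(0.8, '.52f'))
# 0.8000000000000000444089209850062616169452667236328125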

Coming to the dataframe

df = spark.sql("""
with t1 as (
  select 'AAA' c1, 0.9 c2, 0.5 c3 union all
  select 'BBB' c1, 0.8 c2, 0.5 c3 union all
  select 'CCC' c1, 0.9 c2, 0.5 c3
)
select c1 employee_id, cast(c2 as double) threshold, cast(c3 as double) wage
from t1
""")

df.show()
df.printSchema()

+-----------+---------+----+
|employee_id|threshold|wage|
+-----------+---------+----+
|        AAA|      0.9| 0.5|
|        BBB|      0.8| 0.5|
|        CCC|      0.9| 0.5|
+-----------+---------+----+

root
 |-- employee_id: string (nullable = false)
 |-- threshold: double (nullable = false)
 |-- wage: double (nullable = false)

include_thresholds2 would work fine.

df.filter(col("threshold").isin(include_thresholds2)).show()

+-----------+---------+----+
|employee_id|threshold|wage|
+-----------+---------+----+
|        BBB|      0.8| 0.5|
+-----------+---------+----+

Now the following throws an error:

df.filter(col("threshold").isin(include_thresholds)).show()

org.apache.spark.sql.AnalysisException: decimal can only support precision up to 38;

It takes the value 0.8000000000000000444089209850062616169452667236328125 as-is and tries to up-cast it to a decimal, whose precision exceeds the supported maximum of 38, hence the error.
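
If you need to filter with those Python-side Decimal values anyway, one option (a sketch, not part of the original answer; clean_thresholds is just an illustrative name) is to convert them to float first so they already match the double column:

from pyspark.sql import functions as fn

# Convert the Decimal thresholds to Python floats so the isin list
# already has the same type as the double column.
clean_thresholds = [float(t) for t in include_thresholds]
df.filter(fn.col("threshold").isin(clean_thresholds)).show()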

Found the answer in the isin documentation:

https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#isin-java.lang.Object...-

isin

public Column isin(Object... list)

A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.

Note: Since the type of the elements in the list are inferred only during the run time, the elements will be "up-casted" to the most common type for comparison. For example: 1) In the case of "Int vs String", the "Int" will be up-casted to "String" and the comparison will look like "String vs String". 2) In the case of "Float vs Double", the "Float" will be up-casted to "Double" and the comparison will look like "Double vs Double".
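
To see the documented up-casting in isolation, here is a small sketch (assuming a SparkSession named spark with default, non-ANSI settings; behaviour can vary across versions): an integer column compared against string values, which per the note above is evaluated as a string-vs-string comparison, so rows 2 and 3 match.

from pyspark.sql import functions as fn

# Integer column compared against string values: per the isin docs, the
# integers are up-cast to strings before the comparison.
ints = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
ints.filter(fn.col("id").isin(["2", "3"])).show()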
