
PySpark cast String to DecimalType without rounding in case of unmatching scale

I need to cast numbers from a StringType column to DecimalType. The decimal type should have a predefined precision and scale, for example Decimal(2,1). However, some of the numbers in the string column do not fit this precision and scale.

Here is the example:

from pyspark.sql.types import DecimalType
import pyspark.sql.functions as F

df = spark.createDataFrame([("-001.399", )],["String"]).withColumn("Decimal", F.col("String").cast(DecimalType(2,1)))
df.show(truncate=False)

+--------+-------+
|String  |Decimal|
+--------+-------+
|-001.399|-1.4   |
+--------+-------+

The result shows that the value was rounded during the cast, which is not the desired behavior in my use case.

1. Is it possible to cast String to Decimal without rounding?

The expected result would be:

+--------+-------+
|String  |Decimal|
+--------+-------+
|-001.399|-1.3   |
+--------+-------+

2. Is it possible to cast String to Decimal "strictly", so that the cast returns null if the actual scale does not exactly match the specified scale?

The expected result would be (the same as when casting "-0011.399" to Decimal(2,1)):

+--------+-------+
|String  |Decimal|
+--------+-------+
|-001.399|null   |
+--------+-------+

The following regex keeps only 1 digit after the decimal point, so the value is truncated rather than rounded. You can apply this transformation before casting to the decimal type.

F.regexp_extract("String", r"^([^.]*(\.\d)?)", 1)
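To see what the pattern captures without starting a Spark session, here is a small sketch using Python's re module. Spark evaluates the regex with Java's engine, so this assumes both engines treat this simple pattern the same way; the helper name keep_one_fraction_digit is made up for the illustration.

```python
import re

# Same pattern as above: everything before the first ".",
# then optionally the "." followed by exactly one digit.
PATTERN = re.compile(r"^([^.]*(\.\d)?)")


def keep_one_fraction_digit(s: str) -> str:
    """Return what capture group 1 of the pattern matches in s."""
    # The pattern can match an empty prefix, so match() never returns None.
    return PATTERN.match(s).group(1)


for s in ["-001.399", "-001.", "-.39", "001"]:
    print(s, "->", keep_one_fraction_digit(s))
```

A trailing "." with no digit after it is dropped, because the optional group needs a digit to match.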

Full test:

from pyspark.sql import functions as F
df = spark.createDataFrame([("-001.399",), ("001.399",), ("-001",), ("001",), ("-001.",), ("001.",), ("-001.39",), ("-001.3",), ("-.39",), (".39",)], ["String"])

df = df.withColumn("String2", F.regexp_extract("String", r"^([^.]*(\.\d)?)", 1))
df = df.withColumn("Decimal", F.col("String2").cast('decimal(2,1)'))

df.show()
# +--------+-------+-------+
# |  String|String2|Decimal|
# +--------+-------+-------+
# |-001.399| -001.3|   -1.3|
# | 001.399|  001.3|    1.3|
# |    -001|   -001|   -1.0|
# |     001|    001|    1.0|
# |   -001.|   -001|   -1.0|
# |    001.|    001|    1.0|
# | -001.39| -001.3|   -1.3|
# |  -001.3| -001.3|   -1.3|
# |    -.39|    -.3|   -0.3|
# |     .39|     .3|    0.3|
# +--------+-------+-------+

If you need more digits after the decimal point, you can add a quantifier such as {0,3} after \d in the regex (for up to 3 digits after the decimal point).

r"^([^.]*(\.\d{0,3})?)"
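The truncation above covers the first part of the question. For the second part (null when the string has too many fractional digits), one possible approach is to validate the string with a regex first and only cast when it matches. The sketch below is an assumption about what "strict" should mean: it accepts at most `scale` fractional digits, so "-001" and "-001.3" pass for scale 1 while "-001.399" does not. The names strict_pattern and fits_scale are hypothetical helpers, and the check runs in plain Python so it can be tried without Spark.

```python
import re


def strict_pattern(scale: int) -> str:
    """Build a regex for an optionally signed number with at most `scale` fractional digits.

    Assumption: "strict" means the string must not carry more fractional
    digits than the target scale; fewer digits are still accepted.
    """
    return rf"^[+-]?(\d+(\.\d{{0,{scale}}})?|\.\d{{1,{scale}}})$"


def fits_scale(s: str, scale: int) -> bool:
    """Local check of the pattern using Python's re module."""
    return re.search(strict_pattern(scale), s) is not None


print(strict_pattern(1))
print(fits_scale("-001.399", 1))  # False: three fractional digits
print(fits_scale("-001.3", 1))    # True

# In PySpark the same pattern could guard the cast (not executed here):
#   df.withColumn(
#       "Decimal",
#       F.when(F.col("String").rlike(strict_pattern(1)),
#              F.col("String").cast("decimal(2,1)")),
#   )
# F.when without .otherwise() yields null for rows that do not match.
```

The pattern only checks the scale. Values with too many integer digits, such as "-0011.3", are left to the cast itself, which, as the question notes, already returns null when the value does not fit the precision.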


 