
Using a column's value in casting another column in a Spark dataframe

I have a dataframe like this:

rdd1 = sc.parallelize([(100,2,1234.5678),(101,3,1234.5678)])
df = spark.createDataFrame(rdd1,(['id','dec','val']))

+---+---+---------+
| id|dec|      val|
+---+---+---------+
|100|  2|1234.5678|
|101|  3|1234.5678|
+---+---+---------+

Based on the value available in the dec column, I want the cast to be applied to the val column. For example, if dec = 2, then I want val to be cast to DecimalType(7,2).

I tried the following, but it is not working:

df.select(col('id'),col('dec'),col('val'),col('val').cast(DecimalType(7,col('dec'))).cast(StringType()).alias('modVal')).show()

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/column.py", line 419, in cast
    jdt = spark._jsparkSession.parseDataType(dataType.json())
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 69, in json
    return json.dumps(self.jsonValue(),
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 225, in jsonValue
    return "decimal(%d,%d)" % (self.precision, self.scale)
TypeError: %d format: a number is required, not Column

The same works if I hard-code the scale to a specific number, which is straightforward:

df.select(col('id'),col('dec'),col('val'),col('val').cast(DecimalType(7,3)).cast(StringType()).alias('modVal')).show()

+---+---+---------+--------+
| id|dec|      val|  modVal|
+---+---+---------+--------+
|100|  2|1234.5678|1234.568|
|101|  3|1234.5678|1234.568|
+---+---+---------+--------+

Please help me with this.

Columns in Spark (or any relational system, for that matter) have to be homogeneous. An operation like this one, where cast produces a different type for each row, is not only unsupported but also doesn't make much sense.
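If only a few distinct dec values can occur, one workaround (a sketch, not part of the original answer) is to branch explicitly with when/otherwise and cast every branch down to a single common output type such as string:

from pyspark.sql.functions import col, when

# Each branch casts to a per-row precision, but every branch ends as a
# string, so the resulting column stays homogeneous.
df.withColumn(
    'modVal',
    when(col('dec') == 2, col('val').cast('decimal(7,2)').cast('string'))
    .when(col('dec') == 3, col('val').cast('decimal(7,3)').cast('string'))
    .otherwise(col('val').cast('string'))
).show()

This only scales to a known, small set of dec values; for an arbitrary per-row precision, see the UDF approach below.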

As mentioned by user10281832, you can't have different data types in the same column.

Since the formatting is what actually matters here, you can convert the column to string type and then do the formatting. Because the number of decimals differs per row, you can't use a built-in Spark function directly; instead, define a custom UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Format num with prec decimal places; "%0.*f" takes the precision
# from the first argument, so it can vary per row.
def format_val(num, prec):
    return "%0.*f" % (prec, num)

# Wrap the Python function as a Spark UDF that returns a string.
format_val_udf = udf(format_val, StringType())

df.withColumn('modVal', format_val_udf('val', 'dec'))
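Applied to the example dataframe, each row is rendered with its own precision; the output should look roughly like this (assuming Python's standard %f rounding):

df.withColumn('modVal', format_val_udf('val', 'dec')).show()

+---+---+---------+--------+
| id|dec|      val|  modVal|
+---+---+---------+--------+
|100|  2|1234.5678| 1234.57|
|101|  3|1234.5678|1234.568|
+---+---+---------+--------+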
