Using a column's value when casting another column in a Spark dataframe
I have a dataframe like this:
rdd1 = sc.parallelize([(100,2,1234.5678),(101,3,1234.5678)])
df = spark.createDataFrame(rdd1,(['id','dec','val']))
+---+---+---------+
| id|dec| val|
+---+---+---------+
|100| 2|1234.5678|
|101| 3|1234.5678|
+---+---+---------+
Based on the value available in the dec column, I want the cast applied to the val column. For example, if dec = 2, then I want val to be cast to DecimalType(7,2).
I tried the following, but it is not working:
df.select(col('id'),col('dec'),col('val'),col('val').cast(DecimalType(7,col('dec'))).cast(StringType()).alias('modVal')).show()
Error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/column.py", line 419, in cast
jdt = spark._jsparkSession.parseDataType(dataType.json())
File "/usr/lib/spark/python/pyspark/sql/types.py", line 69, in json
return json.dumps(self.jsonValue(),
File "/usr/lib/spark/python/pyspark/sql/types.py", line 225, in jsonValue
return "decimal(%d,%d)" % (self.precision, self.scale)
TypeError: %d format: a number is required, not Column
The same works if I hard-code the scale to a specific number, which is straightforward:
df.select(col('id'),col('dec'),col('val'),col('val').cast(DecimalType(7,3)).cast(StringType()).alias('modVal')).show()
+---+---+---------+--------+
| id|dec| val| modVal|
+---+---+---------+--------+
|100| 2|1234.5678|1234.568|
|101| 3|1234.5678|1234.568|
+---+---+---------+--------+
Please help me with this.
Columns in Spark (or any relational system, for that matter) have to be homogeneous. An operation like this, where cast would convert each row to a different type, is not only unsupported but makes little sense.
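That said, if dec only ever takes a small set of known values, one workaround is to branch per value with when/otherwise so that each cast uses a constant scale, then cast the result to string so the output column stays homogeneous. A minimal sketch, assuming dec is only ever 2 or 3 as in the sample data (rows with other values would come out null):
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, StringType

# One branch per known value of dec; each branch uses a constant scale,
# and the result is cast to string so every row has the same type.
df.select(
    'id', 'dec', 'val',
    F.when(F.col('dec') == 2, F.col('val').cast(DecimalType(7, 2)).cast(StringType()))
     .when(F.col('dec') == 3, F.col('val').cast(DecimalType(7, 3)).cast(StringType()))
     .alias('modVal')
).show()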
As mentioned by user10281832, you can't have different data types in the same column.
Since the formatting is what matters here, you can convert the column to string type and then do the formatting. Because the number of decimals differs per row, you can't use the built-in Spark functions directly; instead, define a custom UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def format_val(num, prec):
    # Format num with prec decimal places; the '*' takes the precision
    # from the argument list, so it can vary per row.
    return "%0.*f" % (prec, num)

format_val_udf = udf(format_val, StringType())

df.withColumn('modVal', format_val_udf('val', 'dec'))
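On the sample dataframe above, appending .show() should display something like the following (column widths may differ):
df.withColumn('modVal', format_val_udf('val', 'dec')).show()
# +---+---+---------+--------+
# | id|dec|      val|  modVal|
# +---+---+---------+--------+
# |100|  2|1234.5678| 1234.57|
# |101|  3|1234.5678|1234.568|
# +---+---+---------+--------+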