
Using a column's value in casting another column in a Spark dataframe

I have a dataframe like this:

rdd1 = sc.parallelize([(100,2,1234.5678),(101,3,1234.5678)])
df = spark.createDataFrame(rdd1,(['id','dec','val']))

+---+---+---------+
| id|dec|      val|
+---+---+---------+
|100|  2|1234.5678|
|101|  3|1234.5678|
+---+---+---------+

Based on the value available in the dec column, I want the cast to be applied to the val column. For example, if dec = 2, then I want val to be cast to DecimalType(7,2).

I tried the following, but it is not working:

df.select(col('id'),col('dec'),col('val'),col('val').cast(DecimalType(7,col('dec'))).cast(StringType()).alias('modVal')).show()

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/column.py", line 419, in cast
    jdt = spark._jsparkSession.parseDataType(dataType.json())
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 69, in json
    return json.dumps(self.jsonValue(),
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 225, in jsonValue
    return "decimal(%d,%d)" % (self.precision, self.scale)
TypeError: %d format: a number is required, not Column

The same works if I hard-code the scale to a specific number, which is straightforward:

df.select(col('id'),col('dec'),col('val'),col('val').cast(DecimalType(7,3)).cast(StringType()).alias('modVal')).show()

+---+---+---------+--------+
| id|dec|      val|  modVal|
+---+---+---------+--------+
|100|  2|1234.5678|1234.568|
|101|  3|1234.5678|1234.568|
+---+---+---------+--------+

Please help me with this.

Columns in Spark (or any relational system, for that matter) have to be homogeneous. An operation like this one, where cast produces a different type for each row, is not only unsupported but also doesn't make much sense.
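If only a few distinct dec values can occur, one workaround (a sketch, not part of the original answer) is to branch explicitly with when/otherwise and cast every branch down to a single common output type such as string:

from pyspark.sql.functions import col, when

# Each branch casts to a per-row precision, but every branch ends as a
# string, so the resulting column stays homogeneous.
df.withColumn(
    'modVal',
    when(col('dec') == 2, col('val').cast('decimal(7,2)').cast('string'))
    .when(col('dec') == 3, col('val').cast('decimal(7,3)').cast('string'))
    .otherwise(col('val').cast('string'))
).show()

This only scales to a known, small set of dec values; for an arbitrary per-row precision, see the UDF approach below.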

As mentioned by user10281832, you can't have different data types in the same column.

Since the formatting is what actually matters here, you can convert the column to string type and then do the formatting. Because the number of decimals differs per row, you can't use a built-in Spark function directly; instead, define a custom UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Format num with prec decimal places; "%0.*f" takes the precision
# from the first argument, so it can vary per row.
def format_val(num, prec):
    return "%0.*f" % (prec, num)

# Wrap the Python function as a Spark UDF that returns a string.
format_val_udf = udf(format_val, StringType())

df.withColumn('modVal', format_val_udf('val', 'dec'))
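Applied to the example dataframe, each row is rendered with its own precision; the output should look roughly like this (assuming Python's standard %f rounding):

df.withColumn('modVal', format_val_udf('val', 'dec')).show()

+---+---+---------+--------+
| id|dec|      val|  modVal|
+---+---+---------+--------+
|100|  2|1234.5678| 1234.57|
|101|  3|1234.5678|1234.568|
+---+---+---------+--------+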
