
round based on column value in pyspark

I need to round off summary_measure_value based on the reading_precision value.

from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import *

df = spark.createDataFrame(
[(123, 2897402, 43.25, 2),
(124, 2897402, 49.25, 0),
(125, 2897402, 43.25, 2), 
(126, 2897402, 48.75, 0)]
, ['model_id','lab_test_id','summary_measure_value','reading_precision'])


partition_by_reading = [
    "model_id",
    "lab_test_id"
]
df.withColumn(
        "reading_value",
        round(avg("summary_measure_value").over(
                    Window.partitionBy(partition_by_reading))
                ,col("reading_precision"))).show()

I'm getting TypeError: 'Column' object is not callable

The pyspark round function expects a constant value for the scale/precision. You may have better luck creating a custom udf that applies the rounding logic.

I've included an example below from a test I've done based on your shared example:

udf_round = F.udf(lambda val,precision: round(val,precision))
df.withColumn(
        "reading_value",
        udf_round(F.avg("summary_measure_value").over(
                    Window.partitionBy(partition_by_reading))
                ,F.col("reading_precision"))).show()

NB. round in the udf above refers to the built-in Python round function. (If you have run from pyspark.sql.functions import * as in the question, the built-in is shadowed by pyspark's round; in that case reference it explicitly, e.g. import builtins and call builtins.round.)

result:

+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|     125|    2897402|                43.25|                2|        43.25|
|     124|    2897402|                49.25|                0|         49.0|
|     123|    2897402|                43.25|                2|        43.25|
|     126|    2897402|                48.75|                0|         49.0|
+--------+-----------+---------------------+-----------------+-------------+
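
For comparison, a udf is not strictly required: the per-row precision can also be expressed with built-in column functions by scaling by 10 to the power of reading_precision, rounding with a constant scale of 0, and scaling back. A minimal sketch, assuming the df and partition_by_reading defined above (note that Spark's round uses HALF_UP rounding while Python's built-in round rounds half to even, so the two approaches can differ at exact .5 ties):

df.withColumn(
        "reading_value",
        F.round(
            F.avg("summary_measure_value").over(
                Window.partitionBy(partition_by_reading))
            * F.pow(F.lit(10), F.col("reading_precision")), 0)
        / F.pow(F.lit(10), F.col("reading_precision"))).show()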

Edit 1:

If you have erroneous data, the following udf is more robust:

@udf
def udf_round(val,precision) -> float:
    try:
        # ensure that value is a float and precision is an integer
        return float(round(float(val),int(precision)))
    except:
        return val # return val if there are any errors
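
The decorated udf_round can be applied the same way as the earlier one; for example, reusing the df and window from above:

df.withColumn(
        "reading_value",
        udf_round(F.avg("summary_measure_value").over(
                    Window.partitionBy(partition_by_reading))
                ,F.col("reading_precision"))).show()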

Out of curiosity, I also applied a pandas_udf iterating over multiple series. It seems faster than I thought.

from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import *
from typing import Iterator, Tuple
import pandas as pd


@pandas_udf('double')
def round_series(iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    # pandas Series.round only accepts an integer number of decimals,
    # so apply the per-row precision by scaling, rounding, then scaling back
    return ((a * 10.0 ** b).round(0) / 10.0 ** b for a, b in iterator)

df1 = df.withColumn(
        "reading_value",
        F.avg("summary_measure_value").over(
            Window.partitionBy("model_id", "lab_test_id")))
df1 = df1.withColumn(
        "reading_value",
        round_series('summary_measure_value', 'reading_precision'))
df1.show()

+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|     123|    2897402|                43.25|                2|        43.25|
|     124|    2897402|                49.11|                1|         49.1|
|     125|    2897402|                43.25|                2|        43.25|
|     126|    2897402|                48.75|                0|         49.0|
+--------+-----------+---------------------+-----------------+-------------+
