
round based on column value in pyspark

I need to round off summary_measure_value based on the reading_precision value.

from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import *

df = spark.createDataFrame(
[(123, 2897402, 43.25, 2),
(124, 2897402, 49.25, 0),
(125, 2897402, 43.25, 2), 
(126, 2897402, 48.75, 0)]
, ['model_id','lab_test_id','summary_measure_value','reading_precision'])


partition_by_reading = [
    "model_id",
    "lab_test_id"
]
df.withColumn(
        "reading_value",
        round(avg("summary_measure_value").over(
                    Window.partitionBy(partition_by_reading))
                ,col("reading_precision"))).show()

I'm getting TypeError: 'Column' object is not callable

The pyspark round function expects a constant value for the scale/precision. You may have better luck creating a custom udf that applies the rounding logic.

I've included an example below from a test I've done based on your shared example:

udf_round = F.udf(lambda val,precision: round(val,precision))
df.withColumn(
        "reading_value",
        udf_round(F.avg("summary_measure_value").over(
                    Window.partitionBy(partition_by_reading))
                ,F.col("reading_precision"))).show()

NB. round in the udf above refers to the built-in Python round function. (If you have run from pyspark.sql.functions import * as in the question, the built-in is shadowed by pyspark's round; in that case reference it explicitly, e.g. import builtins and call builtins.round.)

result:

+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|     125|    2897402|                43.25|                2|        43.25|
|     124|    2897402|                49.25|                0|         49.0|
|     123|    2897402|                43.25|                2|        43.25|
|     126|    2897402|                48.75|                0|         49.0|
+--------+-----------+---------------------+-----------------+-------------+
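
For comparison, a udf is not strictly required: the per-row precision can also be expressed with built-in column functions by scaling by 10 to the power of reading_precision, rounding with a constant scale of 0, and scaling back. A minimal sketch, assuming the df and partition_by_reading defined above (note that Spark's round uses HALF_UP rounding while Python's built-in round rounds half to even, so the two approaches can differ at exact .5 ties):

df.withColumn(
        "reading_value",
        F.round(
            F.avg("summary_measure_value").over(
                Window.partitionBy(partition_by_reading))
            * F.pow(F.lit(10), F.col("reading_precision")), 0)
        / F.pow(F.lit(10), F.col("reading_precision"))).show()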

Edit 1:

If you have erroneous data, the following udf is more robust:

@udf
def udf_round(val,precision) -> float:
    try:
        # ensure that value is a float and precision is an integer
        return float(round(float(val),int(precision)))
    except:
        return val # return val if there are any errors
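
The decorated udf_round can be applied the same way as the earlier one; for example, reusing the df and window from above:

df.withColumn(
        "reading_value",
        udf_round(F.avg("summary_measure_value").over(
                    Window.partitionBy(partition_by_reading))
                ,F.col("reading_precision"))).show()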

Out of curiosity, I also applied a pandas_udf iterating over multiple series. It seems faster than I thought.

from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import *
from typing import Iterator, Tuple
import pandas as pd


@pandas_udf('double')
def round_series(iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    # pandas Series.round only accepts an integer number of decimals,
    # so apply the per-row precision by scaling, rounding, then scaling back
    return ((a * 10.0 ** b).round(0) / 10.0 ** b for a, b in iterator)

df1 = df.withColumn(
        "reading_value",
        F.avg("summary_measure_value").over(
            Window.partitionBy("model_id", "lab_test_id")))
df1 = df1.withColumn(
        "reading_value",
        round_series('summary_measure_value', 'reading_precision'))
df1.show()

+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|     123|    2897402|                43.25|                2|        43.25|
|     124|    2897402|                49.11|                1|         49.1|
|     125|    2897402|                43.25|                2|        43.25|
|     126|    2897402|                48.75|                0|         49.0|
+--------+-----------+---------------------+-----------------+-------------+
