Round based on column value in PySpark
I need to round off summary_measure_value based on the reading_precision value:
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import *

df = spark.createDataFrame(
    [(123, 2897402, 43.25, 2),
     (124, 2897402, 49.25, 0),
     (125, 2897402, 43.25, 2),
     (126, 2897402, 48.75, 0)],
    ['model_id', 'lab_test_id', 'summary_measure_value', 'reading_precision'])

partition_by_reading = [
    "model_id",
    "lab_test_id"
]

df.withColumn(
    "reading_value",
    round(avg("summary_measure_value").over(
        Window.partitionBy(partition_by_reading)),
    col("reading_precision"))).show()
I'm getting TypeError: 'Column' object is not callable.
The PySpark round function expects a constant (literal) value for the scale/precision; it cannot take a Column. You may have better luck creating a custom udf that applies the rounding logic. I've included an example below from a test based on your shared data (a UDF-free alternative is also sketched after the result table):
udf_round = F.udf(lambda val, precision: round(val, precision))

df.withColumn(
    "reading_value",
    udf_round(F.avg("summary_measure_value").over(
        Window.partitionBy(partition_by_reading)),
    F.col("reading_precision"))).show()
NB. round in the udf above refers to the built-in Python round function.
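Also note that F.udf defaults to a StringType return type, so reading_value above comes back as a string column. Assuming a numeric column is wanted, the return type can be declared explicitly, e.g.:

from pyspark.sql.types import DoubleType

# same lambda as above, but with an explicit double return type (a sketch)
udf_round = F.udf(lambda val, precision: round(val, precision), DoubleType())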
Result:
+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|     125|    2897402|                43.25|                2|        43.25|
|     124|    2897402|                49.25|                0|         49.0|
|     123|    2897402|                43.25|                2|        43.25|
|     126|    2897402|                48.75|                0|         49.0|
+--------+-----------+---------------------+-----------------+-------------+
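If you would rather avoid a udf entirely, the constant-scale restriction can be worked around with plain column arithmetic, since pow accepts a column exponent. A minimal sketch, assuming the same df and window as above (note it rounds half up, while Python's built-in round rounds half to even):

factor = F.pow(F.lit(10), F.col("reading_precision"))

df.withColumn(
    "reading_value",
    F.floor(F.avg("summary_measure_value").over(
        Window.partitionBy(partition_by_reading)) * factor + 0.5) / factor
).show()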
Edit 1:
If you have erroneous data, the following udf is more robust:
@udf
def udf_round(val, precision) -> float:
    try:
        # ensure that value is a float and precision is an integer
        return float(round(float(val), int(precision)))
    except:
        return val  # return val if there are any errors
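Usage is unchanged; for example, mirroring the earlier snippet:

df.withColumn(
    "reading_value",
    udf_round(F.avg("summary_measure_value").over(
        Window.partitionBy(partition_by_reading)),
    F.col("reading_precision"))).show()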
Out of curiosity, I also applied a pandas_udf iterating over multiple series. It seems faster than I thought:
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import *
from typing import Iterator, Tuple
import pandas as pd

@pandas_udf('double')
def round_series(iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    # round each batch of values to its per-row precision
    return (a.round(b) for a, b in iterator)

df.withColumn(
    "reading_value",
    F.avg("summary_measure_value").over(
        Window.partitionBy("model_id", "lab_test_id"))
).withColumn(
    "reading_value",
    round_series('summary_measure_value', 'reading_precision')
).show()
+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
| 123| 2897402| 43.25| 2| 43.25|
| 124| 2897402| 49.11| 1| 49.1|
| 125| 2897402| 43.25| 2| 43.25|
| 126| 2897402| 48.75| 0| 49.0|
+--------+-----------+---------------------+-----------------+-------------+
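The same logic can also be written as a simpler Series-to-Series pandas_udf. A sketch using the built-in round element-wise (assumes no nulls in either column):

@pandas_udf('double')
def round_scalar(vals: pd.Series, precision: pd.Series) -> pd.Series:
    # apply the built-in round row by row
    return pd.Series([round(v, int(p)) for v, p in zip(vals, precision)])

df.withColumn("reading_value",
              round_scalar('summary_measure_value', 'reading_precision')).show()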