
Pyspark - Create new column with the RMSE of two other columns in dataframe

I am fairly new to Pyspark. I have a dataframe, and I would like to create a third column holding the RMSE between col_1 and col_2. I am using a user-defined lambda function to make the RMSE calculation, but keep getting this error: AttributeError: 'int' object has no attribute 'mean'

from pyspark.sql.functions import udf,col
from pyspark.sql.types import IntegerType
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

spark = SparkSession.builder.config("spark.driver.memory", "30g").appName('linear_data_pipeline').getOrCreate()
sc = spark.sparkContext

sqlContext = SQLContext(sc)
old_df = sqlContext.createDataFrame(sc.parallelize(
    [(0, 1), (1, 3), (2, 5)]), ('col_1', 'col_2'))
function = udf(lambda col1, col2 : (((col1 - col2)**2).mean())**.5)
new_df = old_df.withColumn('col_n',function(col('col_1'), col('col_2')))
new_df.show()

How do I best go about fixing this issue? I would also like to find the RMSE/mean, mean absolute error, mean absolute error/mean, median absolute error, and median percent error, but once I figure out how to calculate one, I should be good on the others.

I think you are a bit confused. The RMSE is computed from a whole collection of points, so you should not calculate it separately for each pair of values; you have to aggregate over all the values in the two columns.

This could work:

from pyspark.sql import functions as F

# Mean of the squared differences over the whole dataframe, then square root.
# Note the column names are col_1 and col_2, and that Python built-ins like
# sum() and len() do not work on a Spark dataframe - use an aggregation instead.
mse = old_df.select(F.avg((F.col('col_1') - F.col('col_2')) ** 2).alias('mse')).first()['mse']
rmse = mse ** 0.5
print(rmse)

I don't think you need a udf in that case. I think it is possible using only pyspark.sql.functions.

I can propose the following untested option:

import pyspark.sql.functions as psf

rmse = (old_df
        .withColumn("squarederror",
                    psf.pow(psf.col("col_1") - psf.col("col_2"),
                            psf.lit(2)))
        .agg(psf.avg(psf.col("squarederror")).alias("mse"))
        .withColumn("rmse", psf.sqrt(psf.col("mse"))))

rmse.collect()

Using the same logic, you can get other performance statistics.

