
Custom function over pyspark dataframe

I'm trying to apply a custom function over the rows of a pyspark dataframe. The function takes the row and two other vectors of the same dimension. It sums the values of the third vector at the positions where the row matches the second vector and returns that total.

import pandas as pd
import numpy as np

Function:

def V_sum(row,b,c):
    return float(np.sum(c[row==b]))
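For instance, using the B and V vectors from the pandas example below, the first row [0, 1, 0, 0] matches B = [1, 0, 1, 0] only in the last position, so the result is V[3] = 4.0:

row = np.array([0, 1, 0, 0])
B = np.array([1, 0, 1, 0])
V = np.array([5, 1, 2, 4])
V_sum(row, B, V)  # 4.0 -- only the last position matches, so only V[3] contributes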

What I want to achieve is simple with pandas:

pd_df = pd.DataFrame([[0,1,0,0],[1,1,0,0],[0,0,1,0],[1,0,1,1],[1,1,0,0]], columns=['t1', 't2', 't3', 't4'])
   t1  t2  t3  t4
0   0   1   0   0
1   1   1   0   0
2   0   0   1   0
3   1   0   1   1
4   1   1   0   0

B = np.array([1,0,1,0])
V = np.array([5,1,2,4])

pd_df.apply(lambda x: V_sum(x, B, V), axis=1)
0    4.0
1    9.0
2    7.0
3    8.0
4    9.0
dtype: float64

I would like to perform the same action in pyspark.

from pyspark import SparkConf, SparkContext, SQLContext
sc = SparkContext("local")
sqlContext = SQLContext(sc)

spk_df = sqlContext.createDataFrame([[0,1,0,0],[1,1,0,0],[0,0,1,0],[1,0,1,1],[1,1,0,0]], ['t1', 't2', 't3', 't4'])
spk_df.show()
+---+---+---+---+
| t1| t2| t3| t4|
+---+---+---+---+
|  0|  1|  0|  0|
|  1|  1|  0|  0|
|  0|  0|  1|  0|
|  1|  0|  1|  1|
|  1|  1|  0|  0|
+---+---+---+---+

I thought about using a UDF, but I can't get it to work:

from pyspark.sql.types import FloatType
import pyspark.sql.functions as F

V_sum_udf = F.udf(V_sum, FloatType()) 
spk_df.select(V_sum_udf(F.array(*(F.col(x) for x in spk_df.columns))).alias("results")).show()

Clearly I'm doing something wrong because it yields:

Py4JJavaError: An error occurred while calling o27726.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 91, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

If you've got non-column data that you want to use inside a function along with column data to compute a new column, a UDF + closure + withColumn as described here is a good place to start. (The attempt above most likely fails because F.udf(V_sum, FloatType()) wraps V_sum directly, so it gets called with only the array column while it expects three arguments; closing over B and V avoids that.)

B = [2,0,1,0] 
V = [5,1,2,4]

v_sum_udf = F.udf(lambda row: V_sum(row, B, V), FloatType())
spk_df.withColumn("results", v_sum_udf(F.array(*(F.col(x) for x in spk_df.columns))))
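
One caveat worth flagging (my assumption, since the vectors above are left as plain Python lists): F.array delivers the row to the Python UDF as an ordinary list, while V_sum relies on numpy boolean indexing, so converting everything to numpy arrays inside the closure keeps the original function working as intended. A minimal sketch, reusing the B and V from the pandas example:

import numpy as np
from pyspark.sql.types import FloatType
import pyspark.sql.functions as F

B = np.array([1, 0, 1, 0])  # same vectors as in the pandas example
V = np.array([5, 1, 2, 4])

# The lambda closes over B and V; the incoming row is a plain Python list,
# so convert it to a numpy array before handing it to V_sum.
v_sum_udf = F.udf(lambda row: V_sum(np.array(row), B, V), FloatType())

spk_df.withColumn(
    "results",
    v_sum_udf(F.array(*(F.col(x) for x in spk_df.columns)))
).show()

With these vectors the results column should reproduce the 4.0, 9.0, 7.0, 8.0, 9.0 values computed with pandas above.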
