
How to reference a dataframe when in an UDF on another dataframe?

How do you reference a pyspark dataframe during the execution of a UDF on another dataframe?

Here's a dummy example. I am creating two dataframes, scores and lastnames, and each contains a column that is the same across the two dataframes. In the UDF applied to scores, I want to filter lastnames and return a string found in last_name.

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sc = SparkContext("local")
sqlCtx = SQLContext(sc)


# Generate Random Data
import itertools
import random
student_ids = ['student1', 'student2', 'student3']
subjects = ['Math', 'Biology', 'Chemistry', 'Physics']
random.seed(1)
data = []

for (student_id, subject) in itertools.product(student_ids, subjects):
    data.append((student_id, subject, random.randint(0, 100)))

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
            StructField("student_id", StringType(), nullable=False),
            StructField("subject", StringType(), nullable=False),
            StructField("score", IntegerType(), nullable=False)
    ])

# Create DataFrame 
rdd = sc.parallelize(data)
scores = sqlCtx.createDataFrame(rdd, schema)

# create another dataframe
last_name = ["Granger", "Weasley", "Potter"]
data2 = []
for i in range(len(student_ids)):
    data2.append((student_ids[i], last_name[i]))

schema = StructType([
            StructField("student_id", StringType(), nullable=False),
            StructField("last_name", StringType(), nullable=False)
    ])

rdd = sc.parallelize(data2)
lastnames = sqlCtx.createDataFrame(rdd, schema)


scores.show()
lastnames.show()


from pyspark.sql.functions import udf
def getLastName(sid):
    # attempt to reference the lastnames DataFrame from inside the UDF
    tmp_df = lastnames.filter(lastnames.student_id == sid)
    return tmp_df.last_name

getLastName_udf = udf(getLastName, StringType())
scores.withColumn("last_name", getLastName_udf("student_id")).show(10)

And the following is the last part of the trace:

Py4JError: An error occurred while calling o114.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

You can't directly reference a dataframe (or an RDD) from inside a UDF. The DataFrame object is a handle on your driver that Spark uses to represent the data and actions that will happen out on the cluster. The code inside your UDFs will run out on the cluster at a time of Spark's choosing. Spark does this by serializing that code, making copies of any variables included in the closure, and sending them out to each worker.

What you want to do instead is use the constructs Spark provides in its API to join/combine the two DataFrames. If one of the data sets is small, you can manually send out the data in a broadcast variable and then access it from your UDF. Otherwise, you can just create the two dataframes like you did, then use the join operation to combine them. Something like this should work:

joined = scores.withColumnRenamed("student_id", "join_id")
joined = joined.join(lastnames, joined.join_id == lastnames.student_id)\
               .drop("join_id")
joined.show()

+---------+-----+----------+---------+
|  subject|score|student_id|last_name|
+---------+-----+----------+---------+
|     Math|   13|  student1|  Granger|
|  Biology|   85|  student1|  Granger|
|Chemistry|   77|  student1|  Granger|
|  Physics|   25|  student1|  Granger|
|     Math|   50|  student2|  Weasley|
|  Biology|   45|  student2|  Weasley|
|Chemistry|   65|  student2|  Weasley|
|  Physics|   79|  student2|  Weasley|
|     Math|    9|  student3|   Potter|
|  Biology|    2|  student3|   Potter|
|Chemistry|   84|  student3|   Potter|
|  Physics|   43|  student3|   Potter|
+---------+-----+----------+---------+
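As a side note, since both DataFrames already share the student_id column, an equivalent sketch is to pass the column name directly to join; Spark then keeps a single copy of the join key in the result, so the rename/drop step isn't needed:

# equivalent sketch: equi-join on the shared column name,
# which leaves only one student_id column in the output
joined = scores.join(lastnames, on="student_id")
joined.show()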

It's also worth noting that, under the hood, Spark's DataFrame API has an optimization where a DataFrame that is part of a join can be converted to a broadcast variable to avoid a shuffle if it is small enough. So if you do the join method listed above, you should get the best possible performance without sacrificing the ability to handle larger data sets.
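If you want to make that broadcast explicit rather than relying on the optimizer, PySpark also exposes a broadcast hint. A minimal sketch, assuming lastnames is small enough to fit in memory on each executor:

from pyspark.sql.functions import broadcast

# hint Spark to ship the small lastnames DataFrame to every executor,
# turning the join into a broadcast join and avoiding a shuffle
joined = scores.join(broadcast(lastnames), on="student_id")
joined.show()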

Change the (student_id, last_name) pairs to a dictionary for easy lookup of names:

data2 = {}
for i in range(len(student_ids)):
    data2[student_ids[i]] = last_name[i]

Instead of creating an RDD and turning it into a DataFrame, create a broadcast variable:

# rdd = sc.parallelize(data2)
# lastnames = sqlCtx.createDataFrame(rdd, schema)
lastnames = sc.broadcast(data2)  

Now access it in the UDF via the value attribute of the broadcast variable (lastnames):

from pyspark.sql.functions import udf
def getLastName(sid):
    return lastnames.value[sid]
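With the dictionary broadcast, the UDF can be registered and applied just as in the original attempt; a minimal sketch, reusing the names from the question:

from pyspark.sql.types import StringType

# register the UDF and apply it; the lookup now happens against the
# broadcast dictionary locally on each executor
getLastName_udf = udf(getLastName, StringType())
scores.withColumn("last_name", getLastName_udf("student_id")).show(10)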
