How to pass the column of a second dataframe into a UDF in PySpark 1.6.1
Here's what I'm trying to do. I want to perform a comparison between each entry of two columns in two different dataframes. The dataframes are shown below:
>>> subject_df.show()
+------+-------------+
|USERID| FULLNAME|
+------+-------------+
| 12345| steve james|
| 12346| steven smith|
| 43212|bill dunnigan|
+------+-------------+
>>> target_df.show()
+------+-------------+
|USERID| FULLNAME|
+------+-------------+
|111123| steve tyler|
|422226| linda smith|
|123333|bill dunnigan|
| 56453| steve smith|
+------+-------------+
Here is the logic I tried using:
# CREATE FUNCTION
def string_match(subject, targets):
    for target in targets:
        <logic>
    return logic_result
# CREATE UDF
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

string_match_udf = udf(string_match, IntegerType())
# APPLY UDF
subject_df.select(subject_df.FULLNAME, string_match_udf(subject_df.FULLNAME, target_df.FULLNAME).alias("score"))
This is the error I get when running the code in a pyspark shell:
py4j.protocol.Py4JJavaError: An error occurred while calling o45.select.
: java.lang.RuntimeException: Invalid PythonUDF PythonUDF#string_match(FULLNAME#2,FULLNAME#5), requires attributes from more than one child.
I think the root of my problem is trying to pass the second column into the function. Should I be using RDDs instead? Keep in mind that the actual subject_df and target_df are both over 100,000 rows. I'm open to any advice.
It looks like you have a wrong idea of how user-defined functions work: a UDF can only operate on data from a single DataFrame. The only way to do what you want is to take a Cartesian product:
subject_df.join(target_df).select(
    f(subject_df.FULLNAME, target_df.FULLNAME)
)
where f is a function that compares two elements at a time.
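To see what that Cartesian product computes, here is a minimal plain-Python sketch of the same pairing over the sample data above. The scoring function string_match is a hypothetical stand-in (it just counts shared name tokens — the question left the actual comparison logic elided); in Spark you would wrap such a function with udf(..., IntegerType()) and apply it after the unconditioned join.

```python
from itertools import product

def string_match(subject, target):
    # Hypothetical comparison: count whitespace-separated tokens
    # that the two full names have in common.
    return len(set(subject.split()) & set(target.split()))

# subject_df.join(target_df) with no condition is a Cartesian product:
# every subject row is paired with every target row.
subjects = ["steve james", "steven smith", "bill dunnigan"]
targets = ["steve tyler", "linda smith", "bill dunnigan", "steve smith"]

scores = [(s, t, string_match(s, t)) for s, t in product(subjects, targets)]
# 3 subjects x 4 targets -> 12 scored pairs, e.g.
# ("bill dunnigan", "bill dunnigan") scores 2.
```

Note that with 100,000+ rows on each side the product has over 10 billion pairs, so in practice you would want to prune candidate pairs (e.g. by blocking on a cheap key) before scoring.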