Pyspark: multiplying columns from different tables
I have these two dataframes:
df1 = sc.parallelize([
['u1', 0.5],
['u2', 0.2],
['u3', 0.1],
['u4', 0.9],
['u5', 0.7]
]).toDF(('person', 'score'))
df2 = sc.parallelize([
['d1', 0.0],
['d2', 0.5],
['d3', 0.7],
]).toDF(('dog', 'score'))
What I need to do is create another dataframe whose schema would be
person, dog, score_person * score_dog
i.e. multiply the score column of both dataframes and keep the first two columns. This multiplication has to take place for every possible pair of factors, that is, each person with each dog, so that my result dataframe would have 15 rows.
I can't find a way to obtain this; it seems to me that it has to pass through a SELECT on both dataframes, but neither JOIN nor UNION can help.
Usually a Cartesian product is something to avoid, but a plain join with no on argument is all you need here:
df1.join(df2).select("person", "dog", (df1.score * df2.score).alias("product"))
Looks like this question is a few years old, but an explicit crossJoin method was added in version 2.1. Try:
df1.crossJoin(df2).select("person", "dog", (df1.score * df2.score).alias("product"))
Found information here: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.crossJoin
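If Spark isn't at hand for a quick sanity check, the same Cartesian-product-with-multiplication logic can be sketched in plain Python with itertools.product (the people and dogs lists below simply mirror the example data from the question):

```python
from itertools import product

# Stand-ins for df1 and df2 from the question.
people = [('u1', 0.5), ('u2', 0.2), ('u3', 0.1), ('u4', 0.9), ('u5', 0.7)]
dogs = [('d1', 0.0), ('d2', 0.5), ('d3', 0.7)]

# Cartesian product: every person paired with every dog,
# multiplying the two score columns.
rows = [(person, dog, p_score * d_score)
        for (person, p_score), (dog, d_score) in product(people, dogs)]

print(len(rows))  # 5 people x 3 dogs = 15 rows
print(rows[0])    # ('u1', 'd1', 0.0)
```

This is just the semantics of the cross join; on real data Spark distributes the work, which is exactly why unconstrained cross joins can be expensive.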