Pyspark: multiplying columns from different tables
I have these two dataframes:
df1 = sc.parallelize([
['u1', 0.5],
['u2', 0.2],
['u3', 0.1],
['u4', 0.9],
['u5', 0.7]
]).toDF(('person', 'score'))
df2 = sc.parallelize([
['d1', 0.0],
['d2', 0.5],
['d3', 0.7],
]).toDF(('dog', 'score'))
What I need to do is create another dataframe whose schema would be
person, dog, score_person * score_dog
i.e. multiply the score column of both dataframes and keep the first two columns. This multiplication has to take place for every possible pair of factors, that is, each person with each dog, so that my result dataframe would have 15 rows.
I can't find a way to obtain this; it seems to me that it has to pass through a SELECT on both dataframes, but neither JOIN nor UNION can help.
Usually a Cartesian product is something to avoid, but a plain join with no on argument is all you need here:
df1.join(df2).select("person", "dog", (df1.score * df2.score).alias("product"))
Looks like this question is a few years old, but an explicit crossJoin method was added in version 2.1. Try:
df1.crossJoin(df2).select("person", "dog", (df1.score * df2.score).alias("product"))
Found information here: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.crossJoin
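If Spark isn't at hand for a quick sanity check, the same Cartesian-product-with-multiplication logic can be sketched in plain Python with itertools.product (the people and dogs lists below simply mirror the example data from the question):

```python
from itertools import product

# Stand-ins for df1 and df2 from the question.
people = [('u1', 0.5), ('u2', 0.2), ('u3', 0.1), ('u4', 0.9), ('u5', 0.7)]
dogs = [('d1', 0.0), ('d2', 0.5), ('d3', 0.7)]

# Cartesian product: every person paired with every dog,
# multiplying the two score columns.
rows = [(person, dog, p_score * d_score)
        for (person, p_score), (dog, d_score) in product(people, dogs)]

print(len(rows))  # 5 people x 3 dogs = 15 rows
print(rows[0])    # ('u1', 'd1', 0.0)
```

This is just the semantics of the cross join; on real data Spark distributes the work, which is exactly why unconstrained cross joins can be expensive.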