
Pyspark: multiplying columns from different tables

I have these two dataframes:

df1 = sc.parallelize([
    ['u1', 0.5],
    ['u2', 0.2],
    ['u3', 0.1],
    ['u4', 0.9],
    ['u5', 0.7]
]).toDF(('person', 'score'))

df2 = sc.parallelize([
    ['d1', 0.0],
    ['d2', 0.5],
    ['d3', 0.7],
]).toDF(('dog', 'score'))

What I need to do is create another dataframe whose schema would be

person, dog, score_person * score_dog

so basically multiplying the score column of both dataframes and keeping the first two columns. This multiplication has to take place for each possible pair of factors, i.e. each person with each dog, so that my result dataframe would have 15 rows.

I can't find a way to obtain this; it seems to me that it has to pass through a SELECT on both dataframes, but neither a JOIN nor a UNION can help.

Usually a Cartesian product is something to avoid, but a simple join without the on parameter is all you need here:

df1.join(df2).select("person", "dog", (df1.score * df2.score).alias("product"))
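The condition-less join pairs every person with every dog, giving 5 × 3 = 15 rows. The same Cartesian product can be sketched in plain Python (using just the sample data from the question) to check what the result should contain:

```python
from itertools import product

# Sample data from the question, as (name, score) pairs.
persons = [('u1', 0.5), ('u2', 0.2), ('u3', 0.1), ('u4', 0.9), ('u5', 0.7)]
dogs = [('d1', 0.0), ('d2', 0.5), ('d3', 0.7)]

# Cartesian product: every person paired with every dog, scores multiplied,
# mirroring what the condition-less join computes.
rows = [(p, d, p_score * d_score)
        for (p, p_score), (d, d_score) in product(persons, dogs)]

print(len(rows))  # 15 rows: 5 persons x 3 dogs
print(rows[0])    # ('u1', 'd1', 0.0)
```

This is only a reference computation; on real data the Spark join above does the same thing in a distributed way.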

Looks like this question is a few years old, but an explicit crossJoin method was added in version 2.1. Try:

df1.crossJoin(df2).select("person", "dog", (df1.score * df2.score).alias("product"))

Found the information here: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.crossJoin

