PySpark: join two dataframes
Assuming I have two dataframes with different levels of information like this:
df1
Month Day Value
Jan Monday 65
Feb Monday 66
Mar Tuesday 68
Jun Monday 58
df2
Month Day Hour
Jan Monday 5
Jan Monday 5
Jan Monday 8
Feb Monday 9
Feb Monday 9
Feb Monday 9
Mar Tuesday 10
Mar Tuesday 1
Jun Tuesday 2
Jun Monday 7
Jun Monday 8
I want to join df1 with df2 and transfer the 'Value' information to df2, so that each hourly row gets the value recorded for its Month and Day.
Expected output:
final
Month Day Hour Value
Jan Monday 5 65
Jan Monday 5 65
Jan Monday 8 65
Feb Monday 9 66
Feb Monday 9 66
Feb Monday 9 66
Mar Tuesday 10 68
Mar Tuesday 1 68
Jun Monday 7 58
Jun Monday 8 58
This should be a simple join:
df2 = df2.join(df1, on=['Month', 'Day'], how='inner')
The join produces every combination of rows from the two dataframes whose join keys match. For example:
df1:
Month Day Value
Jan Monday 65
df2:
Month Day Hour
Jan Monday 5
Jan Monday 5
Because all entries match on Jan and Monday, all possible combinations will be part of the output:
Month Day Hour Value
Jan Monday 5 65
Jan Monday 5 65
Note: Whether you join df1 onto df2 or vice versa, and whether you use an inner or left join, depends on how you want to handle mismatches.