Join multiple data frames in PySpark
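I have the following data frames. Each one has two columns and exactly the same number of rows. How can I join them so that I get a single data frame with the two columns and all the rows from every data frame?
For example:
Data frame 1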
+--------------+-------------+
| colS | label |
+--------------+-------------+
| sample_0_URI | 0 |
| sample_0_URI | 0 |
+--------------+-------------+
Data frame 2
+--------------+-------------+
| colS | label |
+--------------+-------------+
| sample_1_URI | 1 |
| sample_1_URI | 1 |
+--------------+-------------+
Data frame 3
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_2_URI | 2 |
| sample_2_URI | 2 |
+--------------+-------------+
Data frame 4
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_3_URI | 3 |
| sample_3_URI | 3 |
+--------------+-------------+
...
I would like the result of the join to be:
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_0_URI | 0 |
| sample_0_URI | 0 |
| sample_1_URI | 1 |
| sample_1_URI | 1 |
| sample_2_URI | 2 |
| sample_2_URI | 2 |
| sample_3_URI | 3 |
| sample_3_URI | 3 |
+--------------+-------------+
Now, if I want to one-hot encode the label column, it should look something like this:
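from pyspark.ml.feature import OneHotEncoder
oe = OneHotEncoder(inputCol="label", outputCol="one_hot_label")
# df is the joined data frame <col1, label>.
df = oe.transform(df)  # on Spark 3.x OneHotEncoder is an Estimator: use oe.fit(df).transform(df)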
You are looking for union.
In this case, I would put the data frames in a list and use reduce:
from functools import reduce
dataframes = [df_1, df_2, df_3, df_4]
result = reduce(lambda first, second: first.union(second), dataframes)
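The same fold can be wrapped in a small helper. This is only a sketch: the name `union_all` is hypothetical and not part of PySpark, and it assumes every element of the list exposes a `union` method, as PySpark DataFrames do. Note that `DataFrame.union` matches columns by position rather than by name and keeps duplicate rows, so the differing header names in the example tables (`colS` vs `col1`) do not stop the union, and the result takes its column names from the first data frame.

```python
from functools import reduce


def union_all(dataframes):
    """Stack DataFrames vertically by folding DataFrame.union over the list.

    Hypothetical helper for illustration; works on any objects with .union().
    """
    if not dataframes:
        # reduce() raises TypeError on an empty sequence with no initial
        # value, so fail early with a clearer message instead.
        raise ValueError("union_all needs at least one DataFrame")
    # union() resolves columns by position, not by name, and does not
    # remove duplicate rows.
    return reduce(lambda first, second: first.union(second), dataframes)
```

It would be called as `result = union_all([df_1, df_2, df_3, df_4])`. If the data frames might list the same columns in a different order, `DataFrame.unionByName` (available since Spark 2.3) is probably the safer method to fold over, since it matches columns by name instead.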