Joining multiple data frames in PySpark

Question

I have several data frames, each with the same two columns and exactly the same number of rows. How do I combine them into a single data frame that has those two columns and all the rows from every input data frame?

For example:

DataFrame-1

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_0_URI |  0          |
| sample_0_URI |  0          |
+--------------+-------------+

DataFrame-2

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_1_URI |  1          |
| sample_1_URI |  1          |
+--------------+-------------+

DataFrame-3

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_2_URI |  2          |
| sample_2_URI |  2          |
+--------------+-------------+

DataFrame-4

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_3_URI |  3          |
| sample_3_URI |  3          |
+--------------+-------------+

...

I want the result to be:

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_0_URI |  0          |
| sample_0_URI |  0          |
| sample_1_URI |  1          |
| sample_1_URI |  1          |
| sample_2_URI |  2          |
| sample_2_URI |  2          |
| sample_3_URI |  3          |
| sample_3_URI |  3          |
+--------------+-------------+

Now, if I want to one-hot encode the label column, should it be something like this:

from pyspark.ml.feature import OneHotEncoder

oe = OneHotEncoder(inputCol="label", outputCol="one_hot_label")
df = oe.transform(df)  # df is the combined data frame <cols, label>

Answer 1

You are looking for union. It appends the rows of one data frame to another, which is what you want here, rather than a join.

In this case, I would put the data frames in a list and use reduce:

from functools import reduce

dataframes = [df_1, df_2, df_3, df_4]

result = reduce(lambda first, second: first.union(second), dataframes)

Answer 1: accepted, 2019-06-12 06:53:09
