Joining multiple DataFrames in PySpark

Question

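I have the following DataFrames. Each one has two columns, and they all have exactly the same number of rows. How can I join them so that I get a single DataFrame with the two columns and all the rows from every DataFrame?

For example:

DataFrame 1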

+--------------+-------------+
| colS         |  label      |
+--------------+-------------+
| sample_0_URI |  0          |
| sample_0_URI |  0          |
+--------------+-------------+

DataFrame 2

+--------------+-------------+
| colS         |  label      |
+--------------+-------------+
| sample_1_URI |  1          |
| sample_1_URI |  1          |
+--------------+-------------+

DataFrame 3

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_2_URI |  2          |
| sample_2_URI |  2          |
+--------------+-------------+

DataFrame 4

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_3_URI |  3          |
| sample_3_URI |  3          |
+--------------+-------------+

...

I would like the result of the join to be:

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_0_URI |  0          |
| sample_0_URI |  0          |
| sample_1_URI |  1          |
| sample_1_URI |  1          |
| sample_2_URI |  2          |
| sample_2_URI |  2          |
| sample_3_URI |  3          |
| sample_3_URI |  3          |
+--------------+-------------+

Then, if I want to one-hot encode the label column, it should look like this:

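from pyspark.ml.feature import OneHotEncoder
oe = OneHotEncoder(inputCol="label", outputCol="one_hot_label")
# On Spark 3.0+, OneHotEncoder is an Estimator: use oe.fit(df).transform(df) instead
df = oe.transform(df)  # df is the joined DataFrame <cols, label>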

Answer 1

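You are looking for union, which appends the rows of one DataFrame to those of another.

In this case, what I would do is put the DataFrames in a list and use reduce: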

from functools import reduce

dataframes = [df_1, df_2, df_3, df_4]

result = reduce(lambda first, second: first.union(second), dataframes)
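One detail worth keeping in mind with the reduce approach: union matches columns by position, not by name, and the result keeps the column names of the first DataFrame. In the question the first two frames call their first column colS while the others call it col1, so the combined frame would come out with colS unless it is renamed. The following is a minimal sketch of how that could be wrapped up; union_all and demo are hypothetical helper names introduced here for illustration, and the local SparkSession settings are assumptions rather than anything from the original post.

```python
from functools import reduce


def union_all(dataframes, column_names=None):
    """Stack DataFrames row-wise with DataFrame.union (hypothetical helper).

    union() matches columns by position and keeps the first frame's column
    names, so column_names can optionally rename the result via toDF().
    """
    if not dataframes:
        raise ValueError("need at least one DataFrame")
    combined = reduce(lambda first, second: first.union(second), dataframes)
    if column_names is not None:
        combined = combined.toDF(*column_names)
    return combined


def demo():
    """Rebuild the four sample frames from the question and combine them."""
    # Imported lazily so union_all itself does not require pyspark at import time.
    from pyspark.sql import SparkSession

    # Assumed local session settings, for illustration only.
    spark = SparkSession.builder.master("local[1]").appName("union-demo").getOrCreate()
    frames = [
        spark.createDataFrame(
            [(f"sample_{i}_URI", i)] * 2,
            ["colS" if i < 2 else "col1", "label"],
        )
        for i in range(4)
    ]
    result = union_all(frames, column_names=["col1", "label"])
    result.show()
    spark.stop()
```

Calling demo() on a machine with pyspark installed should print the eight combined rows under the col1 and label headers. If the frames shared the same column names but in a different order, unionByName would probably be the safer choice, since it matches columns by name; with the mismatched colS and col1 names shown in the question it would raise an error instead.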
