简体   繁体   中英

Join multiple data frame in PySpark

I have the following few data frames which have two columns each and have exactly the same number of rows. How do I join them so that I get a single data frame which has the two columns and all rows from both the data frames?

For example:

DataFrame-1

+--------------+-------------+
| colS         |  label      |
+--------------+-------------+
| sample_0_URI |  0          |
| sample_0_URI |  0          |
+--------------+-------------+

DataFrame-2

+--------------+-------------+
| colS         |  label      |
+--------------+-------------+
| sample_1_URI |  1          |
| sample_1_URI |  1          |
+--------------+-------------+

DataFrame-3

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_2_URI |  2          |
| sample_2_URI |  2          |
+--------------+-------------+

DataFrame-4

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_3_URI |  3          |
| sample_3_URI |  3          |
+--------------+-------------+

...

I want the result of the join to be:

+--------------+-------------+
| col1         |  label      |
+--------------+-------------+
| sample_0_URI |  0          |
| sample_0_URI |  0          |
| sample_1_URI |  1          |
| sample_1_URI |  1          |
| sample_2_URI |  2          |
| sample_2_URI |  2          |
| sample_3_URI |  3          |
| sample_3_URI |  3          |
+--------------+-------------+

Now, if I want to do one-hot encoding for label column, should it something like this:

oe = OneHotEncoder(inputCol="label",outputCol="one_hot_label")
df = oe.transform(df) # df is the joined dataframes <cols, label>

You are looking for union .

In this case, what I would do is put the dataframes in a list and use reduce :

from functools import reduce

dataframes = [df_1, df_2, df_3, df_4]

result = reduce(lambda first, second: first.union(second), dataframes)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM