I have the following few data frames which have two columns each and have exactly the same number of rows. How do I join them so that I get a single data frame which has the two columns and all rows from both the data frames?
For example:
DataFrame-1
+--------------+-------------+
| colS | label |
+--------------+-------------+
| sample_0_URI | 0 |
| sample_0_URI | 0 |
+--------------+-------------+
DataFrame-2
+--------------+-------------+
| colS | label |
+--------------+-------------+
| sample_1_URI | 1 |
| sample_1_URI | 1 |
+--------------+-------------+
DataFrame-3
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_2_URI | 2 |
| sample_2_URI | 2 |
+--------------+-------------+
DataFrame-4
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_3_URI | 3 |
| sample_3_URI | 3 |
+--------------+-------------+
...
I want the result of the join to be:
+--------------+-------------+
| col1 | label |
+--------------+-------------+
| sample_0_URI | 0 |
| sample_0_URI | 0 |
| sample_1_URI | 1 |
| sample_1_URI | 1 |
| sample_2_URI | 2 |
| sample_2_URI | 2 |
| sample_3_URI | 3 |
| sample_3_URI | 3 |
+--------------+-------------+
Now, if I want to do one-hot encoding for label column, should it something like this:
oe = OneHotEncoder(inputCol="label",outputCol="one_hot_label")
df = oe.transform(df) # df is the joined dataframes <cols, label>
You are looking for union
.
In this case, what I would do is put the dataframes in a list
and use reduce
:
from functools import reduce
dataframes = [df_1, df_2, df_3, df_4]
result = reduce(lambda first, second: first.union(second), dataframes)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.