
Spark: merge two dataframes

Let me explain my issue with an example.

table_name

col1 col2 col3
1     aaa  xxx
1     bba  yyy
2     ccc  yyy

I want my final output to look like this:

table_name

col1 col2 col3
1     aab  xxx
1     bbc  yyy
2     cc   yyy

I have a table with 3 columns (it has no primary key). I had to change the data in the 2nd of those 3 columns, and now I want to replace the old column with the new one.

df1 = hc.sql("select col1 from table_name")
df2 = hc.sql("select col2 from table_name")
df3 = hc.sql("select col3 from table_name")

What are the ways to do this?

As there is no primary key, I felt this could be a way:

I can add row numbers to the dataframes (the row number would be a new column in each dataframe) and join all the dataframes on that row number. The issue here is that when I run "select col1 from table_name" and "select col2 from table_name", there is no guarantee that the two queries will return the rows in the same order (please correct me if I am wrong). A sketch of this idea follows below.
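Here is a minimal sketch of that idea in PySpark, assuming the HiveContext hc and the df1/df2/df3 frames from the snippets above; add_index is a hypothetical helper, not a built-in. The caveat just mentioned still applies: zipWithIndex numbers rows in each dataframe's current order, so the join only lines up if all three queries read the rows in the same order.

def add_index(df):
    # pair each row with its 0-based position, then flatten back to a dataframe
    return df.rdd.zipWithIndex().map(
        lambda pair: tuple(pair[0]) + (pair[1],)
    ).toDF(df.columns + ["row_idx"])

df1_idx = add_index(df1)
df2_idx = add_index(df2)
df3_idx = add_index(df3)

merged = (df1_idx
          .join(df2_idx, "row_idx")
          .join(df3_idx, "row_idx")
          .drop("row_idx"))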

To make it simple, I just want to do the following in PySpark:

table1

col_1
a
b
c

table2

col_2
d
e
f

into:

table_1_2

col1 col2
a    d
b    e
c    f
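Using the same hypothetical add_index helper from the sketch above, the two single-column tables can be zipped row by row through a shared index:

table_1_2 = (add_index(table1)
             .join(add_index(table2), "row_idx")
             .orderBy("row_idx")
             .drop("row_idx")
             .withColumnRenamed("col_1", "col1")
             .withColumnRenamed("col_2", "col2"))

An alternative is table1.rdd.zip(table2.rdd), but RDD.zip requires both RDDs to have the same number of partitions and the same number of elements per partition, which the index-and-join route avoids.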

You can add a column to an existing dataframe with withColumn. You define a UDF which will transform your col2 into the new column. Can you describe the transformation you want to do? So in your case:

table1 = table1.withColumn("newCol2", udf(table1["col2"]))
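A minimal runnable sketch of this suggestion, assuming the HiveContext hc from the question; transform_col2 is a hypothetical placeholder, since the actual transformation isn't spelled out:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# hypothetical stand-in for whatever change col2 actually needs
def transform_col2(value):
    return value.upper() if value is not None else None

transform_udf = udf(transform_col2, StringType())

df = hc.table("table_name")
# reusing the existing column name makes withColumn replace col2 in
# place, so no separate step is needed to swap out the old column
df = df.withColumn("col2", transform_udf(df["col2"]))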
