Let me explain my issue with an example:
table_name
col1  col2  col3
1     aaa   xxx
1     bba   yyy
2     ccc   yyy
I want my final output to be like:
table_name
col1  col2  col3
1     aab   xxx
1     bbc   yyy
2     cc    yyy
I have a table with 3 columns (it has no primary key). I had to change the data in the 2nd of those 3 columns, and now I want to replace the old column with this new 2nd column.
df1 = hc.sql("select col1 from table_name")
df2 = hc.sql("select col2 from table_name")
df3 = hc.sql("select col3 from table_name")
What are the ways to do this?
Since there is no primary key, I thought of this approach: add a row number to each dataframe (the row number would be a new column in the dataframe) and join all the dataframes on that row number. The issue is that when I run "select col1 from table_name" and "select col2 from table_name", there is no guarantee that the two queries return their rows in the same order (please correct me if I am wrong).
To make it simple, I just want to do the following in PySpark:
table1
col_1
a
b
c
table2
col_2
d
e
f
into:
table_1_2
col1  col2
a     d
b     e
c     f
You can add a column to an existing dataframe with withColumn. You define a UDF that transforms your col2 into the new column. Can you describe the transformation you want to do? So in your case:
table1 = table1.withColumn("newCol2", my_udf(table1["col2"]))