
How to merge multiple columns into a single column?

I have a table which has columns [col1, col2, col3, ..., col9]. I want to merge all of the columns' data into one column, col, in Python.

from pyspark.sql.functions import concat

# Build a sample DataFrame with four string columns
values = [('A','B','C','D'),('E','F','G','H'),('I','J','K','L')]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4'])
df.show()

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   A|   B|   C|   D|
|   E|   F|   G|   H|
|   I|   J|   K|   L|
+----+----+----+----+

# concat(*req_column) joins the listed columns into one string column
req_column = ['col1','col2','col3','col4']
df = df.withColumn('concatenated_cols',concat(*req_column))
df.show()

+----+----+----+----+-----------------+
|col1|col2|col3|col4|concatenated_cols|
+----+----+----+----+-----------------+
|   A|   B|   C|   D|             ABCD|
|   E|   F|   G|   H|             EFGH|
|   I|   J|   K|   L|             IJKL|
+----+----+----+----+-----------------+
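If you want a separator between the values, concat_ws from pyspark.sql.functions takes a separator as its first argument; a minimal sketch building on the df above:

from pyspark.sql.functions import concat_ws

# concat_ws('-', ...) joins the columns with '-' between the values
df = df.withColumn('concatenated_cols', concat_ws('-', *req_column))
df.show()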

Using Spark SQL

# Register the DataFrame as a temp view so it can be queried by name
df.createOrReplaceTempView("df")
new_df = sqlContext.sql("SELECT CONCAT(col1,col2,col3,col4) AS concatenated_cols FROM df")

Without Spark SQL, you can use the concat function directly:

from pyspark.sql.functions import col, concat

new_df = df.withColumn('joined_column', concat(col('col1'), col('col2'), col('col3'), col('col4')))

In Spark (PySpark), DataFrames are immutable, so existing data cannot be edited in place. What you can do is create a new column. Please check the following link:

How do I add a new column to a Spark DataFrame (using PySpark)?
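For illustration, a minimal sketch of that idea (the upper transformation here is only an assumed example): withColumn returns a new DataFrame rather than modifying df in place.

from pyspark.sql.functions import col, upper

# withColumn returns a NEW DataFrame; the original df is unchanged
df2 = df.withColumn('col1_upper', upper(col('col1')))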

Using a UDF, you can aggregate/combine all of those values in a row and return them as a single value.
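A minimal sketch of that UDF approach, assuming the df and req_column from above (the UDF name combine_udf and the use of struct to pass the row are assumptions, not from the answer):

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType

# Hypothetical UDF: receives the selected columns as a struct (a Row)
# and joins their string values, skipping NULLs
combine_udf = udf(
    lambda row: ''.join(str(v) for v in row if v is not None),
    StringType()
)
df = df.withColumn('combined', combine_udf(struct(*req_column)))

Note that built-in functions such as concat are generally faster than Python UDFs, which serialize data between the JVM and Python.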

A few cautions: look out for the following data issues while aggregating (a sketch addressing the first two follows the list).

  1. Null values
  2. Type mismatches
  3. String encoding issues
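A minimal sketch for the null and type issues, assuming the df and req_column from above: concat returns NULL as soon as any input is NULL, whereas concat_ws skips NULL inputs, and casting each column to string avoids type mismatches.

from pyspark.sql.functions import concat_ws, col

# concat_ws skips NULL inputs (plain concat would return NULL instead);
# casting to string avoids type-mismatch errors on non-string columns
df = df.withColumn(
    'safe_concat',
    concat_ws('', *[col(c).cast('string') for c in req_column])
)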
