Adding a column to a dataframe in pyspark

Question

I want to add a new column to a dataframe based on the value of an existing column using pyspark.

For example, if this is the original dataframe, I want to add a new column called "parent's data", which contains the data of the parent based on the column "parent_id", so that the resulting dataframe looks like below.

Any help would be appreciated. Thank you.

Answer 1

I am sure there are multiple ways to achieve this. However, the simplest way is to create a second dataframe from two columns of the existing dataframe (id and data), then join the two dataframes on parent_id = id.

Here is the code:

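import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the sample data in pandas and convert it to a Spark dataframe
df1 = pd.DataFrame([[1, 'a', 2], [2, 'b', 3], [3, 'c', 1]], columns=["id", "data", "parent_id"])
print(df1)
sparkdf = spark.createDataFrame(df1)
sparkdf.show()

# Lookup table: one row per id with its data
sparkdf2 = sparkdf.select('id', 'data')
sparkdf2.show()

# registerTempTable is deprecated since Spark 2.0; use createOrReplaceTempView
sparkdf.createOrReplaceTempView("sparkdf")
sparkdf2.createOrReplaceTempView("sparkdf2")

# Alias b.data so the result does not contain two columns named "data"
sparkdf3 = spark.sql('select a.id, a.data, a.parent_id, b.data as parent_data from sparkdf as a join sparkdf2 as b on a.parent_id = b.id')
sparkdf3.show()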

Answer 2

You can create a dictionary from your two columns, id and data, and then add your new column using withColumn. Note that collect() brings every row to the driver, so this approach suits small dataframes:

>>> d = {row["id"]: row["data"] for row in df.collect()}
>>> d
{1: 'a', 2: 'b', 3: 'c'}

from itertools import chain
from pyspark.sql.functions import create_map, lit

m = create_map([lit(x) for x in chain(*d.items())])
df = df.withColumn('parent_data', m[df['parent_id']])

The resulting dataframe looks like this:

>>> df.show(truncate=False)

+---+----+---------+-----------+
|id |data|parent_id|parent_data|
+---+----+---------+-----------+
|1  |a   |2        |b          |
|2  |b   |3        |c          |
|3  |c   |1        |a          |
+---+----+---------+-----------+
