
How to map a column to create a new column in a Spark SQL dataframe?

In python and pandas, I can create a new column like this:

First, use two columns of a pandas dataframe to create a dict:

 dict1 = dict(zip(data["id"], data["duration"]))

Then I can apply this dict to create a new column in a second dataframe.

df['id_duration'] = df['id'].map(lambda x: dict1.get(x, -1))
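(As an aside, the dict step can be avoided in pandas itself with a left merge plus `fillna`, which mirrors the Spark answer below. A minimal sketch, assuming small example frames `data` and `df`:)

```python
import pandas as pd

data = pd.DataFrame({"id": [1, 2, 3], "duration": [10, 20, 30]})
df = pd.DataFrame({"id": [1, 3, 5]})

# Left merge carries duration over; ids absent from `data` become NaN,
# which fillna then replaces with -1.
df = df.merge(data.rename(columns={"duration": "id_duration"}),
              on="id", how="left")
df["id_duration"] = df["id_duration"].fillna(-1).astype(int)
```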

How can I create a new column id_duration in a Spark SQL dataframe, given a dataframe data (with two columns: id and duration) and a dataframe df (with a column id)?

Using a dictionary would be wasteful here: you would need to collect the entire dataframe data onto the driver, which is bad for performance and could cause an OOM error.

Instead, you can simply perform a left outer join between the two dataframes and use na.fill to replace the resulting nulls with -1.

data = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ['id', 'duration'])
df = spark.createDataFrame([(1, 2), (3, 4), (5, 6)], ['id', 'x'])

df\
    .join(data.withColumnRenamed("duration", "id_duration"), ['id'], 'left')\
    .na.fill(-1).show()
+---+---+-----------+
| id|  x|id_duration|
+---+---+-----------+
|  1|  2|         10|
|  3|  4|         30|
|  5|  6|         -1|
+---+---+-----------+
