
How to map a column to create a new column in a Spark SQL dataframe?

In python and pandas, I can create a new column like this:

First, use two columns of a pandas dataframe to create a dict:

 dict1 = dict(zip(data["id"], data["duration"]))

Then I can apply this dict to create a new column in a second dataframe.

df['id_duration'] = df['id'].map(lambda x: dict1.get(x, -1))
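(As an aside, the dict step can be avoided in pandas itself with a left merge plus `fillna`, which mirrors the Spark answer below. A minimal sketch, assuming small example frames `data` and `df`:)

```python
import pandas as pd

data = pd.DataFrame({"id": [1, 2, 3], "duration": [10, 20, 30]})
df = pd.DataFrame({"id": [1, 3, 5]})

# Left merge carries duration over; ids absent from `data` become NaN,
# which fillna then replaces with -1.
df = df.merge(data.rename(columns={"duration": "id_duration"}),
              on="id", how="left")
df["id_duration"] = df["id_duration"].fillna(-1).astype(int)
```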

How can I create a new column id_duration in a Spark SQL dataframe, given a dataframe data (with two columns: id and duration) and a dataframe df (with a column id)?

Using a dictionary would be wasteful here: you would need to collect the entire dataframe data onto the driver, which is bad for performance and could cause an OOM error.

Instead, you can simply perform a left outer join between the two dataframes and use na.fill to replace the resulting nulls with -1.

data = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ['id', 'duration'])
df = spark.createDataFrame([(1, 2), (3, 4), (5, 6)], ['id', 'x'])

df\
    .join(data.withColumnRenamed("duration", "id_duration"), ['id'], 'left')\
    .na.fill(-1).show()
+---+---+-----------+
| id|  x|id_duration|
+---+---+-----------+
|  1|  2|         10|
|  3|  4|         30|
|  5|  6|         -1|
+---+---+-----------+
