In Python and pandas, I can create a new column as follows. First I use two columns of a dataframe data to build a dict:
dict1 = dict(zip(data["id"], data["duration"]))
Then I apply this dict to create a new column in a second dataframe:
df['id_duration'] = df['id'].map(lambda x: dict1[x] if x in dict1 else -1)
How can I create a new column id_duration in a Spark SQL dataframe, given a dataframe data (with two columns, id and duration) and a dataframe df (with a column id)?
Using a dictionary would be a poor choice here, because you would need to collect the entire dataframe data onto the driver, which is bad for performance and could cause an OOM error. Instead, perform a left outer join between the two dataframes and use na.fill to replace the resulting nulls with -1.
data = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ['id', 'duration'])
df = spark.createDataFrame([(1, 2), (3, 4), (5, 6)], ['id', 'x'])
df\
.join(data.withColumnRenamed("duration", "id_duration"), ['id'], 'left')\
.na.fill(-1).show()
+---+---+-----------+
| id| x|id_duration|
+---+---+-----------+
| 5| 6| -1|
| 1| 2| 10|
| 3| 4| 30|
+---+---+-----------+
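Incidentally, the same join-and-fill pattern also works in pandas, so the dict from the original snippet can be avoided there as well. A minimal sketch, assuming the same column names and sample values as above:

```python
import pandas as pd

# Lookup table: id -> duration (same shape as the Spark `data` frame above)
data = pd.DataFrame({"id": [1, 2, 3], "duration": [10, 20, 30]})
df = pd.DataFrame({"id": [1, 3, 5], "x": [2, 4, 6]})

# Left join on id, then fill ids missing from `data` with -1
out = df.merge(
    data.rename(columns={"duration": "id_duration"}),
    on="id",
    how="left",
)
out["id_duration"] = out["id_duration"].fillna(-1).astype(int)
print(out)
```

As in the Spark version, ids absent from data (here, id 5) end up with id_duration = -1 instead of raising a KeyError.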