Merge two Spark dataframes using array values

Question

I have two Spark dataframes, shown below:

> cities_df

+----------+---------------------------+
|   city_id|                     cities|           
+----------+---------------------------+
|       22 |[Milan, Turin, Rome]       |
+----------+---------------------------+
|       15 |[Naples, Florence, Genoa]  |
+----------+---------------------------+
|       43 |[Houston, San Jose, Boston]|
+----------+---------------------------+
|       56 |[New York, Dallas, Chicago]|
+----------+---------------------------+


> countries_df

+----------+----------------------------------+
|country_id|                         countries|           
+----------+----------------------------------+
|      680 |{'country': [56, 43], 'add': []}  |
+----------+----------------------------------+
|       11 |{'country': [22, 15], 'add': [32]}|
+----------+----------------------------------+

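The country values in the countries_df dataframe are city IDs from the cities_df dataframe.

I need to merge these dataframes so that the city IDs under country are replaced with the corresponding values from the cities_df dataframe.

Expected output: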

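+----------+----------------------------------+------------------------------------------------------+
|country_id|                         countries|                                        grouped_cities|
+----------+----------------------------------+------------------------------------------------------+
|      680 |{'country': [56, 43], 'add': []}  |[New York, Dallas, Chicago, Houston, San Jose, Boston]|
+----------+----------------------------------+------------------------------------------------------+
|       11 |{'country': [22, 15], 'add': [32]}|[Milan, Turin, Rome, Naples, Florence, Genoa]         |
+----------+----------------------------------+------------------------------------------------------+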

The resulting grouped_cities value does not have to be an array type; it can be a string.

How can I get this result using PySpark?

Answer 1

Inputs:

from pyspark.sql import functions as F
cities_df = spark.createDataFrame(
    [(22, ['Milan', 'Turin', 'Rome']),
     (15, ['Naples', 'Florence', 'Genoa']),
     (43, ['Houston', 'San Jose', 'Boston']),
     (56, ['New York', 'Dallas', 'Chicago'])],
    ['city_id', 'cities']
)
countries_df = spark.createDataFrame(
    [(680, {'country': [56, 43], 'add': []}),
     (11, {'country': [22, 15], 'add': [32]})],
    ['country_id', 'countries']
)

Script:

df_expl = countries_df.withColumn('city_id', F.explode('countries.country'))
df_joined = df_expl.join(cities_df, 'city_id', 'left')
df = df_joined.groupBy('country_id').agg(
    F.first('countries').alias('countries'),
    F.flatten(F.collect_list('cities')).alias('grouped_cities')
)
df.show(truncate=0)
# +----------+----------------------------------+------------------------------------------------------+
# |country_id|countries                         |grouped_cities                                        |
# +----------+----------------------------------+------------------------------------------------------+
# |11        |{add -> [32], country -> [22, 15]}|[Naples, Florence, Genoa, Milan, Turin, Rome]         |
# |680       |{add -> [], country -> [56, 43]}  |[Houston, San Jose, Boston, New York, Dallas, Chicago]|
# +----------+----------------------------------+------------------------------------------------------+

Answer 2

Another approach. Use select to create a new column on countries_df. Group by country_id and by the countries column cast to string. Code below.

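from pyspark.sql.functions import col, collect_set, explode, flatten

# Explode the city IDs, join them to cities_df, then group by country_id
# and the countries column cast to string so it can serve as a grouping key.
new = (
    cities_df
    .join(countries_df.select('*', explode('countries.country').alias('city_id')),
          how='left', on='city_id')
    .groupby('country_id', col('countries').cast('string').alias('countries'))
    .agg(flatten(collect_set('cities')).alias('cities'))
)
# show() returns None, so call it separately instead of assigning its result
new.show(truncate=False)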


+----------+----------------------------------+------------------------------------------------------+
|country_id|countries                         |cities                                                |
+----------+----------------------------------+------------------------------------------------------+
|11        |{add -> [32], country -> [22, 15]}|[Milan, Turin, Rome, Naples, Florence, Genoa]         |
|680       |{add -> [], country -> [56, 43]}  |[New York, Dallas, Chicago, Houston, San Jose, Boston]|
+----------+----------------------------------+------------------------------------------------------+

Answer 1
2, accepted, 2022-05-24 19:12:29

Answer 2
2, 2022-05-24 23:37:02
