String matching between two pyspark dataframe columns
I have a dataframe cities :
country  cities
UK       [London, London Luton, Luton]
UK       [London, London Gatwick, Gatwick]
and a reference dataframe airports :
city    airport         coords
London  London Luton    12.51
London  London Gatwick  100.32
I want to match the list of values in the cities column against the airport column of the reference df. Where there is a match, I want to pull the matching airport name and coordinates from the reference df into the cities df.
Sample of the desired output for the cities df:
country  cities                             airport         coords
UK       [London, London Luton, Luton]      London Luton    12.51
UK       [London, London Gatwick, Gatwick]  London Gatwick  100.32
Explanation: [London, **London Luton**, Luton] from cities matches **London Luton** in airport.
I have explored some options but couldn't really get there. Can anyone help? Thanks
You can use array_contains to flag the rows whose cities array contains the airport string, and then filter down to the rows where that flag is True :
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

sql = SparkSession.builder.getOrCreate()

d1 = {
    'cities': [
        ['London', 'London Luton', 'Luton'],
        ['London', 'London Gatwick', 'Gatwick']
    ],
    'country': ['UK', 'UK']
}
d2 = {
    'country': ['UK', 'UK'],
    'city': ['London', 'London'],
    'airport': ['London Luton', 'London Gatwick'],
    'coords': [12.51, 100.32]
}

sparkDF1 = sql.createDataFrame(pd.DataFrame(d1))
sparkDF2 = sql.createDataFrame(pd.DataFrame(d2))
sparkDF1.show(truncate=False)
+---------------------------------+-------+
|cities |country|
+---------------------------------+-------+
|[London, London Luton, Luton] |UK |
|[London, London Gatwick, Gatwick]|UK |
+---------------------------------+-------+
sparkDF2.show()
+-------+------+--------------+------+
|country| city| airport|coords|
+-------+------+--------------+------+
| UK|London| London Luton| 12.51|
| UK|London|London Gatwick|100.32|
+-------+------+--------------+------+
finalDF = sparkDF1.join(sparkDF2
,sparkDF1['country'] == sparkDF2['country']
,'inner'
).select(sparkDF1['*'],sparkDF2['airport'])
finalDF = finalDF.withColumn('flag',F.array_contains( F.col('cities'),F.col('airport') ) )
finalDF.filter(F.col('flag') == True).show(truncate=False)
+---------------------------------+-------+--------------+----+
|cities |country|airport |flag|
+---------------------------------+-------+--------------+----+
|[London, London Luton, Luton] |UK |London Luton |true|
|[London, London Gatwick, Gatwick]|UK |London Gatwick|true|
+---------------------------------+-------+--------------+----+
You could create a new column on the cities table holding the airport name, and then simply merge the two tables on the airport column.
Using the prep code from @Vaebhav:
# note: this assumes the airport name is always the second element of cities
sparkDF1 = sparkDF1.withColumn("airport", F.col("cities")[1])
sparkDF1.show(truncate=False)
+---------------------------------+-------+--------------+
|cities |country|airport |
+---------------------------------+-------+--------------+
|[London, London Luton, Luton] |UK |London Luton |
|[London, London Gatwick, Gatwick]|UK |London Gatwick|
+---------------------------------+-------+--------------+
finalDF = sparkDF1.join(sparkDF2, on="airport", how="right")
finalDF.show(truncate=False)
+--------------+---------------------------------+-------+-------+------+------+
|airport |cities |country|country|city |coords|
+--------------+---------------------------------+-------+-------+------+------+
|London Luton |[London, London Luton, Luton] |UK |UK |London|12.51 |
|London Gatwick|[London, London Gatwick, Gatwick]|UK |UK |London|100.32|
+--------------+---------------------------------+-------+-------+------+------+