String matching between two pyspark dataframe columns

I have a dataframe, cities:

country      cities
  UK        [London,London Luton, Luton]
  UK        [London,London Gatwick, Gatwick]

and a reference dataframe, airports:

city         airport            coords
London        London Luton       12.51
London        London Gatwick     100.32

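I want to match the list of values in the cities column against the airport column in the reference df. If there is a match, I want to pull the corresponding airport name and coords from the reference df.

Sample of the desired cities df output: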

country      cities                                airport            coords
  UK        [London,London Luton, Luton]           London Luton       12.51
  UK        [London,London Gatwick, Gatwick]       London Gatwick     100.32

Explanation:

[London,**London Luton**, Luton] from cities matches **London Luton** in airports.

I have explored a few options but couldn't really get it to work. Can anyone help? Thanks

You can use array_contains to flag the rows whose array contains the search string, and then filter for the rows that return True.

Data Preparation

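import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The original snippet relied on a pre-existing `sql` context;
# a SparkSession is created here so the code runs on its own
spark = SparkSession.builder.getOrCreate()

d1 = {
    'cities': [
        ['London', 'London Luton', 'Luton'],
        ['London', 'London Gatwick', 'Gatwick'],
    ],
    'country': ['UK', 'UK'],
}

d2 = {
    'country': ['UK', 'UK'],
    'city': ['London', 'London'],
    'airport': ['London Luton', 'London Gatwick'],
    'coords': [12.51, 100.32],
}

sparkDF1 = spark.createDataFrame(pd.DataFrame(d1))
sparkDF2 = spark.createDataFrame(pd.DataFrame(d2))

sparkDF1.show(truncate=False)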

+---------------------------------+-------+
|cities                           |country|
+---------------------------------+-------+
|[London, London Luton, Luton]    |UK     |
|[London, London Gatwick, Gatwick]|UK     |
+---------------------------------+-------+

sparkDF2.show()

+-------+------+--------------+------+
|country|  city|       airport|coords|
+-------+------+--------------+------+
|     UK|London|  London Luton| 12.51|
|     UK|London|London Gatwick|100.32|
+-------+------+--------------+------+

Array Contains

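# Pair every cities row with every airport row in the same country
finalDF = sparkDF1.join(
    sparkDF2,
    sparkDF1['country'] == sparkDF2['country'],
    'inner'
).select(sparkDF1['*'], sparkDF2['airport'])

# Flag rows whose cities array contains the airport name as an element
finalDF = finalDF.withColumn('flag', F.array_contains(F.col('cities'), F.col('airport')))

# Keep only the flagged rows
finalDF.filter(F.col('flag')).show(truncate=False)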

+---------------------------------+-------+--------------+----+
|cities                           |country|airport       |flag|
+---------------------------------+-------+--------------+----+
|[London, London Luton, Luton]    |UK     |London Luton  |true|
|[London, London Gatwick, Gatwick]|UK     |London Gatwick|true|
+---------------------------------+-------+--------------+----+
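
The join-then-filter logic above can be modelled in plain Python to see exactly which rows survive and to carry coords along (the question asked for it, but the select above keeps only airport). This is a sketch of the semantics only, not Spark code, and the helper name `match_airports` is hypothetical. Membership here, like array_contains, is exact element equality, so 'London' on its own does not match 'London Luton'.

```python
# Plain-Python model of: inner join on country, then keep rows where
# the airport name is an element of the cities list.
cities_rows = [
    {"country": "UK", "cities": ["London", "London Luton", "Luton"]},
    {"country": "UK", "cities": ["London", "London Gatwick", "Gatwick"]},
]
airport_rows = [
    {"country": "UK", "city": "London", "airport": "London Luton", "coords": 12.51},
    {"country": "UK", "city": "London", "airport": "London Gatwick", "coords": 100.32},
]


def match_airports(cities_rows, airport_rows):
    """Hypothetical helper: pair each cities row with the airports it lists."""
    matched = []
    for c in cities_rows:
        for a in airport_rows:
            same_country = c["country"] == a["country"]  # the join condition
            listed = a["airport"] in c["cities"]         # exact element match
            if same_country and listed:
                matched.append({**c, "airport": a["airport"], "coords": a["coords"]})
    return matched


result = match_airports(cities_rows, airport_rows)
for row in result:
    print(row["country"], row["cities"], row["airport"], row["coords"])
```

In the Spark version, the corresponding change would presumably be to add sparkDF2['coords'] to the select call, so that the coordinates survive the join alongside airport.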

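You can create a new column in the cities table holding the airport name. Then you can simply join the two tables on the airport column. Note that the code below takes the second element of cities, so it assumes the airport name always sits at that position.

Using the data preparation code from @Vaebhav: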

sparkDF1 = sparkDF1.withColumn("airport", F.col("cities")[1])

sparkDF1.show(truncate=False)

+---------------------------------+-------+--------------+
|cities                           |country|airport       |
+---------------------------------+-------+--------------+
|[London, London Luton, Luton]    |UK     |London Luton  |
|[London, London Gatwick, Gatwick]|UK     |London Gatwick|
+---------------------------------+-------+--------------+


finalDF = sparkDF1.join(sparkDF2, on="airport", how="right")

finalDF.show(truncate=False)

+--------------+---------------------------------+-------+-------+------+------+
|airport       |cities                           |country|country|city  |coords|
+--------------+---------------------------------+-------+-------+------+------+
|London Luton  |[London, London Luton, Luton]    |UK     |UK     |London|12.51 |
|London Gatwick|[London, London Gatwick, Gatwick]|UK     |UK     |London|100.32|
+--------------+---------------------------------+-------+-------+------+------+
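
Since the positional lookup cities[1] depends on the airport always being the second element, here is a plain-Python sketch of a lookup that does not rely on position. It is an illustration of the idea, not code from either answer; the helper name `first_known_airport` and the dictionary `airport_coords` are hypothetical.

```python
# Reference data keyed by airport name, mirroring the airports dataframe.
airport_coords = {"London Luton": 12.51, "London Gatwick": 100.32}


def first_known_airport(cities, lookup):
    """Hypothetical helper: return (airport, coords) for the first list
    element that is a known airport, or (None, None) if none matches."""
    for name in cities:
        if name in lookup:
            return name, lookup[name]
    return None, None


# Airport in the second position, as in the sample data.
print(first_known_airport(["London", "London Luton", "Luton"], airport_coords))
# Airport in a different position: cities[1] would pick "London" here.
print(first_known_airport(["London Gatwick", "London", "Gatwick"], airport_coords))
# No airport listed at all.
print(first_known_airport(["Paris", "Orly"], airport_coords))
```

Ported to Spark, this idea would roughly correspond to exploding the cities array and joining on the exploded element; that mapping is an assumption about how one might adapt it, not something shown in the answers above.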

