String matching between two pyspark dataframe columns
I have a dataframe, cities:
country  cities
UK       [London, London Luton, Luton]
UK       [London, London Gatwick, Gatwick]
and a reference dataframe, airports:
city    airport         coords
London  London Luton    12.51
London  London Gatwick  100.32
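I want to match the list of values in the cities column against the airport column of the reference df. If there is a match, fetch the corresponding airport name and coords from the reference df.
Sample of the desired output for the cities df: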
country  cities                             airport         coords
UK       [London, London Luton, Luton]      London Luton    12.51
UK       [London, London Gatwick, Gatwick]  London Gatwick  100.32
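Explanation:
[London, **London Luton**, Luton]
The entry in cities that matches a value in the airport column is **London Luton**.
I have explored a few options but could not get any of them to work. Can anyone help? Thanks.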
You can use array_contains to flag the rows whose cities array contains the search string, and then filter for the rows where the flag is True.
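import pandas as pd
from pyspark.sql import SparkSession, functions as F

# `sql` is the SparkSession used to build the dataframes
sql = SparkSession.builder.getOrCreate()

d1 = {
    'cities': [
        ['London', 'London Luton', 'Luton'],
        ['London', 'London Gatwick', 'Gatwick']
    ],
    'country': ['UK', 'UK']
}
d2 = {
    'country': ['UK', 'UK'],
    'city': ['London', 'London'],
    'airport': ['London Luton', 'London Gatwick'],
    'coords': [12.51, 100.32]
}

sparkDF1 = sql.createDataFrame(pd.DataFrame(d1))
sparkDF2 = sql.createDataFrame(pd.DataFrame(d2))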
sparkDF1.show(truncate=False)
+---------------------------------+-------+
|cities                           |country|
+---------------------------------+-------+
|[London, London Luton, Luton]    |UK     |
|[London, London Gatwick, Gatwick]|UK     |
+---------------------------------+-------+
sparkDF2.show()
+-------+------+--------------+------+
|country|  city|       airport|coords|
+-------+------+--------------+------+
|     UK|London|  London Luton| 12.51|
|     UK|London|London Gatwick|100.32|
+-------+------+--------------+------+
# Pair every cities row with every airport in the same country
finalDF = sparkDF1.join(
    sparkDF2,
    sparkDF1['country'] == sparkDF2['country'],
    'inner'
).select(sparkDF1['*'], sparkDF2['airport'])

# Flag rows whose cities array contains the airport name, then keep only those
finalDF = finalDF.withColumn('flag', F.array_contains(F.col('cities'), F.col('airport')))
finalDF.filter(F.col('flag')).show(truncate=False)
+---------------------------------+-------+--------------+----+
|cities                           |country|airport       |flag|
+---------------------------------+-------+--------------+----+
|[London, London Luton, Luton]    |UK     |London Luton  |true|
|[London, London Gatwick, Gatwick]|UK     |London Gatwick|true|
+---------------------------------+-------+--------------+----+
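You can create a new column in the cities table that holds the airport name (taken here from the second element of the cities array). Then you can simply join the two tables on the airport column.
Using the setup code from @Vaebhav: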
sparkDF1 = sparkDF1.withColumn("airport", F.col("cities")[1])
sparkDF1.show(truncate=False)
+---------------------------------+-------+--------------+
|cities                           |country|airport       |
+---------------------------------+-------+--------------+
|[London, London Luton, Luton]    |UK     |London Luton  |
|[London, London Gatwick, Gatwick]|UK     |London Gatwick|
+---------------------------------+-------+--------------+
finalDF = sparkDF1.join(sparkDF2, on="airport", how="right")
finalDF.show(truncate=False)
+--------------+---------------------------------+-------+-------+------+------+
|airport       |cities                           |country|country|city  |coords|
+--------------+---------------------------------+-------+-------+------+------+
|London Luton  |[London, London Luton, Luton]    |UK     |UK     |London|12.51 |
|London Gatwick|[London, London Gatwick, Gatwick]|UK     |UK     |London|100.32|
+--------------+---------------------------------+-------+-------+------+------+