String matching between two pyspark dataframe columns
I have a dataframe cities :
country  cities
UK       [London, London Luton, Luton]
UK       [London, London Gatwick, Gatwick]
and a reference dataframe airports :
city    airport         coords
London  London Luton    12.51
London  London Gatwick  100.32
I want to match the list of values in the cities column against the airport column of the reference df. Where there is a match, I want to pull the matching airport name and coordinates from the reference df into the cities df.
Sample of the desired output for the cities df:
country  cities                             airport         coords
UK       [London, London Luton, Luton]      London Luton    12.51
UK       [London, London Gatwick, Gatwick]  London Gatwick  100.32
Explanation: [London, **London Luton**, Luton] from cities matches **London Luton** in airport.
I have explored some options but couldn't really get there. Can anyone help? Thanks
You can use array_contains to flag the rows whose cities array contains the airport string, and then filter down to the rows where that flag is True :
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

sql = SparkSession.builder.getOrCreate()

d1 = {
    'cities': [
        ['London', 'London Luton', 'Luton'],
        ['London', 'London Gatwick', 'Gatwick']
    ],
    'country': ['UK', 'UK']
}
d2 = {
    'country': ['UK', 'UK'],
    'city': ['London', 'London'],
    'airport': ['London Luton', 'London Gatwick'],
    'coords': [12.51, 100.32]
}

sparkDF1 = sql.createDataFrame(pd.DataFrame(d1))
sparkDF2 = sql.createDataFrame(pd.DataFrame(d2))
sparkDF1.show(truncate=False)
+---------------------------------+-------+
|cities |country|
+---------------------------------+-------+
|[London, London Luton, Luton] |UK |
|[London, London Gatwick, Gatwick]|UK |
+---------------------------------+-------+
sparkDF2.show()
+-------+------+--------------+------+
|country| city| airport|coords|
+-------+------+--------------+------+
| UK|London| London Luton| 12.51|
| UK|London|London Gatwick|100.32|
+-------+------+--------------+------+
finalDF = sparkDF1.join(sparkDF2
,sparkDF1['country'] == sparkDF2['country']
,'inner'
).select(sparkDF1['*'],sparkDF2['airport'])
finalDF = finalDF.withColumn('flag',F.array_contains( F.col('cities'),F.col('airport') ) )
finalDF.filter(F.col('flag') == True).show(truncate=False)
+---------------------------------+-------+--------------+----+
|cities |country|airport |flag|
+---------------------------------+-------+--------------+----+
|[London, London Luton, Luton] |UK |London Luton |true|
|[London, London Gatwick, Gatwick]|UK |London Gatwick|true|
+---------------------------------+-------+--------------+----+
You could create a new column on the cities table holding the airport name, and then simply merge the two tables on the airport column.
Using the prep code from @Vaebhav:
# note: this assumes the airport name is always the second element of cities
sparkDF1 = sparkDF1.withColumn("airport", F.col("cities")[1])
sparkDF1.show(truncate=False)
+---------------------------------+-------+--------------+
|cities |country|airport |
+---------------------------------+-------+--------------+
|[London, London Luton, Luton] |UK |London Luton |
|[London, London Gatwick, Gatwick]|UK |London Gatwick|
+---------------------------------+-------+--------------+
finalDF = sparkDF1.join(sparkDF2, on="airport", how="right")
finalDF.show(truncate=False)
+--------------+---------------------------------+-------+-------+------+------+
|airport |cities |country|country|city |coords|
+--------------+---------------------------------+-------+-------+------+------+
|London Luton |[London, London Luton, Luton] |UK |UK |London|12.51 |
|London Gatwick|[London, London Gatwick, Gatwick]|UK |UK |London|100.32|
+--------------+---------------------------------+-------+-------+------+------+