简体   繁体   中英

String matching between two pyspark dataframe columns

I have a dataframe cities

country      cities
  UK        [London,London Luton, Luton]
  UK        [London,London Gatwick, Gatwick]

and reference dataframe airports :

city         airport            coords
London        London Luton       12.51
London        London Gatwick     100.32

I want to match list of values in cities column with airport column from reference df. If it matches, get the relevance airport names and co ordinates from the reference df.

Sample of the desired output of cities df :

country      cities                                airport            coords
  UK        [London,London Luton, Luton]           London Luton       12.51
  UK        [London,London Gatwick, Gatwick]       London Gatwick     100.32

Explanation:

[London,**London Luton**, Luton] from cities matches with **London Luton** in airport

I have explored some options but couldn't really get there. Can anyone help? Thanks

You can utilise array_contains to mark rows which contain your search string and finally filter on the rows which returns True

Data Preparation

d1 = {
    'cities':[
                 ['London','London Luton', 'Luton'],
                 ['London','London Gatwick', 'Gatwick']
            ],
    'country':['UK','UK']
    
    
}

d2 = {
    'country':['UK','UK'],
    'city':['London','London'],
    'airport':['London Luton','London Gatwick'],
    'coords':[12.51,100.32]
}


sparkDF1 = sql.createDataFrame(pd.DataFrame(d1))
sparkDF2 = sql.createDataFrame(pd.DataFrame(d2))

sparkDF1.show(truncate=False)

+---------------------------------+-------+
|cities                           |country|
+---------------------------------+-------+
|[London, London Luton, Luton]    |UK     |
|[London, London Gatwick, Gatwick]|UK     |
+---------------------------------+-------+

sparkDF2.show()

+-------+------+--------------+------+
|country|  city|       airport|coords|
+-------+------+--------------+------+
|     UK|London|  London Luton| 12.51|
|     UK|London|London Gatwick|100.32|
+-------+------+--------------+------+

Array Contains

finalDF = sparkDF1.join(sparkDF2
                       ,sparkDF1['country'] == sparkDF2['country']
                       ,'inner'
                    ).select(sparkDF1['*'],sparkDF2['airport'])


finalDF = finalDF.withColumn('flag',F.array_contains( F.col('cities'),F.col('airport') ) )


finalDF.filter(F.col('flag') == True).show(truncate=False)

+---------------------------------+-------+--------------+----+
|cities                           |country|airport       |flag|
+---------------------------------+-------+--------------+----+
|[London, London Luton, Luton]    |UK     |London Luton  |true|
|[London, London Gatwick, Gatwick]|UK     |London Gatwick|true|
+---------------------------------+-------+--------------+----+

You could create a new column on the cities table with the airport name. Then you could simply merge the two tables on the airport column.

Using the prep core from @Vaebhav:

sparkDF1 = sparkDF1.withColumn("airport", F.col("cities")[1])

sparkDF1.show(truncate=False)

+---------------------------------+-------+--------------+
|cities                           |country|airport       |
+---------------------------------+-------+--------------+
|[London, London Luton, Luton]    |UK     |London Luton  |
|[London, London Gatwick, Gatwick]|UK     |London Gatwick|
+---------------------------------+-------+--------------+


finalDF = sparkDF1.join(sparkDF2, on="airport", how="right")

finalDF.show(truncate=False)

+--------------+---------------------------------+-------+-------+------+------+
|airport       |cities                           |country|country|city  |coords|
+--------------+---------------------------------+-------+-------+------+------+
|London Luton  |[London, London Luton, Luton]    |UK     |UK     |London|12.51 |
|London Gatwick|[London, London Gatwick, Gatwick]|UK     |UK     |London|100.32|
+--------------+---------------------------------+-------+-------+------+------+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM