
How to use LIKE operator as a JOIN condition in pyspark as a column

I would like to do the following in pyspark (for AWS Glue jobs):

JOIN a and b ON a.name = b.name AND a.number = b.number AND a.city LIKE b.city

So for example:

Table a:

Number  Name  City
1000    Bob   %
2000    Joe   London

Table b:

Number  Name  City
1000    Bob   Boston
1000    Bob   Berlin
2000    Joe   Paris

Results

Number  Name  City
1000    Bob   Boston
1000    Bob   Berlin

The part I don't know how to do is implementing the "%" wildcard with the LIKE operator. I know you can use .like() with a literal string pattern, for example:

from pyspark.sql.functions import col

df.where(col('col1').like("%string%")).show()

But it expects a string, whereas in my case the pattern comes from a column. I'd like to do something like the following:

result = a.join(
    b,
    (a.name == b.name) &
    (a.number == b.number) &
    (a.city.like(b.city))  # <-- This doesn't work, since .like() expects a string, not a Column
)

Any help with this would be greatly appreciated!

Try using an expression:

import pyspark.sql.functions as F

result = a.alias('a').join(
    b.alias('b'),
    (a.name == b.name) &
    (a.number == b.number) &
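    # express LIKE as a SQL expression so the pattern can come from a column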
    F.expr("b.city like a.city")
)

I think you meant to do b.city LIKE a.city rather than a.city LIKE b.city, because the % pattern is in table a.
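
For reference, here is a minimal end-to-end sketch of this approach using the sample data above. It assumes a SparkSession named spark is already available (as it is after the usual Glue job setup); the DataFrame and column names are taken from the example tables, and the final select just picks b's columns to match the expected result.

import pyspark.sql.functions as F

# Sample data matching the tables in the question
a = spark.createDataFrame(
    [(1000, "Bob", "%"), (2000, "Joe", "London")],
    ["number", "name", "city"],
)
b = spark.createDataFrame(
    [(1000, "Bob", "Boston"), (1000, "Bob", "Berlin"), (2000, "Joe", "Paris")],
    ["number", "name", "city"],
)

result = a.alias("a").join(
    b.alias("b"),
    (a.name == b.name) &
    (a.number == b.number) &
    F.expr("b.city like a.city"),  # b.city is matched against the pattern in a.city
).select("b.number", "b.name", "b.city")

result.show()
# Expected output (row order may vary):
# +------+----+------+
# |number|name|  city|
# +------+----+------+
# |  1000| Bob|Boston|
# |  1000| Bob|Berlin|
# +------+----+------+

Bob's pattern "%" matches any city, so both of his rows in b survive the join, while Joe's "London" does not match "Paris" and is filtered out.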
