
PySpark join dataframes with LIKE

I am trying to join two dataframes using a LIKE expression in which the pattern (the content of the LIKE) is stored in a column. Is this possible in PySpark 2.3?

Source dataframe:
+---------+----------+
|firstname|middlename|
+---------+----------+
|    James|          |
|  Michael|      Rose|
|   Robert|  Williams|
|    Maria|      Anne|
+---------+----------+
 
Second dataframe:
+---------+----+
|condition|dest|
+---------+----+
|      %a%|Box1|
|      %b%|Box2|
+---------+----+

Expected result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
|    James|          |      %a%|Box1|
|  Michael|      Rose|      %a%|Box1|
|   Robert|  Williams|      %b%|Box2|
|    Maria|      Anne|      %a%|Box1|
+---------+----------+---------+----+

To reproduce the issue, let's create a sample dataframe:

from pyspark.sql.types import StructType, StructField, StringType

data = [("James",""),
    ("Michael","Rose"),
    ("Robert","Williams"),
    ("Maria","Anne")
  ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True)
  ])
 
df = spark.createDataFrame(data=data,schema=schema)
df.show()

and the second one:

mapping = [("%a%","Box1"),("%b%","Box2")]
  
schema = StructType([
    StructField("condition", StringType(), True),
    StructField("dest", StringType(), True)
  ])
  
map = spark.createDataFrame(data=mapping,schema=schema)
map.show()

If I am right, it is not possible to use LIKE directly as a join condition, so I created a crossJoin and tried to filter with like. But can like take its pattern from a column rather than a fixed string? The following does not work, of course, but it shows what I am looking for:

df.crossJoin(map).filter(df.firstname.like(map.condition)).show()

Any expression can be used as a join condition, so the cross join is not needed. It is true that in the DataFrame API the parameter of the like method can only be a str, not a Column, so you can't write col("firstname").like(col("condition")). The SQL parser, however, has no such limitation, so you can leverage expr:

from pyspark.sql.functions import expr

df.join(map, expr("firstname like condition")).show()

Or just plain SQL:

df.createOrReplaceTempView("df")
map.createOrReplaceTempView("map")
spark.sql("SELECT * FROM df JOIN map ON firstname like condition").show()

Both return the same result:

+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
|    James|          |      %a%|Box1|
|  Michael|      Rose|      %a%|Box1|
|   Robert|  Williams|      %b%|Box2|
|    Maria|      Anne|      %a%|Box1|
+---------+----------+---------+----+
