
PySpark join dataframes with LIKE

I am trying to join dataframes using a LIKE expression in which the condition (the content of the LIKE pattern) is stored in a column. Is this possible in PySpark 2.3?

Source dataframe:
+---------+----------+
|firstname|middlename|
+---------+----------+
|    James|          |
|  Michael|      Rose|
|   Robert|  Williams|
|    Maria|      Anne|
+---------+----------+
 
Second dataframe:
+---------+----+
|condition|dest|
+---------+----+
|      %a%|Box1|
|      %b%|Box2|
+---------+----+

Expected result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
|    James|          |      %a%|Box1|
|  Michael|      Rose|      %a%|Box1|
|   Robert|  Williams|      %b%|Box2|
|    Maria|      Anne|      %a%|Box1|
+---------+----------+---------+----+

Let me reproduce the issue on the sample below. Let's create a sample dataframe:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# create or reuse a SparkSession (the original snippet assumes one exists)
spark = SparkSession.builder.getOrCreate()

data = [
    ("James", ""),
    ("Michael", "Rose"),
    ("Robert", "Williams"),
    ("Maria", "Anne"),
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)
df.show()

and the second one:

mapping = [("%a%", "Box1"), ("%b%", "Box2")]

schema = StructType([
    StructField("condition", StringType(), True),
    StructField("dest", StringType(), True),
])

# note: "map" shadows the Python builtin; kept here to match the rest of the post
map = spark.createDataFrame(data=mapping, schema=schema)
map.show()

If I am right, it is not possible to use LIKE directly in a dataframe join, so I created a crossJoin and tried to use a filter with like. But is it possible to take the pattern from a column rather than from a fixed string? The following is invalid syntax, of course, but I am looking for another solution:

# invalid: Column.like() expects a string pattern, not a Column
df.crossJoin(map).filter(df.firstname.like(map.condition)).show()

Any expression can be used as a join condition. True, in the DataFrame API the like function's parameter can only be a str, not a Column, so you can't write col("firstname").like(col("condition")). However, the SQL version does not have this limitation, so you can leverage expr:

from pyspark.sql.functions import expr

df.join(map, expr("firstname like condition")).show()
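The same trick also rescues the crossJoin attempt from the question: expr works inside filter too, and a SQL expression can reference columns from both sides. An equivalent sketch (same result, though the join with an ON condition above is generally preferable, since a cross join pairs every row before filtering):

from pyspark.sql.functions import expr

# equivalent sketch: cross join first, then filter with a SQL expression
# that can reference columns from both sides of the join
df.crossJoin(map).filter(expr("firstname like condition")).show()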

Or just plain SQL:

df.createOrReplaceTempView("df")
map.createOrReplaceTempView("map")
spark.sql("SELECT * FROM df JOIN map ON firstname like condition").show()

Both return the same result:

+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
|    James|          |      %a%|Box1|
|  Michael|      Rose|      %a%|Box1|
|   Robert|  Williams|      %b%|Box2|
|    Maria|      Anne|      %a%|Box1|
+---------+----------+---------+----+
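A side note on performance: a LIKE predicate is a non-equi join condition, so Spark cannot use a hash join here and typically plans a nested loop join. If the mapping table is small, as in this example, an explicit broadcast hint keeps that plan cheap (a sketch, assuming the mapping table fits in memory):

from pyspark.sql.functions import broadcast, expr

# sketch: broadcast the small mapping table so the non-equi (LIKE) join
# runs as a broadcast nested-loop join instead of shuffling both sides
df.join(broadcast(map), expr("firstname like condition")).show()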
