Spark join two dataframes on best .startswith match

Two dataframes, one with phone numbers and one with id prefixes, that can only be matched via a prefix (regex-like) condition, eg:

idPrefix   other columns
420        ...
42055      ...
420551     ...

phoneNumber   other columns
420551666     ...
421709560     ...

I would need to join these dataframes on the best match of idPrefix to phoneNumber, i.e. the longest starting prefix, if there is one. Eg if there were an option to join on the longest idPrefix satisfying phoneNumber.startswith(idPrefix), that would be great.

Eg the result would look like (the longest matching idPrefix per phoneNumber, or null if none):

phoneNumber   idPrefix
420551666     420551
421709560     null

I tried a few UDFs with regex instead of a df.join(), but there does not seem to be a way to pass a dataframe as an input to a UDF.
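(As a side note, one workaround in that direction, sketched here under the assumption that the prefix table is small enough to collect to the driver: instead of passing a dataframe into the UDF, broadcast the collected prefixes and let the UDF pick the longest matching one. The names below are illustrative, not taken from the answers.)

import org.apache.spark.sql.functions._
import spark.implicits._

// Assumes df1 fits in driver memory; try longest prefixes first.
val prefixes = df1.as[String].collect().sortBy(-_.length)
val bcPrefixes = spark.sparkContext.broadcast(prefixes)

// Returns the longest prefix the phone number starts with, or null.
val longestPrefix = udf { phone: String =>
  if (phone == null) null
  else bcPrefixes.value.find(p => phone.startsWith(p)).orNull
}

val result = df2.withColumn("idPrefix", longestPrefix(col("phoneNumber")))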

Join on a startsWith condition, then use row_number to keep the longest idPrefix per phoneNumber:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF on local Seqs

val df1 = Seq("420", "42055", "420551").toDF("idPrefix")
val df2 = Seq("420551666", "421709560").toDF("phoneNumber")

val df = (df2
  .join(df1, col("phoneNumber").startsWith(col("idPrefix")), "left") // every matching prefix per number
  .withColumn(
    "rn",
    row_number().over(
      Window.partitionBy("phoneNumber").orderBy(length(col("idPrefix")).desc)
    )
  )
  .filter("rn = 1") // keep only the longest prefix
  .drop("rn")
)

df.show
//+-----------+--------+
//|phoneNumber|idPrefix|
//+-----------+--------+
//|  421709560|    null|
//|  420551666|  420551|
//+-----------+--------+

It's true that you can use startsWith directly in the join, but this requires that one of the tables can be broadcast; if that isn't possible, the join degenerates to cartesian-product complexity and becomes very inefficient. If both dataframes are very large, you can first join on a fixed-length prefix and then filter with startsWith.
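For reference, a minimal sketch of the broadcast variant (assuming df1 is small enough to broadcast): an explicit broadcast hint lets Spark execute the non-equi startsWith join as a broadcast nested-loop join instead of a cartesian product.

import org.apache.spark.sql.functions.{broadcast, col}

// Broadcasting the small prefix table keeps the startsWith (non-equi)
// join from degenerating into a cartesian join.
val joined = df2.join(broadcast(df1),
  col("phoneNumber").startsWith(col("idPrefix")), "left")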

The prefix-then-filter version looks something like this:

import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and .as[Int]

val df1 = Seq("420", "42055", "420551").toDF("idPrefix")
val df2 = Seq("420551666", "421709560").toDF("phoneNumber")

val prefixSize = df1.select(min(length(col("idPrefix")))).as[Int].head()
// take the length of the shortest prefix; if it is static, just hardcode it

val df = df2
  .join(df1, substring(col("idPrefix"), 1, prefixSize) === substring(col("phoneNumber"), 1, prefixSize), "left")
  .filter(col("phoneNumber").startsWith(col("idPrefix")) || col("idPrefix").isNull) // drop false candidates produced by the coarse prefix join
  .groupBy("phoneNumber")
  .agg(max(struct(length(col("idPrefix")).as("size"), col("idPrefix").as("idPrefix"))).as("idPrefix"))
  .withColumn("idPrefix", col("idPrefix.idPrefix")) // max over (size, idPrefix) structs picks the longest prefix in a single aggregation

df.show
//+-----------+--------+
//|phoneNumber|idPrefix|
//+-----------+--------+
//|  420551666|  420551|
//|  421709560|    null|
//+-----------+--------+
