
How to apply a like operation while joining multiple DataFrames in Spark?

I am trying to join two DataFrames and then apply a like operation to the result, but it does not return any rows. I want to do a pattern match here. Any suggestions as to what I am doing wrong?

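import org.apache.spark._
import org.apache.spark.sql.Row
// StructType, StructField, StringType and IntegerType live here
import org.apache.spark.sql.types._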

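// U_ID values are strings so that they match the StringType column declared below
val upcTable = spark.sqlContext.sparkContext.parallelize(Seq(
  Row("1", 50, 100),
  Row("2", 60, 200),
  Row("36", 70, 300),
  Row("45", 80, 400)
))

// Two fields per row, matching the two-column lookup schema (U_ID, V_ID)
val lookupUpc = spark.sqlContext.sparkContext.parallelize(Seq(
  Row("3", 70),
  Row("4", 80)
))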

val upcDf = spark.sqlContext.createDataFrame(upcTable, StructType(Seq(
  StructField("U_ID", StringType, nullable = false),
  StructField("V_ID", IntegerType, nullable = false),
  StructField("R_ID", IntegerType, nullable = false))))

val lookupDf = spark.sqlContext.createDataFrame(lookupUpc, StructType(Seq(
  StructField("U_ID", StringType, nullable = false),
  StructField("V_ID", IntegerType, nullable = false))))
lookupDf.show()

val joinDf = upcDf.join(lookupDf, Seq("V_ID"), "inner").
  filter(upcDf("U_ID").like("%lookupDf(U_ID)")).
  select(upcDf("U_ID"), upcDf("V_ID"), upcDf("R_ID")).
  show()

Here I expected the rows with U_ID 36 and 45 from upcDf.

The Column method like expects a literal String, so "%lookupDf(U_ID)" is matched as literal text and not as a reference to the lookup column. The method contains, which takes an argument of type Any (and therefore also a Column), is more suitable in your case:

val joinDf = upcDf.join(lookupDf, Seq("V_ID"), "inner").
  where(upcDf("U_ID").contains(lookupDf("U_ID"))).
  select(upcDf("U_ID"), upcDf("V_ID"), upcDf("R_ID"))

joinDf.show
// +----+----+----+
// |U_ID|V_ID|R_ID|
// +----+----+----+
// |  45|  80| 400|
// |  36|  70| 300|
// +----+----+----+

Note that the U_ID values in your sample dataset need to be Strings (for example Row("36", 70, 300)) to match the StringType declared in the listed schemas.
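
If you would rather keep SQL LIKE semantics, one option is to build the pattern from the lookup column inside a SQL expression, since LIKE in Spark SQL accepts a non-literal pattern. The following is a minimal sketch, not part of the original answer. It assumes a local SparkSession and the sample data from the question with U_ID as strings; the object name LikeJoinSketch, the method likeJoin, and the aliases u and l are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object LikeJoinSketch {
  def likeJoin(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample data from the question, with U_ID as String (assumption).
    val upcDf = Seq(("1", 50, 100), ("2", 60, 200), ("36", 70, 300), ("45", 80, 400))
      .toDF("U_ID", "V_ID", "R_ID")
    val lookupDf = Seq(("3", 70), ("4", 80)).toDF("U_ID", "V_ID")

    // The LIKE pattern is built per row from l.U_ID; '%x%' mirrors contains.
    upcDf.as("u")
      .join(lookupDf.as("l"),
        expr("u.V_ID = l.V_ID AND u.U_ID LIKE concat('%', l.U_ID, '%')"),
        "inner")
      .select("u.U_ID", "u.V_ID", "u.R_ID")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("like-join-sketch")
      .getOrCreate()
    likeJoin(spark).show()
    spark.stop()
  }
}
```

One caveat with this approach: any % or _ characters in the lookup value are interpreted as LIKE wildcards, whereas contains compares them literally.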

[UPDATE]

Following the requirement clarified in the comments: if you want to limit the match to the leading character only, I would suggest using regexp_extract (from org.apache.spark.sql.functions) and replacing the above where clause with the following:

where(lookupDf("U_ID") === regexp_extract(upcDf("U_ID"), "^(.)", 1))
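
Put together with the join, the updated filter might look like the sketch below. It is an illustration under assumptions and not the answerer's exact code: it assumes a local SparkSession and the question's sample data with U_ID as strings, and the names LeadingCharSketch and leadingCharJoin are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.regexp_extract

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object LeadingCharSketch {
  def leadingCharJoin(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample data from the question, with U_ID as String (assumption).
    val upcDf = Seq(("1", 50, 100), ("2", 60, 200), ("36", 70, 300), ("45", 80, 400))
      .toDF("U_ID", "V_ID", "R_ID")
    val lookupDf = Seq(("3", 70), ("4", 80)).toDF("U_ID", "V_ID")

    // "^(.)" captures the first character of upcDf's U_ID; group 1 is
    // compared for equality with the lookup U_ID.
    upcDf.join(lookupDf, Seq("V_ID"), "inner").
      where(lookupDf("U_ID") === regexp_extract(upcDf("U_ID"), "^(.)", 1)).
      select(upcDf("U_ID"), upcDf("V_ID"), upcDf("R_ID"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("leading-char-sketch")
      .getOrCreate()
    leadingCharJoin(spark).show()
    spark.stop()
  }
}
```

Because the regular expression captures exactly one character, this only matches single-character lookup values. If the lookup U_ID may be longer, upcDf("U_ID").startsWith(lookupDf("U_ID")) is a possible alternative, since Column.startsWith also accepts a Column, though it matches a prefix of any length and so has different semantics.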
