

Pass date values from dataframe to query in Spark / Scala

I have a dataframe with a date column whose values look like this:

df.show()
+----+----------+
|name|       dob|
+----+----------+
| Jon|2001-04-15|
| Ben|2002-03-01|
+----+----------+

Now I need to query a Hive table for the rows whose date of birth appears in the "dob" column of the dataframe above (both 2001-04-15 and 2002-03-01), so I need to pass the values of the dob column as a parameter to my Hive query.

I tried collecting the values into a variable as below, which gives me an array of strings:

val dobRead = df.select("dob").distinct().as[String].collect()
dobRead: Array[String] = Array(2001-04-15, 2002-03-01)

However, when I try to pass it to the query, I see it is not substituted properly and I get an error:

val tableRead = hive.executeQuery(s"select emp_name,emp_no,martial_status from <<table_name>> where dateOfBirth in ($dobRead)")
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:480 cannot recognize input near '(' '[' 'Ljava' in expression specification
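The `'[' 'Ljava'` fragment in the error is a clue: interpolating an `Array` into a string calls its default `toString`, which prints the JVM type tag and a hash code rather than the elements. A minimal sketch (plain Scala, no Spark needed) reproduces what Hive actually received:

```scala
// Stand-in for the values collected from the dataframe.
val dobRead: Array[String] = Array("2001-04-15", "2002-03-01")

// Interpolating the array directly uses Array's default toString,
// producing something like "in ([Ljava.lang.String;@1b6d3586)" --
// the "[Ljava" text the Hive parser complains about.
val interpolated = s"in ($dobRead)"
println(interpolated)
```

So the array has to be turned into a quoted, comma-separated list before interpolation, which is what the answer below does.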

Can you please help me with how to pass date values to a query in Spark?

You can collect the dates as follows (Row.getAs):

val rows: Array[Row] = df.select("dob").distinct().collect()
val dates: Array[String] = rows.map(_.getAs[String](0))

And then build the query:

val hql: String = s"select ... where dateOfBirth in (${
  dates.map(d => s"'${d}'").mkString(", ")
})"
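The string-building step above can be exercised without Spark. A small sketch, using hypothetical table and column names (`emp_table` stands in for the real table name, which the question elides):

```scala
// Stand-in for the dates collected from the dataframe.
val dates: Array[String] = Array("2001-04-15", "2002-03-01")

// Quote each date and join with commas to form a valid SQL IN list.
val hql: String =
  s"select emp_name, emp_no, martial_status from emp_table where dateOfBirth in (${
    dates.map(d => s"'$d'").mkString(", ")
  })"

println(hql)
// select emp_name, emp_no, martial_status from emp_table where dateOfBirth in ('2001-04-15', '2002-03-01')
```

The resulting string can then be passed to `hive.executeQuery` as in the question.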

Option 2

If the number of dates in the first DataFrame is too big, you should use a join operation instead of collecting them into the driver.

First, load each table as a DataFrame (I'll call them dfEmp and dfDates). Then you can join on the date fields to filter, either using a standard inner join plus filtering out null fields, or using a left_semi join directly:

val dfEmp = hiveContext.table("EmpTable")
val dfDates = df.select("dob").distinct()   // the dates from the first dataframe
val dfEmpFiltered = dfEmp.join(dfDates,
  col("dateOfBirth") === col("dob"), "left_semi")
