Spark - query dataframe based on values from a column in another dataframe
Add Column in a Spark Dataframe, based on a parametric sql query dependent on values of some fields of the dataframe
I have several Spark dataframes (we can call them Tab A, Tab B, etc.). I want to add a single column to Tab A based on the result of a query against one of the other tables, but the table to query changes each time according to the value of one of Tab A's fields. So this query has to be parametric. Below I show an example to make the problem clearer:
Every table has an OID column, a TableName column holding the name of the table itself, and other columns.
This is the fixed query to be performed per row of Tab A to add the new column:
Select $ColumnName from $TableName where OID=$oids
Tab A
| oids | TableName | ColumnName  | other fields | New Column: ValueOidDb     |
==============================================================================
|  2   | Book      | Title       | x            | result query: harry potter |
|  8   | Book      | Isbn        | y            | result query: 556          |
|  1   | Author    | Name        | z            | result query: Tolkien      |
|  4   | Category  | Description | b            | result query: Commedy      |
Tab Book
| OID | TableName | Title        | Isbn | other fields |
=========================================================
|  2  | Book      | harry potter | 123  | x            |
|  8  | Book      | hobbit       | 556  | y            |
| 21  | Book      | etc          | 8942 | z            |
|  5  | Book      | etc2         | 984  | b            |
Tab Author
| OID | TableName | Name         | nationality | other fields |
================================================================
|  5  | Author    | J.Rowling    | eng         | x            |
|  2  | Author    | Geor. Martin | us          | y            |
|  1  | Author    | Tolkien      | eng         | z            |
| 13  | Author    | Dan Brown    | us          | b            |
Tab Category
| OID | TableName | Description |
==================================
| 12  | Category  | Fantasy     |
|  4  | Category  | Commedy     |
|  9  | Category  | Thriller    |
|  7  | Category  | Action      |
I tried this udf:
def setValueOid = (oid: Int, tableName: String, tableColumn: String) => {
  try {
    sqlContext.sql(s"Select $tableColumn from $tableName where OID = $oid").first().toString()
  }
  catch {
    case x: java.lang.NullPointerException => "error"
  }
}
sqlContext.udf.register("setValueOid", setValueOid)
val FinalRtxf = sqlContext.sql("SELECT all the column of TAB A ,"
+ " setValueOid(oid, Table,AttributeDatabaseColumn ) as ValueOidDb"
+ " FROM TAB A")
I put the code inside a try/catch because otherwise it gave me a NullPointerException, but it does not work: it always returns "error". If I try the function without the sql query, just passing some parameters by hand, it works perfectly:
val res = setValueOid(8, "BOOK", "ISBN")
res: String = [0977326403 ]
I read here that it is not possible to run a query inside a udf: Trying to execute a spark sql query from a UDF.
So how can I solve my problem? I don't know how to make a parametric join. I tried this:
%sql
Select all attributes TAB A,
FROM TAB A as a
join (Select $AttributeDatabaseColumn ,TableName from $Table where OID=$oid) as b
on a.Table=b.TableName
But it gives me this exception:
org.apache.spark.sql.AnalysisException: cannot recognize input near '$' 'AttributeDatabaseColumn' ',' in select clause; line 3 pos 1
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
One option: reshape each of Book, Author and Category into the following form:
root
 |-- oid: integer (nullable = false)
 |-- tableName: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
For example, for the first record of Book:
val book = Seq(
  (2L, "Book", Map("title" -> "harry potter", "Isbn" -> "123", "other field" -> "x"))
).toDF("oid", "tableName", "properties")

+---+---------+---------------------------------------------------------+
|oid|tableName|properties                                               |
+---+---------+---------------------------------------------------------+
|2  |Book     |Map(title -> harry potter, Isbn -> 123, other field -> x)|
+---+---------+---------------------------------------------------------+
Union Book, Author and Category into a single properties DataFrame (author and category below are sketched the same way as book):
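For illustration only, author and category are assumed to be built like the book example; the rows simply mirror the matching sample rows of Tab Author and Tab Category above (the same toDF implicits are required):

// Hypothetical counterparts of `book` for the other source tables
val author = Seq(
  (1L, "Author", Map("Name" -> "Tolkien", "nationality" -> "eng", "other field" -> "z"))
).toDF("oid", "tableName", "properties")

val category = Seq(
  (4L, "Category", Map("Description" -> "Commedy"))
).toDF("oid", "tableName", "properties")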
val properties = book.union(author).union(category)
Join with the base table (Tab A):
// `table` is the base Tab A DataFrame; the join columns must exist with these names on both sides
val comb = properties.join(table, Seq("oid", "tableName"))
Finally, add the new column with a case when ... based on tableName, reading the value out of the properties map.
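A minimal sketch of that last step, assuming the comb DataFrame from the join above and that Tab A's ColumnName field names the key to read from the map (column names follow the sample tables; adjust them to the real schema):

import org.apache.spark.sql.functions.{col, when}

// One branch per source table; each branch picks the relevant key out of the properties map
val withValue = comb.withColumn("ValueOidDb",
  when(col("tableName") === "Book" && col("ColumnName") === "Title", col("properties")("Title"))
    .when(col("tableName") === "Book" && col("ColumnName") === "Isbn", col("properties")("Isbn"))
    .when(col("tableName") === "Author", col("properties")("Name"))
    .when(col("tableName") === "Category", col("properties")("Description"))
)

Since ColumnName already holds the map key, the whole case when can also be collapsed into a single dynamic lookup, col("properties")(col("ColumnName")), provided the key casing in the map matches the ColumnName values.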