
Add Column in a Spark Dataframe, based on a parametric SQL query dependent on values of some fields of the dataframe

I have several Spark DataFrames (call them Tab A, Tab B, and so on). I want to add a column to Tab A based on the result of a query against one of the other tables, but the table to query changes from row to row according to the value of one of Tab A's fields, so the query has to be parametric. Here is an example to make the problem clearer:

Every table has an OID column, a TableName column holding the name of that table, and other fields.

    This is the fixed query to be performed on Tab A to add the new column:

    Select $ColumnName from $TableName where OID=$oids

    Tab A
    | oids | TableName | ColumnName  | other fields | New Column: ValueOidDb     |
    ==============================================================================
    |  2   | Book      | Title       | x            | result query: harry potter |
    |  8   | Book      | Isbn        | y            | result query: 556          |
    |  1   | Author    | Name        | z            | result query: Tolkien      |
    |  4   | Category  | Description | b            | result query: Commedy      |


    Tab Book
    | OID | TableName | Title        | Isbn | other fields |
    =========================================================
    |  2  | Book      | harry potter | 123  | x            |
    |  8  | Book      | hobbit       | 556  | y            |
    | 21  | Book      | etc          | 8942 | z            |
    |  5  | Book      | etc2         | 984  | b            |

    Tab Author
    | OID | TableName | Name         | nationality | other fields |
    ================================================================
    |  5  | Author    | J.Rowling    | eng         | x            |
    |  2  | Author    | Geor. Martin | us          | y            |
    |  1  | Author    | Tolkien      | eng         | z            |
    | 13  | Author    | Dan Brown    | us          | b            |


    Tab Category
    | OID | TableName | Description |
    ==================================
    | 12  | Category  | Fantasy     |
    |  4  | Category  | Commedy     |
    |  9  | Category  | Thriller    |
    |  7  | Category  | Action      |
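For reference, sample frames like the tables above can be built with toDF. A sketch (the "other fields" columns are omitted, and Author and Category follow the same pattern as Book):

    import sqlContext.implicits._

    // Tab A: each row names the table and column to look up for its OID.
    val tabA = Seq(
      (2, "Book",     "Title"),
      (8, "Book",     "Isbn"),
      (1, "Author",   "Name"),
      (4, "Category", "Description")
    ).toDF("oids", "TableName", "ColumnName")

    // Tab Book, as above.
    val book = Seq(
      (2,  "Book", "harry potter", 123),
      (8,  "Book", "hobbit",       556),
      (21, "Book", "etc",          8942),
      (5,  "Book", "etc2",         984)
    ).toDF("OID", "TableName", "Title", "Isbn")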

I tried this UDF:

    def setValueOid = (oid: Int, tableName: String, tableColumn: String) => {
      try {
        // Build and run the lookup query for this row's table/column/OID
        sqlContext.sql(s"Select $tableColumn from $tableName where OID = $oid").first().toString()
      } catch {
        case x: java.lang.NullPointerException => "error"
      }
    }

    sqlContext.udf.register("setValueOid", setValueOid)

    val FinalRtxf = sqlContext.sql("SELECT all the columns of TAB A, "
                  + "setValueOid(oid, Table, AttributeDatabaseColumn) as ValueOidDb "
                  + "FROM TAB A")

I wrapped the body in a try/catch because otherwise it throws a NullPointerException, but it doesn't work: it always returns the catch value ("error"). If I call the function directly with some hand-written parameters, outside of an SQL query, it works perfectly:

          val res = setValueOid(8, "BOOK", "ISBN")
          res: String = [0977326403 ]

I read here that you can't make queries inside a UDF: Trying to execute a spark sql query from a UDF

So how can I solve my problem? I don't know how to write a parametric join. I tried this:

    %sql
    Select all attributes of TAB A
    FROM TAB A as a
    join (Select $AttributeDatabaseColumn, TableName from $Table where OID=$oid) as b
    on a.Table = b.TableName

But it gives me this exception:

    org.apache.spark.sql.AnalysisException: cannot recognize input near '$' 'AttributeDatabaseColumn' ',' in select clause; line 3 pos 1
        at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
        at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
        at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
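A note on why that fails: the %sql interpreter hands the statement to HiveQl as literal text, so $AttributeDatabaseColumn is never substituted and the parser stops at the $. Substitution has to happen on the driver before the string reaches the parser, for example with Scala string interpolation. A minimal sketch, where the three values are hypothetical stand-ins for what would come from a row of Tab A:

    // Hypothetical driver-side values; in the real problem they come from
    // a row of Tab A, which is exactly what a UDF cannot feed in here.
    val attributeDatabaseColumn = "Isbn"
    val table = "Book"
    val oid = 8

    // Interpolation is resolved by Scala, so the parser sees a plain query.
    val b = sqlContext.sql(
      s"SELECT $attributeDatabaseColumn, TableName FROM $table WHERE OID = $oid")

This works one query at a time on the driver, but it cannot be driven row by row from inside a UDF, which is the limitation the linked question describes.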

One option:

  • Reshape each of Book, Author and Category to a common form:

     root
      |-- oid: integer (nullable = false)
      |-- tableName: string (nullable = true)
      |-- properties: map (nullable = true)
      |    |-- key: string
      |    |-- value: string (valueContainsNull = true)

    For example, the first record of Book:

     val book = Seq(
       (2L, "Book", Map("title" -> "harry potter", "Isbn" -> "123", "other field" -> "x"))
     ).toDF("oid", "tableName", "properties")

     +---+---------+---------------------------------------------------------+
     |oid|tableName|properties                                               |
     +---+---------+---------------------------------------------------------+
     |2  |Book     |Map(title -> harry potter, Isbn -> 123, other field -> x)|
     +---+---------+---------------------------------------------------------+
  • Union Book, Author and Category into properties:

     val properties = book.union(author).union(category) 
  • Join with the base table:

     val comb = properties.join(table, Seq("oid", "tableName"))
  • Add the new column with a case when ... expression based on tableName and the properties field.
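The answer leaves that last step as prose; here is a minimal sketch of it, assuming the comb DataFrame from the join above, import sqlContext.implicits._ in scope, and map keys that match the values of Tab A's ColumnName column (these are assumptions, not part of the original answer):

    import org.apache.spark.sql.functions.when

    // Pick the right map entry per row, keyed on ColumnName.
    val result = comb.withColumn("ValueOidDb",
      when($"ColumnName" === "Title", $"properties"("Title"))
        .when($"ColumnName" === "Isbn", $"properties"("Isbn"))
        .when($"ColumnName" === "Name", $"properties"("Name"))
        .when($"ColumnName" === "Description", $"properties"("Description")))

Rows matching no branch get null; add .otherwise(...) for a default. If the map keys always equal the ColumnName values, the whole expression should reduce to $"properties"($"ColumnName").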
