
Create Nested Array DataFrame From Existing DataFrame

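I am trying to create nested array-of-struct columns from a DataFrame during a 'join' operation in Scala. The only thing I seem to be able to get working is a single-element array of structs, which does not get written in the JSON output.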

The current schemas I am starting with are:

root
 |-- memberId: integer (nullable = false)
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)

root
 |-- memberSubscriberId: integer (nullable = false)
 |-- subscriberaddresstypecode: string (nullable = false)
 |-- lineOne: string (nullable = false)
 |-- lineTwo: string (nullable = false)
 |-- lineThree: string (nullable = false)
 |-- cityName: string (nullable = false)
 |-- stateCode: string (nullable = false)
 |-- zipCode: string (nullable = false)
 |-- countyCode: string (nullable = false)
 |-- countryCode: string (nullable = false)
 |-- subscriberphonenumber: string (nullable = false)
 |-- subscriberphoneextensionnumber: string (nullable = false)
 |-- subscriberfaxnumber: string (nullable = false)
 |-- subscriberfaxextensionnumber: string (nullable = false)
 |-- address: string (nullable = false)

I want:

root
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)
 |-- memberAddresses: array (nullable = false)
 |    |-- lineOne: string (nullable = false)
 |    |-- lineTwo: string (nullable = false)
 |    |-- lineThree: string (nullable = false)
 |    |-- cityName: string (nullable = false)
 |    |-- stateCode: string (nullable = false)
 |    |-- zipCode: string (nullable = false)
 |    |-- countyCode: string (nullable = false)
 |    |-- countryCode: string (nullable = false)
 |-- memeberPhoneNumbers: array (nullable = false)
 |    |-- phoneNumber: string (nullable = false)
 |    |-- effectiveDate: null (nullable = true)
 |    |-- terminationDate: null (nullable = true)
 |    |-- isCurrent: null (nullable = true)
 |    |-- isActive: null (nullable = true)
 |    |-- telecomType: string (nullable = false)

Current code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, lit, struct}
import spark.implicits._

// mbrDF and addrDF are the two existing DataFrames whose schemas are shown above.

// Wrap the address and phone columns of each row in arrays of structs.
val nestedAddr = addrDF.select(
  $"memberSubscriberId",
  array(
    struct(
      $"lineOne",
      $"lineTwo",
      $"lineThree",
      $"cityName",
      $"stateCode",
      $"zipCode",
      $"countyCode",
      $"countryCode"
    )
  ).as("memberAddresses"),
  array(
    struct(
      $"subscriberphonenumber".alias("phoneNumber"),
      //$"subscriberphoneextensionnumber"
      lit(null).alias("effectiveDate"),
      lit(null).alias("terminationDate"),
      lit(null).alias("isCurrent"),
      lit(null).alias("isActive"),
      lit("home").alias("telecomType")
    ),
    struct(
      $"subscriberfaxnumber".alias("phoneNumber"),
      //$"subscriberfaxextensionnumber".map(c => col(c).as("phoneNumber"))
      lit(null).alias("effectiveDate"),
      lit(null).alias("terminationDate"),
      lit(null).alias("isCurrent"),
      lit(null).alias("isActive"),
      lit("fax").alias("telecomType")
    )
  ).as("memeberPhoneNumbers")
)

// Join the nested columns back onto the member rows by subscriber id.
val addrMbrDF = mbrDF.join(nestedAddr, Seq("memberSubscriberId"))

Resulting schema:

root
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)
 |-- memberAddresses: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- lineOne: string (nullable = false)
 |    |    |-- lineTwo: string (nullable = false)
 |    |    |-- lineThree: string (nullable = false)
 |    |    |-- cityName: string (nullable = false)
 |    |    |-- stateCode: string (nullable = false)
 |    |    |-- zipCode: string (nullable = false)
 |    |    |-- countyCode: string (nullable = false)
 |    |    |-- countryCode: string (nullable = false)
 |-- memeberPhoneNumbers: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- phoneNumber: string (nullable = false)
 |    |    |-- effectiveDate: null (nullable = true)
 |    |    |-- terminationDate: null (nullable = true)
 |    |    |-- isCurrent: null (nullable = true)
 |    |    |-- isActive: null (nullable = true)
 |    |    |-- telecomType: string (nullable = false)


Expected schema:
root
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)
 |-- memberAddresses: array (nullable = false)
 |    |-- lineOne: string (nullable = false)
 |    |-- lineTwo: string (nullable = false)
 |    |-- lineThree: string (nullable = false)
 |    |-- cityName: string (nullable = false)
 |    |-- stateCode: string (nullable = false)
 |    |-- zipCode: string (nullable = false)
 |    |-- countyCode: string (nullable = false)
 |    |-- countryCode: string (nullable = false)
 |-- memeberPhoneNumbers: array (nullable = false)
 |    |-- phoneNumber: string (nullable = false)
 |    |-- effectiveDate: null (nullable = true)
 |    |-- terminationDate: null (nullable = true)
 |    |-- isCurrent: null (nullable = true)
 |    |-- isActive: null (nullable = true)
 |    |-- telecomType: string (nullable = false)

I have tried a number of different things to get this to work:

      ).as("clientAddresses"),
      array(
        struct(
      ).as("clientAddresses"),
       struct(
      ).as("clientAddresses"),
      array(
      ).as("clientAddresses"),
      collect_list(
        struct(

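Simply put, the expected schema you want is not possible. What I mean is that when you have an array, it always contains an element with a given schema, which in your case is a struct. So I would actually say that the schema you are getting is exactly what you wanted to achieve.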
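
To make the point above concrete, here is a minimal sketch, assuming a local SparkSession and a reduced set of address columns. The object name NestedArraySketch, the helper nestAddresses, and the sample rows are hypothetical and do not come from the question. It uses groupBy with collect_list (one of the variants the question mentions trying) so that each subscriber gets an array with as many struct elements as there are address rows, then checks that the array's element type is a struct.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}
import org.apache.spark.sql.types.{ArrayType, StructType}

object NestedArraySketch {

  // Hypothetical helper: gathers all address rows of a subscriber
  // into a single array-of-struct column named memberAddresses.
  def nestAddresses(addrDF: DataFrame): DataFrame =
    addrDF
      .groupBy(col("memberSubscriberId"))
      .agg(collect_list(struct(col("lineOne"), col("cityName"))).as("memberAddresses"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nested-array-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: two addresses for subscriber 1, one for subscriber 2.
    val addrDF = Seq(
      (1, "1 Main St", "Springfield"),
      (1, "9 Oak Ave", "Shelbyville"),
      (2, "5 Elm Rd", "Ogdenville")
    ).toDF("memberSubscriberId", "lineOne", "cityName")

    val nested = nestAddresses(addrDF)

    // The printed schema contains an "element: struct" level under the array.
    nested.printSchema()

    // The same fact, checked programmatically: the array's element type is a struct.
    val elementType =
      nested.schema("memberAddresses").dataType.asInstanceOf[ArrayType].elementType
    assert(elementType.isInstanceOf[StructType])

    // Print each row as a JSON string to inspect how the array is serialized.
    nested.toJSON.collect().foreach(println)

    spark.stop()
  }
}
```

The "element" line in printSchema describes the type of the array's items; it is not an extra field in the data. The JSON strings printed at the end show each array item as a plain object with the struct's fields.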
