繁体   English   中英

从现有DataFrame创建嵌套数组DataFrame

[英]Create Nested Array DataFrame From Existing DataFrame

我试图在scala中的'join'操作期间从数据框创建嵌套的struct数组列。 我似乎唯一能够工作的是设置一个元素结构数组,它不会在json输出中写入。

我开始的当前架构是:

root
 |-- memberId: integer (nullable = false)
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)

root
 |-- memberSubscriberId: integer (nullable = false)
 |-- subscriberaddresstypecode: string (nullable = false)
 |-- lineOne: string (nullable = false)
 |-- lineTwo: string (nullable = false)
 |-- lineThree: string (nullable = false)
 |-- cityName: string (nullable = false)
 |-- stateCode: string (nullable = false)
 |-- zipCode: string (nullable = false)
 |-- countyCode: string (nullable = false)
 |-- countryCode: string (nullable = false)
 |-- subscriberphonenumber: string (nullable = false)
 |-- subscriberphoneextensionnumber: string (nullable = false)
 |-- subscriberfaxnumber: string (nullable = false)
 |-- subscriberfaxextensionnumber: string (nullable = false)
 |-- address: string (nullable = false)

我想:

root
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)
 |-- memberAddresses: array (nullable = false)
 |    |-- lineOne: string (nullable = false)
 |    |-- lineTwo: string (nullable = false)
 |    |-- lineThree: string (nullable = false)
 |    |-- cityName: string (nullable = false)
 |    |-- stateCode: string (nullable = false)
 |    |-- zipCode: string (nullable = false)
 |    |-- countyCode: string (nullable = false)
 |    |-- countryCode: string (nullable = false)
 |-- memeberPhoneNumbers: array (nullable = false)
 |    |-- phoneNumber: string (nullable = false)
 |    |-- effectiveDate: null (nullable = true)
 |    |-- terminationDate: null (nullable = true)
 |    |-- isCurrent: null (nullable = true)
 |    |-- isActive: null (nullable = true)
 |    |-- telecomType: string (nullable = false)

当前代码:

val clientDF: DataFrame
val addrDF: DataFrame

import spark.implicits._

    val nestedAddr = addrDF.select(
      $"clientSubscriberId",
      array(
        struct(
          $"lineOne",
          $"lineTwo",
          $"lineThree",
          $"cityName",
          $"stateCode",
          $"zipCode",
          $"countyCode",
          $"countryCode"
        )
      ).as("clientAddresses"),
      array(
        struct(
          $"subscriberphonenumber".alias("phoneNumber"),
          //$"subscriberphoneextensionnumber"
          lit(null).alias("effectiveDate"),
          lit(null).alias("terminationDate"),
          lit(null).alias("isCurrent"),
          lit(null).alias("isActive"),
          lit("home").alias("telecomType")
        ),
        struct(
          $"subscriberfaxnumber".alias("phoneNumber"),
          //$"subscriberfaxextensionnumber".map(c => col(c).as("phoneNumber"))
          lit(null).alias("effectiveDate"),
          lit(null).alias("terminationDate"),
          lit(null).alias("isCurrent"),
          lit(null).alias("isActive"),
          lit("fax").alias("telecomType")
        )
      ).as("memeberPhoneNumbers")
    )
    val addrMbrDF = mbrDF.join(nestedAddr, Seq("clientSubscriberId"))

结果架构:

root
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)
 |-- memberAddresses: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- lineOne: string (nullable = false)
 |    |    |-- lineTwo: string (nullable = false)
 |    |    |-- lineThree: string (nullable = false)
 |    |    |-- cityName: string (nullable = false)
 |    |    |-- stateCode: string (nullable = false)
 |    |    |-- zipCode: string (nullable = false)
 |    |    |-- countyCode: string (nullable = false)
 |    |    |-- countryCode: string (nullable = false)
 |-- memeberPhoneNumbers: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- phoneNumber: string (nullable = false)
 |    |    |-- effectiveDate: null (nullable = true)
 |    |    |-- terminationDate: null (nullable = true)
 |    |    |-- isCurrent: null (nullable = true)
 |    |    |-- isActive: null (nullable = true)
 |    |    |-- telecomType: string (nullable = false)


Expected schema:
root
 |-- memberSubscriberId: integer (nullable = false)
 |-- memberId: integer (nullable = false)
 |-- memberIdSuffix: integer (nullable = false)
 |-- memberLastName: string (nullable = false)
 |-- memberFirstName: string (nullable = false)
 |-- memberMiddleInitial: string (nullable = false)
 |-- memberSocialSecurityNumber: string (nullable = false)
 |-- memberGender: string (nullable = false)
 |-- memberBirthDate: timestamp (nullable = false)
 |-- memberworkphonenumber: string (nullable = false)
 |-- memberworkphoneextensionnumber: string (nullable = false)
 |-- membercellphone: string (nullable = false)
 |-- memberAddresses: array (nullable = false)
 |    |-- lineOne: string (nullable = false)
 |    |-- lineTwo: string (nullable = false)
 |    |-- lineThree: string (nullable = false)
 |    |-- cityName: string (nullable = false)
 |    |-- stateCode: string (nullable = false)
 |    |-- zipCode: string (nullable = false)
 |    |-- countyCode: string (nullable = false)
 |    |-- countryCode: string (nullable = false)
 |-- memeberPhoneNumbers: array (nullable = false)
 |    |-- phoneNumber: string (nullable = false)
 |    |-- effectiveDate: null (nullable = true)
 |    |-- terminationDate: null (nullable = true)
 |    |-- isCurrent: null (nullable = true)
 |    |-- isActive: null (nullable = true)
 |    |-- telecomType: string (nullable = false)

我尝试过多种不同的东西让它起作用:

      ).as("clientAddresses"),
      array(
        struct(
      ).as("clientAddresses"),
       struct(
      ).as("clientAddresses"),
      array(
      ).as("clientAddresses"),
      collect_list(
        struct(

简单地说,您想要的预期模式是不可能的。 我的意思是,当你有一个数组时,它总是包含一个具有给定模式的element ,在你的情况下它是一个结构。 所以我实际上说你得到的架构正是你想要实现的。

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM