[英]Create Nested Array DataFrame From Existing DataFrame
我试图在scala中的'join'操作期间从数据框创建嵌套的struct数组列。 我似乎唯一能够工作的是设置一个元素结构数组,它不会在json输出中写入。
我开始的当前架构是:
root
|-- memberId: integer (nullable = false)
|-- memberSubscriberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
root
|-- memberSubscriberId: integer (nullable = false)
|-- subscriberaddresstypecode: string (nullable = false)
|-- lineOne: string (nullable = false)
|-- lineTwo: string (nullable = false)
|-- lineThree: string (nullable = false)
|-- cityName: string (nullable = false)
|-- stateCode: string (nullable = false)
|-- zipCode: string (nullable = false)
|-- countyCode: string (nullable = false)
|-- countryCode: string (nullable = false)
|-- subscriberphonenumber: string (nullable = false)
|-- subscriberphoneextensionnumber: string (nullable = false)
|-- subscriberfaxnumber: string (nullable = false)
|-- subscriberfaxextensionnumber: string (nullable = false)
|-- address: string (nullable = false)
我想:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- lineOne: string (nullable = false)
| |-- lineTwo: string (nullable = false)
| |-- lineThree: string (nullable = false)
| |-- cityName: string (nullable = false)
| |-- stateCode: string (nullable = false)
| |-- zipCode: string (nullable = false)
| |-- countyCode: string (nullable = false)
| |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- phoneNumber: string (nullable = false)
| |-- effectiveDate: null (nullable = true)
| |-- terminationDate: null (nullable = true)
| |-- isCurrent: null (nullable = true)
| |-- isActive: null (nullable = true)
| |-- telecomType: string (nullable = false)
当前代码:
val clientDF: DataFrame
val addrDF: DataFrame
import spark.implicits._
val nestedAddr = addrDF.select(
$"clientSubscriberId",
array(
struct(
$"lineOne",
$"lineTwo",
$"lineThree",
$"cityName",
$"stateCode",
$"zipCode",
$"countyCode",
$"countryCode"
)
).as("clientAddresses"),
array(
struct(
$"subscriberphonenumber".alias("phoneNumber"),
//$"subscriberphoneextensionnumber"
lit(null).alias("effectiveDate"),
lit(null).alias("terminationDate"),
lit(null).alias("isCurrent"),
lit(null).alias("isActive"),
lit("home").alias("telecomType")
),
struct(
$"subscriberfaxnumber".alias("phoneNumber"),
//$"subscriberfaxextensionnumber".map(c => col(c).as("phoneNumber"))
lit(null).alias("effectiveDate"),
lit(null).alias("terminationDate"),
lit(null).alias("isCurrent"),
lit(null).alias("isActive"),
lit("fax").alias("telecomType")
)
).as("memeberPhoneNumbers")
)
val addrMbrDF = mbrDF.join(nestedAddr, Seq("clientSubscriberId"))
结果架构:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- lineOne: string (nullable = false)
| | |-- lineTwo: string (nullable = false)
| | |-- lineThree: string (nullable = false)
| | |-- cityName: string (nullable = false)
| | |-- stateCode: string (nullable = false)
| | |-- zipCode: string (nullable = false)
| | |-- countyCode: string (nullable = false)
| | |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- phoneNumber: string (nullable = false)
| | |-- effectiveDate: null (nullable = true)
| | |-- terminationDate: null (nullable = true)
| | |-- isCurrent: null (nullable = true)
| | |-- isActive: null (nullable = true)
| | |-- telecomType: string (nullable = false)
Expected schema:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- lineOne: string (nullable = false)
| |-- lineTwo: string (nullable = false)
| |-- lineThree: string (nullable = false)
| |-- cityName: string (nullable = false)
| |-- stateCode: string (nullable = false)
| |-- zipCode: string (nullable = false)
| |-- countyCode: string (nullable = false)
| |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- phoneNumber: string (nullable = false)
| |-- effectiveDate: null (nullable = true)
| |-- terminationDate: null (nullable = true)
| |-- isCurrent: null (nullable = true)
| |-- isActive: null (nullable = true)
| |-- telecomType: string (nullable = false)
我尝试过多种不同的东西让它起作用:
).as("clientAddresses"),
array(
struct(
).as("clientAddresses"),
struct(
).as("clientAddresses"),
array(
).as("clientAddresses"),
collect_list(
struct(
简单地说,您想要的预期模式是不可能的。 我的意思是,当你有一个数组时,它总是包含一个具有给定模式的element
,在你的情况下它是一个结构。 所以我实际上说你得到的架构正是你想要实现的。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.