Retrieving a nested column in a Python Spark DataFrame
Part of my df schema:
-- result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- error: string (nullable = true)
| | |-- hop: long (nullable = true)
| | |-- resuLt: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- from: string (nullable = true)
| | | | |-- rtt: double (nullable = true)
| | | | |-- size: long (nullable = true)
| | | | |-- ttl: long (nullable = true)
| | |-- result: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Rtt: double (nullable = true)
| | | | |-- Ttl: long (nullable = true)
| | | | |-- dstoptsize: long (nullable = true)
| | | | |-- dup: boolean (nullable = true)
| | | | |-- edst: string (nullable = true)
| | | | |-- err: string (nullable = true)
| | | | |-- error: string (nullable = true)
| | | | |-- flags: string (nullable = true)
| | | | |-- from: string (nullable = true)
| | | | |-- hdropts: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- mss: long (nullable = true)
| | | | |-- icmpext: struct (nullable = true)
| | | | | |-- obj: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- class: long (nullable = true)
| | | | | | | |-- mpls: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- exp: long (nullable = true)
| | | | | | | | | |-- label: long (nullable = true)
| | | | | | | | | |-- s: long (nullable = true)
| | | | | | | | | |-- ttl: long (nullable = true)
| | | | | | | |-- type: long (nullable = true)
| | | | | |-- rfc4884: long (nullable = true)
| | | | | |-- version: long (nullable = true)
| | | | |-- itos: long (nullable = true)
| | | | |-- ittl: long (nullable = true)
| | | | |-- late: long (nullable = true)
| | | | |-- mtu: long (nullable = true)
| | | | |-- rtt: double (nullable = true)
| | | | |-- sIze: long (nullable = true)
| | | | |-- size: long (nullable = true)
| | | | |-- tos: long (nullable = true)
| | | | |-- ttl: long (nullable = true)
| | | | |-- x: string (nullable = true)
How can I query a nested column such as result.result.dstoptsize? I would like to be able to display everything under result, or even result.result or result.resuLt (case sensitivity is enabled in my Spark config).
When I try:
file_df.select("result.resuLt.dstopsize").show(10)
I get this error:
cannot resolve '`result`.`resuLt`['dstopsize']' due to data type mismatch: argument 2 requires integral type, however, ''dstopsize'' is of string type.;;
EDIT: here is some sample data
|_corrupt_record| af| dst_addr| dst_name| endtime| from| fw|group_id|lts| msm_id| msm_name|paris_id|prb_id|proto| result|size| src_addr| timestamp| ttr| type|
+---------------+---+---------------+---------------+----------+-------------+----+--------+---+--------+----------+--------+------+-----+--------------------+----+-------------+----------+----+----------+
| null| 4|213.133.109.134|213.133.109.134|1551658584|78.197.253.14|4940| null| 71| 5019|Traceroute| 3| 13230| UDP|[[, 1,, [[,,,,,,,...| 40|192.168.0.130|1551658577|null|traceroute|
| null| 4| 37.143.33.15| 37.143.33.15|1551658584|78.197.253.14|4940|15254159| 71|15254159|Traceroute| 12| 13230| ICMP|[[, 1,, [[,,,,,,,...| 48|192.168.0.130|1551658583|null|traceroute|
| null| 4| 139.162.27.28| 139.162.27.28|1551658612|78.197.253.14|4940| null| 20| 5027|Traceroute| 3| 13230| UDP|[[, 1,, [[,,,,,,,...| 40|192.168.0.130|1551658606|null|traceroute|
| null| 4| 45.33.72.12| 45.33.72.12|1551658610|78.197.253.14|4940| null| 18| 5029|Traceroute| 3| 13230| UDP|[[, 1,, [[,,,,,,,...| 40|192.168.0.130|1551658608|null|traceroute|
| null| 4|104.237.152.132|104.237.152.132|1551658615|78.197.253.14|4940| null| 23| 5028|Traceroute| 3| 13230| UDP|[[, 1,, [[,,,,,,,...| 40|192.168.0.130|1551658608|null|traceroute|
| null| 4| 94.126.208.18| 94.126.208.18|1551658516|37.14.215.183|4940| 9183324| 20| 9183324|Traceroute| 15| 11958| ICMP|[[, 1,, [[,,,,,,,...| 48| 192.168.22.2|1551658439|null|traceroute|
| null| 4|196.192.112.244|196.192.112.244|1551658554|37.14.215.183|4940| 9181461| 25| 9181461|Traceroute| 15| 11958| ICMP|[[, 1,, [[,,,,,,,...| 48| 192.168.22.2|1551658474|null|traceroute|
| null| 4| 46.234.34.8| 46.234.34.8|1551658539|37.14.215.183|4940| 9180758| 10| 9180758|Traceroute| 15| 11958| ICMP|[[, 1,, [[,,,,,,,...| 48| 192.168.22.2|1551658479|null|traceroute|
| null| 4| 185.2.64.76| 185.2.64.76|1551658560|37.14.215.183|4940| 9181290| 31| 9181290|Traceroute| 15| 11958| ICMP|[[, 1,, [[,,,,,,,...| 48| 192.168.22.2|1551658511|null|traceroute|
| null| 4| 208.80.155.69| 208.80.155.69|1551658597|37.14.215.183|4940| 9183716| 8| 9183716|Traceroute| 15| 11958| ICMP|[[, 1,, [[,,,,,,,...| 48| 192.168.22.2|1551658546|null|traceroute|
+---------------+---+---------------+---------------+----------+-------------+----+--------+---+--------+----------+--------+------+-----+--------------------+----+-------------+----------+----+----------+
This is atrociously ugly, but it returns the correct answer, provided you know which column indexes correspond to your nested fields. That can be computed from the schema.
case class C(x: String, dstoptsize: Long, y: String)
case class B(result: Array[C])
case class A(result: Array[B])

val df = List(
  A(Array(
    B(Array(C("x10", 10, "y10"), C("x11", 11, "y11"))),
    B(Array(C("x12", 12, "y12"), C("x13", 13, "y13")))
  )),
  A(Array(B(Array(C("x20", 20, "y20"), C("x21", 21, "y21")))))
).toDF

// getSeq(0) pulls the inner "result" array; getLong(1) pulls dstoptsize by field index.
val selectInner = udf((x: Seq[Row]) => x.map(_.getSeq[Row](0).map(_.getLong(1))))

df.select(selectInner($"result")).show
+--------------------+
| UDF(result)|
+--------------------+
|[[10, 11], [12, 13]]|
| [[20, 21]]|
+--------------------+
result is of array type, so you will need explode or explode_outer, and can then access whatever you need.
from pyspark.sql.functions import explode_outer, col

file_df.withColumn("exploded_result", explode_outer(col("result"))) \
    .select("exploded_result.result.dstoptsize").show(10)
Be cautious though: a single input row will produce multiple output rows, depending on the number of elements in the array.