Part of my df schema:
-- result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- error: string (nullable = true)
| | |-- hop: long (nullable = true)
| | |-- resuLt: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- from: string (nullable = true)
| | | | |-- rtt: double (nullable = true)
| | | | |-- size: long (nullable = true)
| | | | |-- ttl: long (nullable = true)
| | |-- result: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Rtt: double (nullable = true)
| | | | |-- Ttl: long (nullable = true)
| | | | |-- dstoptsize: long (nullable = true)
| | | | |-- dup: boolean (nullable = true)
| | | | |-- edst: string (nullable = true)
| | | | |-- err: string (nullable = true)
| | | | |-- error: string (nullable = true)
| | | | |-- flags: string (nullable = true)
| | | | |-- from: string (nullable = true)
| | | | |-- hdropts: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- mss: long (nullable = true)
| | | | |-- icmpext: struct (nullable = true)
| | | | | |-- obj: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- class: long (nullable = true)
| | | | | | | |-- mpls: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- exp: long (nullable = true)
| | | | | | | | | |-- label: long (nullable = true)
| | | | | | | | | |-- s: long (nullable = true)
| | | | | | | | | |-- ttl: long (nullable = true)
| | | | | | | |-- type: long (nullable = true)
| | | | | |-- rfc4884: long (nullable = true)
| | | | | |-- version: long (nullable = true)
| | | | |-- itos: long (nullable = true)
| | | | |-- ittl: long (nullable = true)
| | | | |-- late: long (nullable = true)
| | | | |-- mtu: long (nullable = true)
| | | | |-- rtt: double (nullable = true)
| | | | |-- sIze: long (nullable = true)
| | | | |-- size: long (nullable = true)
| | | | |-- tos: long (nullable = true)
| | | | |-- ttl: long (nullable = true)
| | | | |-- x: string (nullable = true)
How can I query nested column, like result.result.dstopsize
for example? I would like to be able to display everything from result
or even result.result
or result.resuLt
(case-sensitive is on in my spark config)
When I try:
file_df.select("result.resuLt.dstopsize").show(10)
I get this error:
cannot resolve '`result`.`resuLt`['dstopsize']' due to data type mismatch: argument 2 requires integral type, however, ''dstopsize'' is of string type.;;
EDIT: here is some sample data
|_corrupt_record| af| dst_addr| dst_name| endtime| from| fw|group_id|lts| msm_id| msm_name|paris_id|prb_id|proto| result|size| src_addr| timestamp| ttr| type|
+---------------+---+---------------+---------------+----------+-------------+----+--------+---+--------+----------+--------+------+-----+--------------------+----+-------------+----------+----+----------+
| null| 4|213.133.109.134|213.133.109.134|1551658584|78.197.253.14|4940| null| 71| 5019|Traceroute| 3| 13230| UDP|[[, 1,, [[,,,,,,,...| 40|192.168.0.130|1551658577|null|traceroute|
| null| 4| 37.143.33.15| 37.143.33.15|1551658584|78.197.253.14|4940|15254159| 71|15254159|Traceroute| 12| 13230| ICMP|[[, 1,, [[,,,,,,,...| 48|192.168.0.130|1551658583|null|traceroute|
| null| 4| 139.162.27.28| 139.162.27.28|1551658612|78.197.253.14|4940| null| 20| 5027|Traceroute| 3| 13230| UDP|[[, 1,, [[,,,,,,,...| 40|192.168.0.130|1551658606|null|traceroute|
| null| 4| 45.33.72.12| 45.33.72.12|1551658610|78.197.253.14|4940| null| 18| 5029|Traceroute| 3| 13230| UDP|[[, 1,, [[,,,,,,,...| 40|192.168.0.130|1551658608|null|traceroute|
| null| 4|104.237.152.132|104.237.152.132|1551658615|78.197.253.14|4940| null| 23| 5028|Traceroute| 3| 13230| UDP|[[, 1,, [[,,,,,,,...| 40|192.168.0.130|1551658608|null|traceroute|
| null| 4| 94.126.208.18| 94.126.208.18|1551658516|37.14.215.183|4940| 9183324| 20| 9183324|Traceroute| 15| 11958| ICMP|[[, 1,, [[,,,,,,,...| 48| 192.168.22.2|1551658439|null|traceroute|
| null| 4|196.192.112.244|196.192.112.244|1551658554|37.14.215.183|4940| 9181461| 25| 9181461|Traceroute| 15| 11958| ICMP|[[, 1,, [[,,,,,,,...| 48| 192.168.22.2|1551658474|null|traceroute|
| null| 4| 46.234.34.8| 46.234.34.8|1551658539|37.14.215.183|4940| 9180758| 10| 9180758|Traceroute| 15| 11958| ICMP|[[, 1,, [[,,,,,,,...| 48| 192.168.22.2|1551658479|null|traceroute|
| null| 4| 185.2.64.76| 185.2.64.76|1551658560|37.14.215.183|4940| 9181290| 31| 9181290|Traceroute| 15| 11958| ICMP|[[, 1,, [[,,,,,,,...| 48| 192.168.22.2|1551658511|null|traceroute|
| null| 4| 208.80.155.69| 208.80.155.69|1551658597|37.14.215.183|4940| 9183716| 8| 9183716|Traceroute| 15| 11958| ICMP|[[, 1,, [[,,,,,,,...| 48| 192.168.22.2|1551658546|null|traceroute|
+---------------+---+---------------+---------------+----------+-------------+----+--------+---+--------+----------+--------+------+-----+--------------------+----+-------------+----------+----+----------+```
This is atrociously ugly but that would return the correct answer given that you know which column indexes are your nested fields. That can be computed based on the schema.
case class C(x: String, dstoptsize: Long, y: String)
case class B(result: Array[C])
case class A(result: Array[B])
val df = List(
A(Array(
B(Array(C("x10", 10, "y10"), C("x11", 11, "y11"))),
B(Array(C("x12", 12, "y12"), C("x13", 13, "y13")))
)),
A(Array(B(Array(C("x20", 20, "y20"), C("x21", 21, "y21")))))).toDF
val selectInner = udf((x: Seq[Row]) => { x.map(_.getSeq[Row](0).map(_.getLong(1))) })
df.select(selectInner($"result")).show
+--------------------+
| UDF(result)|
+--------------------+
|[[10, 11], [12, 13]]|
| [[20, 21]]|
+--------------------+
The result
is of array
type so you will need to explode
or explode_outer
and then access whatever you need to access.
from pyspark.sql.functions import explode_outer, col
file_df.withColumn("exploded_result", explode_outer(col("result")))
.select("exploded_result.resuLt.dstopsize").show(10)
Be cautious though, you will have multiple rows corresponding to individual row depending on number of elements.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.