
Loading a nested array into a Spark DataFrame column

I have a nested array which looks like this:

a = [[1,2],[2,3]]

I have a streaming DataFrame which looks like this:

+----------+-----+
|system    |level|
+----------+-----+
|Test1     |1    |
|Test2     |3    |
+----------+-----+

I want to include the array in a third column as a nested array:

+----------+-----+-------------+
|system    |level|Data         |
+----------+-----+-------------+
|Test1     |1    |[[1,2],[2,3]]|
+----------+-----+-------------+

I tried the column and array functions, but I am not sure how to use them with a nested array.

Any help would be appreciated.

You can add the new column with a crossJoin:

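a = [[1,2],[2,3]]

# Wrap the array in a one-row DataFrame with a single column named "data",
# then cross join it so that every row of df receives the same nested array.
df.crossJoin(spark.createDataFrame([(a,)], "data array<array<bigint>>")).show()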

+-------------------+----+------+----------------+
|               date|hour| value|            data|
+-------------------+----+------+----------------+
|1984-01-01 00:00:00|   1|638.55|[[1, 2], [2, 3]]|
|1984-01-01 00:00:00|   2|638.55|[[1, 2], [2, 3]]|
|1984-01-01 00:00:00|   3|638.55|[[1, 2], [2, 3]]|
|1984-01-01 00:00:00|   4|638.55|[[1, 2], [2, 3]]|
|1984-01-01 00:00:00|   5|638.55|[[1, 2], [2, 3]]|
+-------------------+----+------+----------------+
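For a constant array, a plain column expression may be simpler than a join, which can matter for a streaming DataFrame. The following is only a minimal sketch, assuming the array holds integers: `nested_array_sql` and `with_nested_array` are hypothetical helper names, and the approach relies on Spark SQL's `array(...)` function evaluated through `pyspark.sql.functions.expr`.

```python
def nested_array_sql(values):
    """Render nested lists of ints as a Spark SQL array(...) expression string."""
    if isinstance(values, (list, tuple)):
        return "array(" + ", ".join(nested_array_sql(v) for v in values) + ")"
    # Assumption for this sketch: leaves are plain ints (bool is rejected).
    if isinstance(values, bool) or not isinstance(values, int):
        raise TypeError(f"only ints are supported in this sketch, got {values!r}")
    return str(values)


def with_nested_array(df, column_name, values):
    """Return df with a constant nested-array column (hypothetical helper)."""
    # Imported lazily so nested_array_sql can be tried without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(column_name, F.expr(nested_array_sql(values)))


a = [[1, 2], [2, 3]]
print(nested_array_sql(a))  # array(array(1, 2), array(2, 3))
```

With the DataFrame from the question, the call would be `with_nested_array(df, "Data", a)`. Because this adds an ordinary column expression, every row gets the same value without any join.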

In the Scala API, we can use the typedLit function to add Array or Map values to a column.

// Ref: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

Here is sample code that adds an Array as a column value.

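import org.apache.spark.sql.functions.typedLit
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark

// A Seq of Seqs gives a true nested array; a Seq of tuples would give an array of structs.
val a = Seq(Seq(1, 2), Seq(2, 3))
val df1 = Seq(("Test1", 1), ("Test3", 3)).toDF("a", "b")

df1.withColumn("new_col", typedLit(a)).show()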

// Output

+-----+---+----------------+
|    a|  b|         new_col|
+-----+---+----------------+
|Test1|  1|[[1, 2], [2, 3]]|
|Test3|  3|[[1, 2], [2, 3]]|
+-----+---+----------------+

I hope this helps.

If you want to add the same array to all rows, you can use typedLit from the SQL functions. See this answer:
https://stackoverflow.com/a/32788650/12365294
