How to get columns from dataframe into a list in spark
I have a DataFrame that has about 80 columns, and I need to get 12 of them into a collection; either Array or List is fine. I googled a bit and found this:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
The problem is, this works for only one column. If I do df.select(col1,col2,col3...).rdd.map.collect(), then it gives me something like this: Array[[col1,col2,col3]].
What I want is Array[[col1],[col2],[col3]]. Is there any way to do this in Spark?
Thanks in advance.
UPDATE
For example, I have a dataframe:
-----
A B C
-----
1 2 3
4 5 6
I need to get the columns into this format:
Array[[1,4],[2,5],[3,6]]
Hope this is clearer... Sorry for the confusion.
You can get a tuple of Array[Any], one per column, by doing the following:
scala> df.select("col1", "col2", "col3", "col4").rdd.map(row => (Array(row(0)), Array(row(1)), Array(row(2)), Array(row(3))))
res6: org.apache.spark.rdd.RDD[(Array[Any], Array[Any], Array[Any], Array[Any])] = MapPartitionsRDD[34] at map at <console>:32
An RDD behaves like an Array, so your required arrays are above. If you want RDD[Array[Array[Any]]] instead, you can do:
scala> df.select("col1", "col2", "col3", "col4").rdd.map(row => Array(Array(row(0)), Array(row(1)), Array(row(2)), Array(row(3))))
res7: org.apache.spark.rdd.RDD[Array[Array[Any]]] = MapPartitionsRDD[39] at map at <console>:32
You can proceed the same way for your twelve columns.
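For twelve columns, hard-coding each row(i) gets tedious. Here is a minimal sketch of the same idea driven by a Seq of column names; the names below are placeholders, not your actual columns:

import org.apache.spark.sql.functions.col

// Placeholder column names; substitute your actual twelve.
val cols = Seq("col1", "col2", "col3", "col4") // ... up to "col12"

// Same transformation as above, but built from the list of names
// instead of spelling out row(0), row(1), ... by hand.
val perColumnArrays = df.select(cols.map(col): _*)
  .rdd
  .map(row => cols.indices.map(i => Array(row(i))).toArray)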
Updated
Your updated question is clearer. You can use the collect_list function before you convert into an rdd and carry on as before.
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val rdd = df.select(collect_list("col1"), collect_list("col2"), collect_list("col3"), collect_list("col4")).rdd.map(row => Array(row(0), row(1), row(2), row(3)))
rdd: org.apache.spark.rdd.RDD[Array[Any]] = MapPartitionsRDD[41] at map at <console>:36
scala> rdd.map(array => array.map(element => println(element))).collect
[Stage 11:> (0 + 0) / 2]WrappedArray(1, 1)
WrappedArray(2, 2)
WrappedArray(3, 3)
WrappedArray(4, 4)
res8: Array[Array[Unit]] = Array(Array((), (), (), ()))
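If you want the result back on the driver as a plain Array[Array[Any]] (the Array[[1,4],[2,5],[3,6]] shape you asked for) rather than just printing it, a minimal sketch, assuming the rdd defined above, is to unwrap the Seq that collect_list produces for each column:

// Each element of the single collected row is the Seq built by collect_list;
// unwrap every one of them into a plain Array.
val columnsAsArrays: Array[Array[Any]] = rdd
  .map(row => row.map(_.asInstanceOf[Seq[Any]].toArray))
  .collect()
  .head
// Matching the output printed above:
// Array(Array(1, 1), Array(2, 2), Array(3, 3), Array(4, 4))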
Dataframe only
You can do all of this within a dataframe itself, with no need to convert to an rdd.
Given that you have a dataframe such as:
scala> df.show(false)
+----+----+----+----+----+----+
|col1|col2|col3|col4|col5|col6|
+----+----+----+----+----+----+
|1 |2 |3 |4 |5 |6 |
|1 |2 |3 |4 |5 |6 |
+----+----+----+----+----+----+
You can simply do the following:
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> df.select(array(collect_list("col1"), collect_list("col2"), collect_list("col3"), collect_list("col4")).as("collectedArray")).show(false)
+--------------------------------------------------------------------------------+
|collectedArray |
+--------------------------------------------------------------------------------+
|[WrappedArray(1, 1), WrappedArray(2, 2), WrappedArray(3, 3), WrappedArray(4, 4)]|
+--------------------------------------------------------------------------------+
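And if you then need that as a local Scala collection rather than a dataframe, a minimal sketch, assuming the same df and the functions._ import above, is to collect the single row and unwrap the nested column:

// Pull the one-row result to the driver and read the nested arrays out of it.
val collected: Seq[Seq[Any]] = df
  .select(array(collect_list("col1"), collect_list("col2"),
                collect_list("col3"), collect_list("col4")).as("collectedArray"))
  .head()
  .getSeq[Seq[Any]](0)
// collected(0) holds the values of col1, collected(1) those of col2, and so on.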