Create a Spark dataframe with thousands of columns, then add an ArrayType column that holds all of them

Question

I'd like to use Scala to create a Spark dataframe like this:

col_1	col_2	col_3	...	col_2048
0.123	0.234	...	...	0.323
0.345	0.456	...	...	0.534

Then I want to add an extra ArrayType column that holds the values of all 2048 columns:

col_1	col_2	col_3	...	col_2048	array_col
0.123	0.234	...	...	0.323	[0.123, 0.234, ..., 0.323]
0.345	0.456	...	...	0.534	[0.345, 0.456, ..., 0.534]

Answer 1

Try this:

import org.apache.spark.sql.functions.{array, col}
df.withColumn("array_col", array(df.columns.map(col): _*)).show

Answer 2

PySpark:

Get the list of column names and use Python's map to turn each name into a column.

from pyspark.sql import functions as f

cols = df.columns
# withColumn returns a new dataframe, so keep the result
df = df.withColumn('array_col', f.array(*map(f.col, cols)))
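Neither answer shows the first half of the question, creating the wide dataframe itself. The sketch below is one way it could be done in PySpark, under a few assumptions: the helper names (`make_column_names`, `make_rows`, `build_wide_df_with_array`) are hypothetical, the random floats are placeholder data standing in for the real values, and an existing `SparkSession` is passed in as `spark`.

```python
import random


def make_column_names(n_cols, prefix="col_"):
    """Return ['col_1', ..., 'col_<n_cols>'], following the question's layout."""
    return [f"{prefix}{i}" for i in range(1, n_cols + 1)]


def make_rows(n_rows, n_cols, seed=0):
    """Generate placeholder rows of random floats (real data would replace this)."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(n_cols)) for _ in range(n_rows)]


def build_wide_df_with_array(spark, n_rows=2, n_cols=2048):
    """Create the wide dataframe and append array_col holding every column."""
    # Imported here so the pure-Python helpers above work without Spark installed.
    from pyspark.sql import functions as f

    names = make_column_names(n_cols)
    df = spark.createDataFrame(make_rows(n_rows, n_cols), schema=names)
    # One select keeps all original columns and adds the array column.
    return df.select("*", f.array(*[f.col(c) for c in names]).alias("array_col"))
```

With a session from `SparkSession.builder.getOrCreate()`, calling `build_wide_df_with_array(spark).select("array_col").show(truncate=False)` should print one array per row. The array column is added in a single `select` here; `withColumn`, as in the answers above, would work equally well for one new column.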


Answer 1
2 Accepted 2021-11-30 10:46:51

Answer 2
1 2021-11-30 10:12:15
