
Efficiently transpose/explode spark dataframe columns into rows in a new table/dataframe format [pyspark]

How to efficiently explode a pyspark dataframe in this way:

+----+-------+------+------+
| id |sport  |travel| work |
+----+-------+------+------+
| 1  | 0.2   | 0.4  | 0.6  |
+----+-------+------+------+
| 2  | 0.7   | 0.9  | 0.5  |
+----+-------+------+------+

and my desired output is this:

+------+--------+  
| c_id | score  |  
+------+--------+  
| 1    | 0.2    |  
+------+--------+  
| 1    | 0.4    |  
+------+--------+  
| 1    | 0.6    |  
+------+--------+  
| 2    | 0.7    |  
+------+--------+  
| 2    | 0.9    |  
+------+--------+  
| 2    | 0.5    |  
+------+--------+  
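For reference, a minimal sketch to reproduce the sample input, assuming a local SparkSession (the variable names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# build the example dataframe from the question
df = spark.createDataFrame(
    [(1, 0.2, 0.4, 0.6), (2, 0.7, 0.9, 0.5)],
    ["id", "sport", "travel", "work"],
)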

First you could put your 3 columns in an array, then arrays_zip them, then explode the result and unpack it with .*, and finally select and rename the unzipped column.

df.withColumn("zip", F.explode(F.arrays_zip(F.array("sport","travel","work"))))\
  .select("id", F.col("zip.*")).withColumnRenamed("0","score").show()

+---+-----+
| id|score|
+---+-----+
|  1|  0.2|
|  1|  0.4|
|  1|  0.6|
|  2|  0.7|
|  2|  0.9|
|  2|  0.5|
+---+-----+
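Note that this output keeps the original id header; if you need the desired c_id name, a withColumnRenamed at the end should do. A small sketch, assuming the result of the snippet above is bound to a variable (the name exploded is illustrative):

# rename id to c_id to match the desired schema
exploded.withColumnRenamed("id", "c_id").show()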

You can also do this without arrays_zip (as mentioned by cPak). arrays_zip is used to combine arrays from different dataframe columns into struct form, so that you can explode all of them together and then select with .*. For this case you could just use:

df.withColumn("score", F.explode((F.array(*(x for x in df.columns if x!="id"))))).select("id","score").show()
