
Creating a new dataframe from a pyspark dataframe column efficiently

I wonder what the most efficient way is to extract a column from a PySpark dataframe and turn it into a new dataframe. The following code runs without any problem on small datasets, but on larger ones it runs very slowly and even causes out-of-memory errors. How can I improve the efficiency of this code?

from functools import reduce

# collect() pulls every row to the driver, where the nested lists are flattened
pdf_edges = sdf_grp.rdd.flatMap(lambda x: x).collect()
edgelist = reduce(lambda a, b: a + b, pdf_edges, [])
sdf_edges = spark.createDataFrame(edgelist)

In the PySpark dataframe sdf_grp, the column "pairs" contains information as below:

+-------------------------------------------------------------------+
|pairs                                                              |
+-------------------------------------------------------------------+
|[[39169813, 24907492], [39169813, 19650174]]                       |
|[[10876191, 139604770]]                                            |
|[[6481958, 22689674]]                                              |
|[[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]]|
|[[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]] |
+-------------------------------------------------------------------+

with the following schema:

root
|-- pairs: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- node1: integer (nullable = false)
|    |    |-- node2: integer (nullable = false)

I'd like to convert it into a new dataframe sdf_edges that looks like this:

+---------+---------+
|    node1|    node2|
+---------+---------+
| 39169813| 24907492|
| 39169813| 19650174|
| 10876191|139604770|
|  6481958| 22689674|
| 73450939|114203936|
| 73450939| 21226555|
| 73450939| 24367554|
| 66306616| 32911686|
| 66306616| 19319140|
| 66306616| 48712544|
+---------+---------+

The most efficient way to extract columns is to avoid collect(). When you call collect(), all the data is transferred to the driver and processed there. A better way to achieve what you want is to use the explode() function. Have a look at the example below:

from pyspark.sql import types as T
import pyspark.sql.functions as F

schema = T.StructType([
    T.StructField("pairs", T.ArrayType(
        T.StructType([
            T.StructField("node1", T.IntegerType()),
            T.StructField("node2", T.IntegerType())
        ])
    ))
])


df = spark.createDataFrame([
    ([[39169813, 24907492], [39169813, 19650174]],),
    ([[10876191, 139604770]],),
    ([[6481958, 22689674]],),
    ([[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]],),
    ([[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]],)
], schema)

df = df.select(F.explode('pairs').alias('exploded')).select('exploded.node1', 'exploded.node2')
df.show(truncate=False)

Output:

+--------+---------+ 
|  node1 |   node2 | 
+--------+---------+ 
|39169813|24907492 | 
|39169813|19650174 | 
|10876191|139604770| 
|6481958 |22689674 | 
|73450939|114203936| 
|73450939|21226555 | 
|73450939|24367554 | 
|66306616|32911686 | 
|66306616|19319140 | 
|66306616|48712544 | 
+--------+---------+ 
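
Applied to the original sdf_grp from the question, the same pattern would presumably be (a minimal sketch, assuming sdf_grp has the schema shown in the question and F is imported as above):

sdf_edges = sdf_grp.select(F.explode('pairs').alias('exploded')).select('exploded.node1', 'exploded.node2')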

Well, I solved it with the following:

# Flatten the 'pairs' column in a distributed way; toDF() infers node1/node2 from the Rows
sdf_edges = sdf_grp.select('pairs').rdd.flatMap(lambda x: x[0]).toDF()
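
Note that this still round-trips through the RDD API: toDF() has to infer the node1/node2 schema by sampling the flattened Rows. The explode() approach in the answer above stays entirely within the DataFrame API and preserves the declared column types, so it is generally the cleaner option.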
