
More efficient way to loop through PySpark DataFrame and create new columns

I am converting some code written with Pandas to PySpark. The code has a lot of for loops to create a variable number of columns depending on user-specified inputs.

I'm using Spark 1.6.x, with the following sample code:

from pyspark.sql import SQLContext
from pyspark.sql import functions as F
import pandas as pd
import numpy as np

# create a Pandas DataFrame, then convert to Spark DataFrame
test = sqlContext.createDataFrame(pd.DataFrame({'val1': np.arange(1,11)}))

Which leaves me with:

+----+
|val1|
+----+
|   1|
|   2|
|   3|
|   4|
|   5|
|   6|
|   7|
|   8|
|   9|
|  10|
+----+

I loop a lot in the code, for example:

for i in np.arange(2,6).tolist():
    test = test.withColumn('val_' + str(i), F.lit(i ** 2) + test.val1)

Which results in:

+----+-----+-----+-----+-----+
|val1|val_2|val_3|val_4|val_5|
+----+-----+-----+-----+-----+
|   1|    5|   10|   17|   26|
|   2|    6|   11|   18|   27|
|   3|    7|   12|   19|   28|
|   4|    8|   13|   20|   29|
|   5|    9|   14|   21|   30|
|   6|   10|   15|   22|   31|
|   7|   11|   16|   23|   32|
|   8|   12|   17|   24|   33|
|   9|   13|   18|   25|   34|
|  10|   14|   19|   26|   35|
+----+-----+-----+-----+-----+

**Question:** How can I rewrite the above loop to be more efficient?

I've noticed that my code runs slower as Spark spends a lot of time on each group of loops (even on small datasets like 2GB of text input).

Thanks

There is a small overhead in repeatedly calling JVM methods, but otherwise the for loop alone shouldn't be a problem. You can improve it slightly by using a single select:

from pyspark.sql import functions as F

df = spark.range(1, 11).toDF("val1")

def make_col(i):
    # each new column is a single expression: i ** 2 + val1
    return (F.pow(F.lit(i), 2) + F.col("val1")).alias("val_{0}".format(i))

df.select("*", *(make_col(i) for i in range(2, 6)))

I would also avoid using NumPy types. Initializing NumPy objects is typically more expensive than plain Python objects, and Spark SQL doesn't support NumPy types, so some additional conversions are required.
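
If the user-specified inputs do arrive as NumPy values, a minimal sketch of the explicit conversion (reusing the test DataFrame from the question; the column names are just illustrative) might look like this:

import numpy as np
from pyspark.sql import functions as F

# cast NumPy scalars to plain Python ints before handing them to Spark SQL
exprs = [(F.lit(int(i) ** 2) + F.col("val1")).alias("val_{0}".format(int(i)))
         for i in np.arange(2, 6)]
result = test.select("*", *exprs)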

One withColumn call works on the entire RDD, so it is generally not good practice to use that method for every column you want to add. Instead, you can work with the columns and their data inside a single map function; because one map does all the work, the code that adds the new columns and their data runs in a single parallel pass. The steps are outlined below, with a consolidated PySpark sketch after the list.

a. Gather the new values based on the calculations.

b. Append these new column values to each row of the main RDD, as below:

// newcol1 and newcol2 are the values computed in step (a) for the current row
val newColumns: Seq[Any] = Seq(newcol1, newcol2)
// row.toSeq.init drops the row's last value before appending the new ones
// (the schema built in step (d) drops the matching field)
Row.fromSeq(row.toSeq.init ++ newColumns)

Here, row is the reference to the current row inside the map function.

c. Create the new schema as below:

val newColumnsStructType = StructType(Seq(new StructField("newcolName1", IntegerType), new StructField("newColName2", IntegerType)))

d. Append the new fields to the old schema:

val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)

e. Create the new DataFrame with the new columns:

val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
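
Putting steps (a) through (e) together for the question's DataFrame, a minimal PySpark sketch (assuming the test DataFrame from the question and the same val_2 through val_5 columns; here the full row is kept rather than dropping its last value) might look like this:

from pyspark.sql.types import StructType, StructField, LongType

def add_cols(row):
    # step (a): compute the new values for this row
    new_vals = [i ** 2 + row.val1 for i in range(2, 6)]
    # step (b): keep the existing values and append the new ones
    return tuple(row) + tuple(new_vals)

# steps (c) and (d): build the new fields and append them to the old schema
new_fields = [StructField("val_{0}".format(i), LongType()) for i in range(2, 6)]
new_schema = StructType(test.schema.fields + new_fields)

# step (e): a single map over the RDD adds every new column in one pass
new_df = sqlContext.createDataFrame(test.rdd.map(add_cols), new_schema)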
