Spark DataFrame native performance vs. PySpark RDD map on a simple string split operation

Question

I don't expect the following code to benefit from the DataFrame Catalyst query optimizer, but I do expect a performance difference between the Scala/native string split and the Python string split. However, my results are disappointing: the native DataFrame API appears to be slower.

My test is as follows:

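import pyspark.sql.functions

def get_df(spark):
    # Load the CSV test data, using the header row and inferring column types.
    return spark.read.load('s3://BUCKET/test-data.csv',
                           format='com.databricks.spark.csv',
                           inferSchema='true', header='true')

def upsize_df(df, exponent=10):
    # Each self-union doubles the row count, so the result has 2**exponent copies.
    for i in range(exponent):
        df = df.unionAll(df)
    return df

def rdd_ver(df):
    # Python path: split order_id inside an RDD map, then convert back to a DataFrame.
    df = df.rdd.map(lambda row: row + tuple(
                        row.order_id.split('-'))).toDF(
                            df.columns + ['psrid', 'eoid'])
    df.show()

def df_ver(df):
    # Native path: split order_id with the built-in DataFrame column function.
    split_col = pyspark.sql.functions.split(df['order_id'], '-')
    df = df.withColumn('psrid', split_col.getItem(0))
    df = df.withColumn('eoid', split_col.getItem(1))
    df.show()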
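To make the per-row work concrete, here is a minimal plain-Python sketch of what the lambda in rdd_ver does to a single row. It is an illustration only: a namedtuple stands in for a Spark Row (both are tuple subclasses), and the `amount` field and the sample values are made up; only `order_id` comes from the code above.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row; only order_id appears in the question.
FakeRow = namedtuple("FakeRow", ["order_id", "amount"])

def split_order_id(row):
    """Append the '-'-separated parts of order_id to the row, like rdd_ver's lambda."""
    return row + tuple(row.order_id.split("-"))

sample = FakeRow(order_id="1234-5678", amount=10)
print(split_order_id(sample))  # ('1234-5678', 10, '1234', '5678')
```

Both versions produce the same two extra columns; they differ only in whether the split runs in Python workers or inside the JVM.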

Cluster/YARN details:

Spark 2.0 on AWS
6 executors
2 cores per executor

Test procedure:

Create a new PySpark shell in IPython
Get a DataFrame of a toy-sized dataset (1000 rows)
Repartition the DataFrame to 12 partitions
Run upsize_df (which uses unionAll) to get to about 1 million rows
Run df.count() to force execution of repartition and upsize_df
Finally, run %time rdd_ver(df) or %time df_ver(df)
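As a quick check of the "1 million rows" figure, the doubling in upsize_df can be worked out in plain Python. This sketch only does the arithmetic; `upsized_rows` is a hypothetical helper, not part of the test code.

```python
def upsized_rows(base_rows, exponent=10):
    """Row count after `exponent` self-unions, each of which doubles the data."""
    rows = base_rows
    for _ in range(exponent):
        rows += rows  # df.unionAll(df) keeps every existing row twice
    return rows

print(upsized_rows(1000))  # 1024000, i.e. 1000 * 2**10
```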

My results so far have been surprising and disappointing. Here is a sampling of the results I've received, in seconds:

rdd_ver: 14.5, 22.4, 13.1, 24.7, 17.8 --- mean: 18.5

df_ver: 30.5, 26.9, 32.0, 29.7, 39.8 --- mean: 31.8
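The reported means can be reproduced from the five timings of each version, which also gives the size of the gap. This is a small sketch using only the numbers quoted above.

```python
from statistics import mean

# Wall-clock timings in seconds, copied from the results above.
rdd_times = [14.5, 22.4, 13.1, 24.7, 17.8]
df_times = [30.5, 26.9, 32.0, 29.7, 39.8]

rdd_mean = mean(rdd_times)
df_mean = mean(df_times)

print(round(rdd_mean, 1), round(df_mean, 1))  # 18.5 31.8
print(round(df_mean / rdd_mean, 2))           # 1.72
```

In this sample the DataFrame version takes roughly 1.7 times as long as the RDD version, although with five runs each and a wide spread the ratio is only a rough indication.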

I'd appreciate any thoughts, either on the test procedure itself (the operation is derived from some production code) or on the poor performance of the DataFrame version.

EDIT:

The Spark Web UI indicates that my jobs are not actually being scheduled/submitted very quickly. I am not sure how reliable the Web UI's information is, but the 'Submitted' time displayed on the active job in this screenshot is over a minute after I initially hit 'enter' in the active PySpark session to kick off %time df_ver(df).

Furthermore, it seems that none of the 6 executors are doing anything. They've all apparently been killed by Spark, since I wasn't actively doing anything in the Spark session for more than a few seconds. It seems like the entire job is being run by the driver node, but I can't confirm that since I don't know the Spark Web UI well enough.

Answer 1

Why do you think it should be faster in Scala? Python string operations are very fast:

Python:

In [58]: %time "this is my string".split()
CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 7.87 µs
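A single %time call on a microsecond-scale operation is sensitive to measurement overhead, so one way to get a steadier per-call figure is to average over many repetitions with the standard-library timeit module. The sketch below is an assumption about how one might repeat the measurement, not part of the original answer, and the repetition count is arbitrary.

```python
import timeit

# Number of repetitions; an arbitrary choice for this sketch.
N = 100_000

# Time N calls of the same split used above and average the result.
total_seconds = timeit.timeit('"this is my string".split()', number=N)
per_call_us = total_seconds / N * 1e6

print(f"{per_call_us:.3f} microseconds per split")
```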

Scala:

bash-3.2$ echo '
object TimeSplit {
   def main(args: Array[String]): Unit = {
     val now = System.nanoTime
     val split = "this is my string".split(" ")
     val diff = System.nanoTime - now
     println("%d microseconds".format(diff/1000))
   }
 }' > timesplit.scala

bash-3.2$ scalac timesplit.scala
bash-3.2$ scala TimeSplit
21 microseconds

Answer 1
0 2016-08-31 23:55:33