
Function mapped to RDD using rdd.map() called multiple times for some rows

I have a source dataframe with some records, and I want to perform an operation on each of its rows. For this purpose I used the rdd.map function. However, judging by the logs recorded with an accumulator, the mapped function appears to be called multiple times for some rows. As per the documentation, it should be called only ONCE.

I tried to reproduce the issue in a small script and observed the same behavior. The script is shown below:

import os
import sys
os.environ['SPARK_HOME'] = "/usr/lib/spark/"
sys.path.append("/usr/lib/spark/python/")
from pyspark.sql import *
from pyspark.accumulators import AccumulatorParam


# Accumulator that concatenates log strings sent back from the executors
class StringAccumulatorParam(AccumulatorParam):
    def zero(self, initialValue=""):
        return ""

    def addInPlace(self, s1, s2):
        return s1.strip() + " " + s2.strip()

# Applied to every row: records a log entry in the accumulator and returns a constant
def mapped_func(row, logging_acc):
    logging_acc += "Started map"
    logging_acc += str(row)
    return "test"

if __name__ == "__main__":
    spark_session = SparkSession.builder.enableHiveSupport().appName("rest-api").getOrCreate()
    sc = spark_session.sparkContext
    df = spark_session.sql("select col1, col2, col3, col4, col5, col6 from proj1_db.dw_table where col3='P1'")
    df.show()
    # Accumulator used only to log which rows the mapped function actually sees
    logging_acc = sc.accumulator("", StringAccumulatorParam())
    # Map over every row; mapped_func is expected to run once per row
    result_rdd = df.rdd.map(lambda row: Row(row, mapped_func(row, logging_acc)))
    result_rdd.toDF().show()
    print "logs: " + str(logging_acc.value)

Below is the relevant piece of output:

+----+----+----+----+----+----+
|col1|col2|col3|col4|col5|col6|
+----+----+----+----+----+----+
|   1|   1|  P1|   2|  10|  20|
|   3|   1|  P1|   1|  25|  25|
+----+----+----+----+----+----+

+--------------------+----+
|                  _1|  _2|
+--------------------+----+
|[1, 1, P1, 2, 10,...|test|
|[3, 1, P1, 1, 25,...|test|
+--------------------+----+

logs: Started map Row(col1=1, col2=1, col3=u'P1', col4=2, col5=10, col6=20) Started map Row(col1=1, col2=1, col3=u'P1', col4=2, col5=10, col6=20) Started map Row(col1=3, col2=1, col3=u'P1', col4=1, col5=25, col6=25)

The first table is the source dataframe and the second is the resulting dataframe created after the map call. As the logs show, the function is called twice for the first row. Can anyone help me understand what is happening here and how to make sure the mapped function is called only ONCE per row?

As per the documentation, it should be called only ONCE.

That's really not the case. Any transformation can be executed an arbitrary number of times (typically in case of failures or to support secondary logic), and the documentation says explicitly that:

For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once

So, by implication, accumulators used inside transformations (like map) can be updated multiple times per task.
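
To see the difference between the two cases, here is a minimal sketch (not from the original post) that continues from the script above; it uses a plain integer accumulator, and the names rows_seen_in_map, rows_seen_in_foreach and tag_row are purely illustrative:

rows_seen_in_map = sc.accumulator(0)
rows_seen_in_foreach = sc.accumulator(0)

# Update inside a transformation: the update may be applied more than once per row
def tag_row(row):
    rows_seen_in_map.add(1)
    return row

tagged = df.rdd.map(tag_row)
tagged.count()           # first action that evaluates the map
tagged.toDF().show()     # re-evaluates the uncached map, so the counter can grow again

# Update inside an action: Spark guarantees each task's update is applied only once
df.rdd.foreach(lambda row: rows_seen_in_foreach.add(1))

print("updates from map: " + str(rows_seen_in_map.value))
print("updates from foreach: " + str(rows_seen_in_foreach.value))

With two source rows, the foreach counter should read 2, while the map counter can end up higher because the mapped RDD is evaluated by more than one action.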

In your case the multiple executions happen because you don't provide a schema when converting the RDD back to a DataFrame; Spark then performs another scan over the data to infer the schema. To avoid that, pass the schema explicitly, i.e.

spark.createDataFrame(result_rdd, schema)

That, however, will only address this particular issue; the general point about transformation and accumulator behavior still stands.
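
For completeness, a rough sketch of what passing the schema explicitly could look like in the script above; the mapped_value column name and its StringType are assumptions, and the original columns are flattened rather than nested purely to keep the example short:

from pyspark.sql.types import StructType, StructField, StringType

# Reuse the source DataFrame's schema and append one field for the mapped result,
# so createDataFrame does not need an extra pass over the data to infer types
result_schema = StructType(df.schema.fields + [StructField("mapped_value", StringType(), True)])

result_rdd = df.rdd.map(lambda row: tuple(row) + (mapped_func(row, logging_acc),))
result_df = spark_session.createDataFrame(result_rdd, result_schema)
result_df.show()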
