
Spark RDD and Dataframe transformation optimisation

I am new to Spark and have the following high-level question about RDDs and Dataframes, which, if I'm not mistaken, are built on top of RDDs:

I understand that there are two types of operations that can be performed on RDDs: transformations and actions. I also understand that transformations are only executed when an action is performed on an RDD that is a product of that transformation. Given that RDDs are held in memory, I was wondering whether there is some possibility of optimising the amount of memory consumed by these RDDs. Take the following example:

import time
from pyspark.sql.functions import unix_timestamp, udf
from pyspark.sql.types import DoubleType

# Select the Kafka fields, parse the Kafka timestamp, and record
# the wall-clock time at which Spark processes each row.
KafkaDF = KafkaDFRaw.select(
    KafkaDFRaw.key,
    KafkaDFRaw.value,
    KafkaDFRaw.topic,
    unix_timestamp('timestamp',
                   'yyyy-MM-dd HH:mm:ss').alias('kafka_arrival_time')
).withColumn("spark_arrival_time", udf(time.time, DoubleType())())
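To illustrate the laziness I mean, here is a minimal sketch (assuming an existing SparkSession named spark; the names are hypothetical): the transformations only build a plan, and nothing runs until the action at the end.

from pyspark.sql.functions import col

# Transformations: no work happens here, only a logical plan is built.
df = spark.range(1000)
doubled = df.withColumn("x2", col("id") * 2)
filtered = doubled.filter(col("x2") > 100)

# Action: only now is the plan optimised and executed.
print(filtered.count())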

In the example above, I have a KafkaDFRaw dataframe and I produce a new RDD called KafkaDF. I then wish to add columns to this new RDD. Should I add them to the existing RDD? Like so:

decoded_value_udf = udf(lambda value: value.decode("utf-8"))
KafkaDF = KafkaDF\
    .withColumn(
        "cleanKey", decoded_value_udf(KafkaDF.key))\
    .withColumn(
        "cleanValue", decoded_value_udf(KafkaDF.value))

Or should I create a new dataframe from the last one? Like so:

decoded_value_udf = udf(lambda value: value.decode("utf-8"))
KafkaDF_NEW = KafkaDF\
    .withColumn(
        "cleanKey", decoded_value_udf(KafkaDF.key))\
    .withColumn(
        "cleanValue", decoded_value_udf(KafkaDF.value))

Does this make a difference in terms of memory optimisation?

Thank you in advance for your help.

Whenever an action is called, the optimised DAG gets executed and memory is used according to the plan. You can compare the execution plans of the two approaches to see this:

df.explain(True)
df_new.explain(True)
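For illustration, a minimal sketch under the question's setup (KafkaDF and decoded_value_udf as defined above; the variable names here are hypothetical). Because a DataFrame is immutable and an assignment only binds a Python name to a lineage, both variants describe the same logical plan and should print identical physical plans:

# Variant 1: rebind the result (a fresh name is used here so that
# KafkaDF itself is not overwritten, which the question's first
# snippet would do).
KafkaDF_reassigned = KafkaDF\
    .withColumn("cleanKey", decoded_value_udf(KafkaDF.key))\
    .withColumn("cleanValue", decoded_value_udf(KafkaDF.value))

# Variant 2: bind the result to a new variable.
KafkaDF_NEW = KafkaDF\
    .withColumn("cleanKey", decoded_value_udf(KafkaDF.key))\
    .withColumn("cleanValue", decoded_value_udf(KafkaDF.value))

# The printed plans should be identical: the choice of variable name
# has no effect on what Spark executes or on memory use.
KafkaDF_reassigned.explain(True)
KafkaDF_NEW.explain(True)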

Creating an extra variable in between to hold the transformations does not impact memory utilisation. Memory requirements will depend on data size, partition sizes, shuffles, etc.
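If the intent is to actually keep results in memory across multiple actions, that has to be requested explicitly; withColumn and select alone materialise nothing. A hedged sketch, assuming the KafkaDF_NEW DataFrame from the question:

# Transformations only build a plan; an action triggers execution.
KafkaDF_NEW.cache()       # mark the DataFrame for in-memory storage
KafkaDF_NEW.count()       # first action materialises the cached data
KafkaDF_NEW.show(5)       # subsequent actions reuse the cached partitions

KafkaDF_NEW.unpersist()   # release the memory when it is no longer needed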
