[英]Spark: How to manage a big aggregateByKey on a single machine
I am using Scala and Spark to manage a large number of records, and each record has the following form:
single record => (String, Row)
Each Row is composed of 45 values of different types (String, Integer and Long).
To aggregate them I am using:
myRecords.aggregateByKey(List[Any]())(
  (aggr, value) => aggr ::: (value :: Nil),
  (aggr1, aggr2) => aggr1 ::: aggr2
)
The problem is that I keep getting messages like:
15/11/21 17:54:14 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 147767 ms exceeds timeout 120000 ms
15/11/21 17:54:14 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 147767 ms
[Stage 3:====> (875 + 24) / 3252]
15/11/21 17:57:10 WARN BlockManager: Putting block rdd_14_876 failed
...and finally...
15/11/21 18:00:27 ERROR Executor: Exception in task 876.0 in stage 3.0 (TID 5465)
java.lang.OutOfMemoryError: GC overhead limit exceeded
My guess is that the aggregated lists grow so large that matching the key of a new record takes more and more time, until some task hits a timeout because it cannot find the right place to add the record's value.
I have tried different parameters with spark-submit, such as:
spark.default.parallelism => to reduce the size of tasks by augmenting this value
spark.executor.memory => usually I set it much less than the driver memory
spark.driver.memory => the whole driver memory (single machine, though)
--master local[number of cores]
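Put together, a single-machine submission with these knobs might look like the following sketch (the jar name, class name and memory sizes are placeholders, not recommendations). Note that in local mode the executor runs inside the driver JVM, so spark.executor.memory is effectively ignored and only the driver memory matters:

```shell
# Hypothetical local-mode submission; adjust cores, memory and paths to your machine
spark-submit \
  --master "local[8]" \
  --driver-memory 8g \
  --conf spark.default.parallelism=200 \
  --class MyRecords \
  my-records-assembly.jar records1.tsv records2.tsv
```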
Any idea how to bring the process to an end without running out of memory or hitting timeouts?
UPDATE
I am trying to merge two csv files based on the following steps:
1) join them based on one csv column
2) merge the joined rows based on 3 column values
3) aggregate/group this joined and merged file with the key from 2)
4) do some stuff on the single aggregated data from 3)
Here is the code:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel._
import org.apache.spark.sql.{Column, DataFrame, Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object MyRecords {

  def createKey(k1: String, k2: String, k3: String): String = {
    Seq(k1, k2, k3).iterator.map(r => if (r == null) "" else r.trim.toUpperCase).mkString("")
  }

  def main(args: Array[String]): Unit = {
    val df1FilePath = args(0)
    val df2FilePath = args(1)

    val sc = new SparkContext(new SparkConf())
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df1 = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("delimiter", "\t")
      .load(df1FilePath).as("one")
    df1.registerTempTable("df1")

    val df2 = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true").option("delimiter", "\t")
      .load(df2FilePath)

    // Rename v0..v32 to y_v0..y_v32 to avoid column-name clashes after the join
    val df2Renamed = df2.select((0 to 32).map(i => col(s"v$i").as(s"y_v$i")): _*).as("two")
    df2Renamed.registerTempTable("df2")

    val dfJoined = df1.join(df2Renamed, $"one.v0" === $"two.y_v0", "fullouter").as("j")
    dfJoined.registerTempTable("joined")

    val dfMerged = sqlContext.sql("SELECT * FROM joined").map(r =>
      if (r.getAs("y_v1") != null) {
        (createKey(r.getAs("y_v2"), r.getAs("y_v3"), r.getAs("y_v4")), r)
      } else {
        (createKey(r.getAs("v2"), r.getAs("v3"), r.getAs("v4")), r)
      })

    dfMerged.groupByKey().collect().foreach(println)

    sc.stop()
  }
}
Since all you do here is grouping by key, it is probably better to use groupByKey instead of aggregateByKey, especially one which creates a huge number of temporary objects like value :: Nil (why not simply value :: aggr?).
Since it doesn't perform map-side aggregation, it should put less pressure on the garbage collector (see SPARK-772).
See also: Is groupByKey ever preferred over reduceByKey
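The cost difference between the two combine functions can be seen with plain Scala lists, without Spark at all: aggr ::: (value :: Nil) copies the entire accumulator on every element (quadratic overall), while value :: aggr prepends in constant time, and a single reverse at the end restores the original order. A minimal sketch (the object and method names are made up for illustration):

```scala
object AppendVsPrepend {
  // Slow variant: each ::: copies the whole accumulator built so far (O(n^2) total)
  def viaAppend(values: List[Int]): List[Int] =
    values.foldLeft(List.empty[Int])((aggr, v) => aggr ::: (v :: Nil))

  // Fast variant: :: prepends in O(1), with one reverse at the end (O(n) total)
  def viaPrepend(values: List[Int]): List[Int] =
    values.foldLeft(List.empty[Int])((aggr, v) => v :: aggr).reverse

  def main(args: Array[String]): Unit = {
    val values = (1 to 5).toList
    assert(viaAppend(values) == viaPrepend(values))
    println(viaPrepend(values).mkString(","))  // prints 1,2,3,4,5
  }
}
```

Both build the same list; only the amount of copying differs, which is exactly what matters when the per-key lists grow large.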
EDIT:
Regarding the code you've provided in the update, it really doesn't make sense. If you want to use DataFrames, there is no reason to convert the data to RDDs in the first place. Moreover, you duplicate your data by keeping both the Strings and the cast values, increasing memory usage and adding GC pressure. It looks like what you need is roughly something like this (with a little help from spark-csv):
// Load data, optionally add .option("inferSchema", "true")
val df1 = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", "\t")
.load(file1Path)
val df2 = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", "\t")
.load(file2Path)
// Join and cache
val df = df1.join(
  df2,
  // Join condition
  df1("foo") === df2("foo") &&
    df1("bar") === df2("bar") &&
    df1("baz") === df2("baz"),
  "fullouter")
df.registerTempTable("df")
sqlContext.cacheTable("df")
// Perform all the required casting using safe cast methods
// and replace existing columns
df.withColumn("some_column", $"some_column".cast(IntegerType))
Any aggregation you may need can be executed on the DataFrame without physically grouping the data. If you want to subset, simply use where or filter.
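For instance, a grouped aggregation can run directly on the cached table from the snippet above; this is only a sketch, and foo, bar and some_column reuse the placeholder column names introduced there:

```scala
// Hypothetical aggregation on the cached "df" table: Spark's SQL engine groups
// and sums during the shuffle, never materializing per-key lists on the driver
val summary = sqlContext.table("df")
  .groupBy($"foo")
  .agg(count($"bar"), sum($"some_column".cast(IntegerType)))

// Subsetting is just a filter, no RDD round-trip needed
summary.where($"foo".isNotNull).show()
```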