Spark: How to manage a big aggregateByKey on a single machine

I am using Scala and Spark to manage a large number of records, and each of those records has the form:

single record => (String, Row)

and every Row is composed of 45 values of different kinds (String, Integer, Long).

To aggregate them I am using:

myRecords.aggregateByKey(List[Any]())(
  (aggr, value) => aggr ::: (value :: Nil),
  (aggr1, aggr2) => aggr1 ::: aggr2
)

The problem is that I constantly get messages like:

15/11/21 17:54:14 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 147767 ms exceeds timeout 120000 ms

15/11/21 17:54:14 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 147767 ms

[Stage 3:====>               (875 + 24) / 3252]

15/11/21 17:57:10 WARN BlockManager: Putting block rdd_14_876 failed

...and finally...

15/11/21 18:00:27 ERROR Executor: Exception in task 876.0 in stage 3.0 (TID 5465)
java.lang.OutOfMemoryError: GC overhead limit exceeded

My guess is that the aggregate gets so big that matching the key of a new record takes more and more time, until a task hits a timeout because it cannot find the right place to add the record's value.

I played with different spark-submit parameters, such as:

spark.default.parallelism => increased to reduce the size of each task

spark.executor.memory => usually set much lower than the driver memory

spark.driver.memory => the whole available memory for the driver (it is a single machine, after all)

--master local[number of cores] 
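For reference, a single-machine invocation combining these settings might look like the sketch below. The numbers (8 cores, 12g driver memory, parallelism 64) and the placeholders are illustrative assumptions, not tested recommendations. Note that in local mode everything runs inside the driver JVM, so --driver-memory is the setting that actually matters and spark.executor.memory has essentially no effect.

# illustrative only: values are assumptions, adjust to the actual machine
spark-submit \
  --master "local[8]" \
  --driver-memory 12g \
  --conf spark.default.parallelism=64 \
  --class <main-class> \
  <application-jar> [application-arguments]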

Any idea how to get to the end of the process without out-of-memory errors or timeouts?

UPDATE

I am trying to merge two CSV files as follows:

1) join them on a CSV column
2) merge the joined rows, based on 3 column values
3) aggregate/group the joined & merged rows by the key built in 2)
4) do some processing on each single aggregate from 3)

This is the code:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel._
import org.apache.spark.sql.{Column, DataFrame, Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object MyRecords {

  def createKey(k1: String, k2: String, k3: String):String = {
    Seq(k1, k2, k3).iterator.map ( r => if (r == null) "" else r.trim.toUpperCase ).mkString ("")
  }

  def main(args: Array[String]): Unit = {

    val df1FilePath = args ( 0 )
    val df2FilePath = args ( 1 )

    val sc = new SparkContext ( new SparkConf ( ) )
    val sqlContext = new SQLContext ( sc )
    import sqlContext.implicits._

    val df1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "\t").load(df1FilePath).as("one")

    df1.registerTempTable("df1")

    val df2 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "\t").load(df2FilePath)

    val df2Renamed = df2.select(
      col ( "v0" ).as ( "y_v0" ),
      col ( "v1" ).as ( "y_v1" ),
      col ( "v2" ).as ( "y_v2" ),
      col ( "v3" ).as ( "y_v3" ),
      col ( "v4" ).as ( "y_v4" ),
      col ( "v5" ).as ( "y_v5" ),
      col ( "v6" ).as ( "y_v6" ),
      col ( "v7" ).as ( "y_v7" ),
      col ( "v8" ).as ( "y_v8" ),
      col ( "v9" ).as ( "y_v9" ),
      col ( "v10" ).as ( "y_v10" ),
      col ( "v11" ).as ( "y_v11" ),
      col ( "v12" ).as ( "y_v12" ),
      col ( "v13" ).as ( "y_v13" ),
      col ( "v14" ).as ( "y_v14" ),
      col ( "v15" ).as ( "y_15" ),
      col ( "v16" ).as ( "y_16" ),
      col ( "v17" ).as ( "y_17" ),
      col ( "v18" ).as ( "y_18" ),
      col ( "v19" ).as ( "y_19" ),
      col ( "v20" ).as ( "y_20" ),
      col ( "v21" ).as ( "y_21" ),
      col ( "v22" ).as ( "y_22" ),
      col ( "v23" ).as ( "y_23" ),
      col ( "v24" ).as ( "y_24" ),
      col ( "v25" ).as ( "y_25" ),
      col ( "v26" ).as ( "y_26" ),
      col ( "v27" ).as ( "y_27" ),
      col ( "v28" ).as ( "y_28" ),
      col ( "v29" ).as ( "y_29" ),
      col ( "v30" ).as ( "y_30" ),
      col ( "v31" ).as ( "y_31" ),
      col ( "v32" ).as ( "y_32" )
    ).as("two")

    df2Renamed.registerTempTable("df2")

    // df1 is aliased as "one", df2Renamed as "two"
    val dfJoined = df1.join( df2Renamed, $"one.v0" === $"two.y_v0", "fullouter" ).as("j")

    dfJoined.registerTempTable("joined")

    val dfMerged = sqlContext.sql("SELECT * FROM joined").map(r =>
      if (r.getAs("y_v1") != null) {
        (createKey (r.getAs("y_v2"), r.getAs("y_v3"), r.getAs("y_v4") ), r)
      } else {
        (createKey (r.getAs("v2"), r.getAs("v3"), r.getAs("v4") ), r)
      })

    dfMerged.groupByKey().collect().foreach(println)

    sc.stop()
  }
}

Since all you do here is grouping by key, it is better to use groupByKey instead of aggregateByKey, especially one which creates a huge number of temporary objects like value :: Nil (why not simply value :: aggr?).

Since groupByKey does not perform map-side aggregation, it should put less stress on the garbage collector (see SPARK-772).
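As a minimal sketch of the two alternatives (assuming myRecords is the RDD[(String, Row)] from the question, and using List[Row] instead of List[Any] since the values are Rows):

// Option 1: plain groupByKey; the values for each key arrive as an Iterable[Row]
val grouped = myRecords.groupByKey()

// Option 2: if aggregateByKey is kept, prepend to the accumulator instead of
// allocating a one-element list per record (the resulting list order is reversed)
val aggregated = myRecords.aggregateByKey(List.empty[Row])(
  (aggr, value) => value :: aggr,
  (aggr1, aggr2) => aggr1 ::: aggr2
)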

See also: Is groupByKey ever preferred over reduceByKey

Edit:

Regarding the code you've provided in the update, it doesn't really make sense. If you want to use DataFrames, there is no reason to group the data with RDDs in the first place. You also duplicate your data by keeping both the Strings and the casted values, which increases memory usage and puts additional stress on the GC. It looks like what you need is roughly something like this (with a little help from spark-csv):

// Load data, optionally add .option("inferSchema", "true")
val df1 = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .load(file1Path)

val df2 = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .load(file2Path)

// Join and cache
val df = df1.join(
  df2,
  // Join condition
  df1("foo") === df2("foo") &&
    df1("bar") === df2("bar") &&
    df1("baz") === df2("baz"),
  "fullouter")
df.registerTempTable("df")
sqlContext.cacheTable("df")

// Perform all the required casting using safe cast methods
// (cast yields null for values that cannot be parsed) and replace
// existing columns; withColumn returns a new DataFrame, so keep the result
val dfCast = df.withColumn("some_column", $"some_column".cast(IntegerType))

Any aggregations you need can be executed on the data frame without physically grouping the data. If you want a subset, simply use where or filter.
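For illustration only, continuing from the sketch above (dfCast and the placeholder columns foo and some_column are reused from it; the aggregate functions shown are arbitrary examples, not part of the original answer):

import org.apache.spark.sql.functions.{avg, count, max}

// Subset with where/filter, then aggregate directly on the DataFrame.
// Spark performs partial (map-side) aggregation internally, so no
// groupByKey/collect on the driver is needed.
// df1("foo") disambiguates the key column duplicated by the full outer join.
val summary = dfCast
  .where($"some_column".isNotNull)
  .groupBy(df1("foo"))
  .agg(
    count($"some_column").as("n"),
    avg($"some_column").as("avg_some_column"),
    max($"some_column").as("max_some_column")
  )

summary.show()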
