
Recursive DataFrame operations

In my Spark application I would like to operate on a DataFrame in a loop and write the result to HDFS.

pseudocode:

var df = emptyDataFrame
for (n <- 1 to 200000) {
  val someDf = read(n)
  df = df.mergeWith(someDf)
}
df.writeToHdfs()

In the above example I get good results when "mergeWith" does a unionAll.

However, when in "mergeWith" I do a (simple) join, the job gets really slow (>1 h with 2 executors with 4 cores each) and never finishes (the job aborts itself).

In my scenario I throw in ~50 iterations with files that contain only ~1 MB of text data each.

Because the order of merges matters in my case, I suspect this is due to DAG generation: the whole chain gets executed at the moment I store the data away.

Right now I'm attempting to use a .persist on the merged DataFrame, but that also seems to go rather slowly.
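Another way to truncate a long lineage (besides persisting) is RDD checkpointing. A minimal sketch of what that could look like for the loop above, assuming the `emptyDataFrame` / `read(n)` helpers from the pseudocode; the checkpoint directory and the interval of 50 are placeholder choices, not tested values:

```scala
// Sketch: periodically truncate the lineage via RDD checkpointing.
// "/tmp/spark-checkpoints" and the interval of 50 are assumptions.
sc.setCheckpointDir("/tmp/spark-checkpoints")

var df = emptyDataFrame
for (n <- 1 to 200000) {
  df = df.unionAll(read(n))
  if (n % 50 == 0) {
    val rdd = df.rdd
    rdd.checkpoint()   // mark the RDD for checkpointing
    rdd.count()        // an action forces the checkpoint to materialize
    // rebuild the DataFrame on top of the checkpointed (truncated) RDD
    df = sqlContext.createDataFrame(rdd, df.schema)
  }
}
```

After the `count()`, the checkpointed RDD's ancestry is cut off, so the plan no longer nests every earlier merge.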

EDIT:

While the job was running I noticed (even though I did a count and a .persist) that the DataFrame in memory didn't look like a static DataFrame. It looked like a strung-together path of all the merges it had been doing, effectively slowing the job down linearly.

Am I right to assume the var df is the culprit here?

Spiraling out of control

Breakdown of the issue as I see it:

dfA = empty
dfC = dfA.increment(dfB)
dfD = dfC.increment(dfN)....

Where I would expect DataFrames A, C and D to be objects, Spark sees things differently and doesn't care whether I persist or repartition or not. To Spark it looks like this:

dfA = empty
dfC = dfA incremented with df B
dfD = ((dfA incremented with df B) incremented with dfN)....
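One way to confirm this nesting is to print the query plan after a few iterations, e.g. from a Spark shell:

```scala
// After several merges, the printed plan shows every earlier
// union/join nested inside the current one, growing each iteration.
df.explain(true)  // prints the logical and physical plans to stdout
```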

UPDATE 2

To get around persisting not working on DFs, I could "break" the lineage by converting the DF to an RDD and back again. This has a little overhead, but an acceptable one (the job finishes in minutes rather than hours/never). I'll run some more tests on persisting and formulate an answer in the form of a workaround.

Result: this only seems to fix the issues on the surface. In reality I'm back at square one and get OOM exceptions: java.lang.OutOfMemoryError: GC overhead limit exceeded

If you have code like this:

var df = sc.parallelize(Seq(1)).toDF()

for (i <- 1 to 200000) {
  val df_add = sc.parallelize(Seq(i)).toDF()
  df = df.unionAll(df_add)
}

then df will have 400000 partitions afterwards, which makes subsequent actions inefficient (because you have one task per partition).

Try to reduce the number of partitions to e.g. 200 before persisting the DataFrame (using e.g. df.coalesce(200).write.saveAsTable(....)).
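To see whether this is what's happening, the partition count can be inspected before and after compacting. A short sketch, where the table name is a placeholder:

```scala
// Inspect how many partitions the accumulated DataFrame has; after
// many unionAlls this number grows with every iteration.
println(df.rdd.partitions.length)

// coalesce(200) reduces the partition count without a full shuffle,
// so the write produces a manageable number of tasks and files.
df.coalesce(200).write.saveAsTable("merged_result")
```

Since coalesce only merges existing partitions, it is cheaper than repartition here; repartition would be needed only if the data were badly skewed.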

So the following is what I ended up using. It's performant enough for my use case; it works and does not need persisting.

It is very much a workaround rather than a fix.

import scala.collection.mutable.ArrayBuffer

val mutableBufferArray = ArrayBuffer[DataFrame]()
mutableBufferArray.append(hiveContext.emptyDataFrame)

for loop {

  val interm = mergeDataFrame(df, mutableBufferArray.last)
  val intermSchema = interm.schema
  // dropping to the RDD level and rebuilding the DataFrame breaks the lineage
  val intermRDD = interm.rdd.repartition(8)

  mutableBufferArray.append(hiveContext.createDataFrame(intermRDD, intermSchema))
  mutableBufferArray.remove(0)

}

This is how I wrestle Tungsten into compliance. By going from a DF to an RDD and back, I end up with a real object rather than one whole Tungsten-generated process pipe running from front to back.

In my code I iterate a few times before writing out to disk (50-150 iterations seem to work best). That's where I clear out the bufferArray again to start fresh.
