
Recursive Dataframe operations

In my Spark application I would like to do operations on a dataframe in a loop and write the result to HDFS.

pseudocode:

var df = emptyDataframe
for (n <- 1 to 200000) {
  val someDf = read(n)
  df = df.mergeWith(someDf)
}
df.writeToHdfs()

In the above example I get good results when "mergeWith" does a unionAll.

However, when "mergeWith" does a (simple) join instead, the job gets really slow (>1 h with 2 executors of 4 cores each) and never finishes (the job aborts itself).
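Roughly, the two variants of mergeWith look like this (sketch only; "id" stands in for the real join key, which I haven't shown):

import org.apache.spark.sql.DataFrame

// Union-based merge: no shuffle, the plan stays a flat chain of unions.
def mergeWithUnion(acc: DataFrame, next: DataFrame): DataFrame =
  acc.unionAll(next)

// Join-based merge: every iteration adds another shuffle stage to the plan.
// "id" is a placeholder for the real join key.
def mergeWithJoin(acc: DataFrame, next: DataFrame): DataFrame =
  acc.join(next, Seq("id"), "outer")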

In my scenario I throw in ~50 iterations with files that each contain only ~1 MB of text data.

Because the order of the merges is important in my case, I suspect this is due to the DAG generation, causing the whole thing to be executed only at the moment I store the data away.

Right now I'm attempting to use a .persist on the merged dataframe, but that also seems to go rather slowly.
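What I'm attempting is roughly the persist-plus-action pattern (sketch only): persist() by itself is lazy, so a count() is what actually triggers the evaluation and caching.

df = df.persist()   // only marks the frame for caching; nothing runs yet
df.count()          // action that forces the DAG built so far to execute and be cached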

EDIT:

As the job was running I noticed that (even though I did a count and a .persist) the dataframe in memory didn't look like a static dataframe. It looked like a strung-together path of all the merges it had been doing, effectively slowing the job down linearly.

Am I right to assume the var df is the culprit of this?

Spiraling out of control

Breakdown of the issue as I see it:

dfA = empty
dfC = dfA.increment(dfB)
dfD = dfC.increment(dfN)....

Where I would expect DFs A, C and D to be concrete objects, Spark sees it differently and does not care whether I persist or repartition or not. To Spark it looks like this:

dfA = empty
dfC = dfA incremented with df B
dfD = ((dfA incremented with df B) incremented with dfN)....

UPDATE 2:

To get around persisting not working on the DFs, I could "break" the lineage by converting the DF to an RDD and back again. This has a bit of overhead, but an acceptable one (the job finishes in minutes rather than hours/never). I'll run some more tests on persisting and formulate an answer in the form of a workaround.
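The lineage-breaking round trip looks roughly like this (sketch, using the same hiveContext as in the code further down):

import org.apache.spark.sql.DataFrame

// Going DF -> RDD -> DF drops the logical plan built up so far,
// so Spark treats the result as a fresh DataFrame instead of a growing chain of merges.
def breakLineage(df: DataFrame): DataFrame =
  hiveContext.createDataFrame(df.rdd, df.schema)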

Result: this only seems to fix the issue on the surface. In reality I'm back at square one and get OOM exceptions: java.lang.OutOfMemoryError: GC overhead limit exceeded.

If you have code like this:

import sqlContext.implicits._   // needed for .toDF()

var df = sc.parallelize(Seq(1)).toDF()

for (i <- 1 to 200000) {
  val df_add = sc.parallelize(Seq(i)).toDF()
  df = df.unionAll(df_add)      // each union adds df_add's partitions onto df
}

Then df will have 400000 partitions afterwards, which makes the following actions inefficient (because you have one task for each partition).

Try to reduce the number of partitions to e.g. 200 before persisting the dataframe (e.g. using df.coalesce(200).write.saveAsTable(...)).
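A sketch of that (200 and the table name are placeholders, not tuned values):

// Shrink the partition count before the final action so the write runs as ~200 tasks
// instead of one task per accumulated partition.
df.coalesce(200)
  .write
  .saveAsTable("merged_result")   // placeholder table name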

So the following is what I ended up using. It's performant enough for my use case; it works and does not need persisting.

It is very much a workaround rather than a fix.

import scala.collection.mutable.ArrayBuffer

val mutableBufferArray = ArrayBuffer[DataFrame]()
mutableBufferArray.append(hiveContext.emptyDataFrame)

for (n <- 1 to iterations) {                                // iterations is a placeholder for the loop bound
  val df = read(n)                                          // next small input frame, as in the pseudocode above
  val interm = mergeDataFrame(df, mutableBufferArray.last)
  val intermSchema = interm.schema
  val intermRDD = interm.rdd.repartition(8)                 // DF -> RDD round trip breaks the lineage

  mutableBufferArray.append(hiveContext.createDataFrame(intermRDD, intermSchema))
  mutableBufferArray.remove(0)                              // keep only the latest merged frame
}

This is how I wrestle Tungsten into compliance. By going from a DF to an RDD and back, I end up with a real object rather than one whole Tungsten-generated pipeline from front to back.

In my code I iterate a few times before writing out to disk (50-150 iterations seem to work best). That's where I clear out the bufferArray again to start over fresh.
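Roughly, that periodic flush looks like this (sketch; batchSize and the output path are placeholders, and Parquet with append mode is just one way to write it out):

// Every batchSize iterations: write out the current merged frame and reset the buffer.
if (n % batchSize == 0) {
  mutableBufferArray.last.write.mode("append").parquet("/some/output/path")
  mutableBufferArray.clear()
  mutableBufferArray.append(hiveContext.emptyDataFrame)
}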
