
Spark window partition function taking forever to complete

Given a dataframe, I am trying to compute how many times I have seen an emailId in the past 30 days. The main logic in my function is the following:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
import spark.implicits._  // assuming the SparkSession is named `spark`

val NumberOfSecondsIn30Days = 30L * 24 * 60 * 60

val new_df = df
  .withColumn("transaction_timestamp", unix_timestamp($"timestamp").cast(LongType))

val winSpec = Window
  .partitionBy("email")
  .orderBy(col("transaction_timestamp"))
  .rangeBetween(-NumberOfSecondsIn30Days, Window.currentRow)

val resultDF = new_df
  .filter(col("condition"))
  .withColumn("count", count(col("email")).over(winSpec))

The config:

spark.executor.cores=5

So, I can see 5 stages that have window functions in them. Some of those stages complete very quickly (in a few seconds), but 2 of them did not finish even after 3 hours, stuck at the last few tasks (progressing very slowly).

This looks like a data skew problem to me: if I remove all rows containing the 5 highest-frequency email ids from the dataset, the job finishes quickly (in less than 5 minutes).
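
The skew is easy to confirm with a per-email frequency count, e.g. (a sketch against the same df):

// Top emails by row count; the handful at the top dominate when the data is skewed.
df.groupBy("email")
  .count()
  .orderBy(desc("count"))
  .show(10, false)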

If I try to use some other key within window partitionBy, the job finishes in a few minutes:

 Window.partitionBy("email", "date")

But obviously that produces wrong counts, so it is not an acceptable solution.

I have tried various other Spark settings, throwing more memory, cores, parallelism, etc. at the job, and none of them seemed to help.

Spark Version: 2.2

Current Spark configuration:

-executor-memory: 100G
-executor-cores: 5
-driver-memory: 80G
-spark.executor.memory=100g

Using machines with 16 cores and 128 GB of memory each. Maximum number of nodes: up to 500.
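
(For reference, these settings roughly correspond to spark-submit flags like the following; a sketch, not the exact invocation:)

spark-submit \
  --executor-memory 100G \
  --executor-cores 5 \
  --driver-memory 80G \
  --conf spark.executor.memory=100g \
  ...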

What would be the right way to tackle this problem?

Update: Just to give more context, here is the original dataframe and the corresponding expected output dataframe:

 val df = Seq(
      ("a@gmail.com", "2019-10-01 00:04:00"),
      ("a@gmail.com", "2019-11-02 01:04:00"), 
      ("a@gmail.com", "2019-11-22 02:04:00"),
      ("a@gmail.com", "2019-11-22 05:04:00"),
      ("a@gmail.com", "2019-12-02 03:04:00"),
      ("a@gmail.com", "2020-01-01 04:04:00"),
      ("a@gmail.com", "2020-03-11 05:04:00"),
      ("a@gmail.com", "2020-04-05 12:04:00"),
      ("b@gmail.com", "2020-05-03 03:04:00")  
    ).toDF("email", "transaction_timestamp")


val expectedDF = Seq(
      ("a@gmail.com", "2019-10-01 00:04:00", 1),
      ("a@gmail.com", "2019-11-02 01:04:00", 1), // prev one falls outside of last 30 days win
      ("a@gmail.com", "2019-11-22 02:04:00", 2),
      ("a@gmail.com", "2019-11-22 05:04:00", 3),
      ("a@gmail.com", "2019-12-02 03:04:00", 3),
      ("a@gmail.com", "2020-01-01 04:04:00", 1),
      ("a@gmail.com", "2020-03-11 05:04:00", 1),
      ("a@gmail.com", "2020-04-05 12:04:00", 2),
      ("b@gmail.com", "2020-05-03 03:04:00", 1) // new email
).toDF("email", "transaction_timestamp", "count")

You are right, this is a data skew issue, and reducing the window size would help a lot: to get information about only the last 30 days, you do not need to go back to the very beginning of time. Then again, if you build a single window keyed by a time index, the calculation will be wrong at the beginning of each window, since it does not have access to the previous window.

What I propose is to build one index that is incremented every 30 days and two overlapping windows of size 60 days as shown in the following figure:

[Figure: overlapping windows]

To understand how this works, consider (as shown in the figure) a data point with index=2. With a window of size 30 days, that point would need to access data within its own window and within the previous one, which is not possible. This is why we build larger windows, so that all the needed data can be reached. If we consider win1, we have the same problem as with the 30-day index. If we consider win2, however, all the data is available in the partition with index2 = 1.

For a point with index 3, we would use win1; for a point with index 4, win2, and so on. Basically, for even indices we use win2, and for odd indices we use win1. This approach considerably reduces the maximum partition size and thus the maximum amount of data handled in a single task.
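
As a quick numeric sanity check (using days instead of seconds so the numbers stay small; the two timestamps below are made up for illustration):

// Day 43 is 20 days before day 63, so it must be visible from day 63.
val winSizeDays = 30L
for (t <- Seq(43L, 63L)) {
  val index  = t / winSizeDays                       // 1, then 2
  val index1 = t / (winSizeDays * 2)                 // 0, then 1  -> different win1 partitions
  val index2 = (t + winSizeDays) / (winSizeDays * 2) // 1, then 1  -> same win2 partition
  println(s"t=$t index=$index index1=$index1 index2=$index2")
}

Since day 63 has an even index (2), win2 is used, and both rows share index2 = 1, so the 30-day lookback sees day 43; under win1 they would land in different partitions.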

The code is just a translation of what was explained above:

val winSize = NumberOfSecondsIn30Days

val win1 = Window
    .partitionBy("email", "index1")
    .orderBy(col("transaction_timestamp"))
    .rangeBetween(-winSize, Window.currentRow)
val win2 = Window
    .partitionBy("email", "index2")
    .orderBy(col("transaction_timestamp"))
    .rangeBetween(-winSize, Window.currentRow)

val indexed_df = new_df
    // the group by is only there in case there are duplicated timestamps,
    // so as to lighten the size of the windows
    .groupBy("email", "transaction_timestamp")
    .count()
    .withColumn("index",
        'transaction_timestamp / winSize cast "long")
    .withColumn("index1",
        ('transaction_timestamp / (winSize * 2)) cast "long")
    .withColumn("index2",
        (('transaction_timestamp + winSize) / (winSize * 2)) cast "long")

val result = indexed_df
    .withColumn("count", when(('index mod 2) === 0, sum('count) over win2)
                                      .otherwise(sum('count) over win1))
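
Note that result has one row per (email, transaction_timestamp) because of the groupBy. If you need the count attached to every original row, you can join it back, for example (a sketch reusing new_df from the question; perRowCounts is just an illustrative name):

// Attach the windowed count back to the original rows.
val perRowCounts = new_df.join(
  result.select("email", "transaction_timestamp", "count"),
  Seq("email", "transaction_timestamp"))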

Some of your partitions are probably too large, because for some emails there is too much data in one month.

To fix this, you can create a new dataframe with only the emails and the timestamps. Then you group by email and timestamp, count the number of lines, and compute the window on hopefully much less data. The computation will be sped up if timestamps tend to be duplicated, that is, if df.count is much greater than df.select("email", "timestamp").distinct.count. If that is not the case, you can truncate the timestamps at the cost of losing some precision. This way, instead of counting the number of occurrences within the last 30 days (give or take one second, since timestamps are in seconds), you would count the number of occurrences give or take one minute, one hour, or even one day, depending on your need. You would lose a bit of precision but speed up the computation a lot. And the more precision you give up, the more speed you gain.

The code would look like this:

// 3600 means hourly precision.
// Set to 60 for minute precision, 1 for second precision, 24*3600 for one day.
// Note that even precisionLoss = 1 might make you gain speed depending on
// the distribution of your data
val precisionLoss = 3600 
val win_size = NumberOfSecondsIn30Days / precisionLoss

val winSpec = Window
  .partitionBy("email")
  .orderBy("truncated_timestamp")
  .rangeBetween(-win_size, Window.currentRow)

val new_df = df.withColumn("truncated_timestamp",
                      unix_timestamp($"timestamp") / precisionLoss cast "long")

val counts = new_df
  .groupBy("email", "truncated_timestamp")
  .count
  .withColumn("count", sum('count) over winSpec)

val result = new_df
  .join(counts, Seq("email", "truncated_timestamp"))

We can still avoid using a Window for this.

For the above-mentioned df:

val df2 = df.withColumn("timestamp", unix_timestamp($"transaction_timestamp").cast(LongType))

val df3 = df2.withColumnRenamed("timestamp", "timestamp_2").drop("transaction_timestamp")

val finalCountDf = df2.join(df3, Seq("email"))
  .withColumn("is_within_30",
    when($"timestamp" - $"timestamp_2" < NumberOfSecondsIn30Days &&
         $"timestamp" - $"timestamp_2" > 0, 1).otherwise(0))
  .groupBy("email", "transaction_timestamp")
  .agg(sum("is_within_30") as "count")
  .withColumn("count", $"count" + 1)

finalCountDf.orderBy("transaction_timestamp").show
/*
+-----------+---------------------+-----+
|      email|transaction_timestamp|count|
+-----------+---------------------+-----+
|a@gmail.com|  2019-10-01 00:04:00|    1|
|a@gmail.com|  2019-11-02 01:04:00|    1|
|a@gmail.com|  2019-11-22 02:04:00|    2|
|a@gmail.com|  2019-11-22 05:04:00|    3|
|a@gmail.com|  2019-12-02 03:04:00|    3|
|a@gmail.com|  2020-01-01 04:04:00|    1|
|a@gmail.com|  2020-03-11 05:04:00|    1|
|a@gmail.com|  2020-04-05 12:04:00|    2|
|b@gmail.com|  2020-05-03 03:04:00|    1|
+-----------+---------------------+-----+
*/

Explanation:

  • Make pairs of the timestamps based on "email" (join on email)
  • Compare each pair and check whether it lies within the last 30 days: if so, mark it as 1, otherwise 0
  • Sum up the marks per "email" and "transaction_timestamp", then add 1 so that the row counts itself

Assumption: (email, transaction_timestamp) is distinct. If it is not, we can handle that by tagging rows with monotonically_increasing_id, as sketched below.
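
A possible sketch of that (row_id, df2b, df3b and finalCountDf2 are illustrative names; monotonically_increasing_id comes from org.apache.spark.sql.functions):

// Tag each row with a unique id so a row is never compared against itself,
// then let equal timestamps from *other* rows count (>= 0 instead of > 0).
val df2b = df2.withColumn("row_id", monotonically_increasing_id())
val df3b = df2b
  .withColumnRenamed("timestamp", "timestamp_2")
  .withColumnRenamed("row_id", "row_id_2")
  .drop("transaction_timestamp")

val finalCountDf2 = df2b.join(df3b, Seq("email"))
  .withColumn("is_within_30",
    when($"row_id" =!= $"row_id_2" &&
         $"timestamp" - $"timestamp_2" >= 0 &&
         $"timestamp" - $"timestamp_2" < NumberOfSecondsIn30Days, 1).otherwise(0))
  .groupBy("email", "transaction_timestamp", "row_id")  // one row per original row
  .agg(sum("is_within_30") as "count")
  .withColumn("count", $"count" + 1)                    // +1 for the row itself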
