
Scala Spark isin broadcast list

I'm trying to perform an isin filter as optimized as possible. Is there a way to broadcast collList using the Scala API?

Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProvider will push the values down.

  val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
  //collList.size == 200,000
  val retTable = df.filter(col("col1").isin(collList: _*))

The list I'm passing to the isin method has up to ~200,000 unique elements.

I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters. It makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS + Parquet): the base data is too big and queries work on around 1% of it. I already measured everything, and it saved me around 30 minutes of execution time :). Plus my method already handles the case where the isin list is larger than 200,000 elements.

My problem is that I'm getting some Spark "task is too big" (~8 MB per task) warnings. Everything works fine, so it's not a big deal, but I'd like to remove them and also to optimize.

I've tried the following, which does nothing, as I still get the warning (since the broadcast variable gets resolved on the driver and passed as varargs, I guess):

  val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
  val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))

And this one, which doesn't compile:

  val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
  val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))

And this one, which doesn't work either (the task-too-big warning still appears):

  val broadcastedList = df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
  val filterBroadcasted = In(col("col1").expr, broadcastedList.value)
  val retTable = df.filter(new Column(filterBroadcasted))

Any ideas on how to broadcast this variable? (Hacks allowed.) Any alternative to isin that allows filter pushdown is also valid. I've seen some people doing it in PySpark, but the API is not the same.

PS: Changes to the storage are not possible. I know partitioning could help (the data is already partitioned, but not by that field), but user inputs are totally random and the data is accessed and changed by many clients.

I'd opt for a DataFrame broadcast hash join in this case instead of a broadcast variable.

Prepare a DataFrame from the collectedDf("col1") collection list you want to filter with isin, and then join the two DataFrames to keep only the matching rows.

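A minimal sketch of that join-based filter, assuming df and collectedDf from the question (broadcast is the standard hint from org.apache.spark.sql.functions, and left_semi keeps only the rows of df that have a match):

```scala
import org.apache.spark.sql.functions.broadcast

// Build a small one-column DataFrame holding the distinct filter values
val filterDf = collectedDf.select("col1").distinct()

// Hint Spark to broadcast the small side so the join is planned as a
// broadcast hash join and df itself does not need to be shuffled
val retTable = df.join(broadcast(filterDf), Seq("col1"), "left_semi")
```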

I think it would be more efficient than isin since you have 200k entries to filter. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10 MB by default). AFAIK you can go up to 200 MB or 300 MB depending on your requirements.
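Setting the threshold could look like this (a sketch, assuming an active SparkSession named spark; the 100 MB value is just an example, and the size is given in bytes):

```scala
// Raise the auto-broadcast threshold from the default 10 MB so a larger
// filter DataFrame still qualifies for a broadcast hash join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)
```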

See this BHJ explanation of how broadcast hash join works.

Further reading: Spark efficiently filtering entries from big dataframe that exist in a small dataframe

I'll just live with the big tasks, since I only use this twice in my program (but it saves a lot of time) and I can afford it. But if someone else needs it badly... well, this seems to be the path.

Best alternatives I found for big-array pushdown:

  1. Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcast leftovers behind, but as long as your app is not a streaming one it shouldn't be a problem; alternatively, you can keep the broadcast handles in a global list and clean them up after a while.
  2. Add a filter to Spark (I wrote something up at https://issues.apache.org/jira/browse/SPARK-31417 ) which allows broadcast pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom pushdown (you can do this by adding a new rule), and then rewrite your RDD/relation provider so it can exploit the fact that the variable is broadcast.
  3. Use coalesce(X) after reading to decrease the number of tasks. This can work sometimes; it depends on how the RelationProvider/RDD is implemented.
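Option 3 could be sketched like this (the partition count is a hypothetical example; whether it actually shrinks the per-task overhead depends on how the data source plans its partitions):

```scala
// Fewer partitions means the serialized isin literal list travels with
// fewer tasks, which can quiet the "task too big" warning
val retTable = df.coalesce(16) // 16 is an arbitrary example value
  .filter(col("col1").isin(collList: _*))
```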
