
Scala Spark isin broadcast list

I'm trying to perform an isin filter as optimized as possible. Is there a way to broadcast collList using the Scala API?

Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProviders will push down the values.

  val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
  // collList.size == 200,000
  val retTable = df.filter(col("col1").isin(collList: _*))

The list I'm passing to the isin method has up to ~200,000 unique elements.

I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters; it makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet; the base data is too big and queries work on around 1% of it). I already measured everything, and it saved me around 30 minutes of execution time :). Plus, my method already takes care of the case where the isin list is larger than 200,000.

My problem is, I'm getting some Spark "task is too big" (~8 MB per task) warnings. Everything works fine, so it's not a big deal, but I'm looking to remove them and also optimize.

I've tried the following, which does nothing, as I still get the warning (since the broadcast variable gets resolved in Scala and passed to varargs, I guess):

  val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
  val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))

And this one, which doesn't compile:

  val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
  val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))

And this one, which doesn't work (the "task too big" warning still appears):

  import org.apache.spark.sql.Column
  import org.apache.spark.sql.catalyst.expressions.In

  val broadcastedList = df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
  val filterBroadcasted = In(col("col1").expr, broadcastedList.value)
  val retTable = df.filter(new Column(filterBroadcasted))

Any ideas on how to broadcast this variable? (Hacks allowed.) Any alternative to isin that allows filter pushdown is also valid. I've seen some people doing it in PySpark, but the API is not the same.

PS: Changes to the storage are not possible. I know partitioning (it's already partitioned, but not by that field) and the like could help, but user inputs are totally random and the data is accessed and changed by many clients.

I'd opt for a DataFrame broadcast hash join in this case instead of a broadcast variable.

Prepare a DataFrame with the collectedDf("col1") collection list you want to filter with isin, and then use a join between the two DataFrames to filter the matching rows.

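A minimal sketch of what that could look like, reusing collList, df and "col1" from the question (keysDf and the left_semi join type are just illustrative choices):

  import org.apache.spark.sql.functions.broadcast

  val spark = df.sparkSession
  import spark.implicits._

  // One-column DataFrame holding the ~200,000 keys.
  val keysDf = collList.toSeq.toDF("col1")

  // left_semi keeps only the rows of df whose col1 appears in keysDf;
  // broadcast() hints Spark to ship keysDf to every executor.
  val retTable = df.join(broadcast(keysDf), Seq("col1"), "left_semi")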

I think it would be more efficient than isin since you have 200k entries to be filtered. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10 MB by default). AFAIK you can go up to 200 MB or 300 MB depending on your requirements.
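For example, the threshold can be raised at runtime; the value is in bytes, and the 100 MB figure below is only an example:

  // Raise the broadcast-join threshold (bytes); 100 MB is an arbitrary example.
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)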

See this BHJ explanation of how it works.

Further reading: Spark efficiently filtering entries from big dataframe that exist in a small dataframe

I'll just live with the big tasks since I only use it twice in my program (but it saves a lot of time) and I can afford it, but if someone else needs it badly... well, this seems to be the path.

The best alternatives I found for pushing down big arrays:

  1. Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcast leftovers behind, but as long as your app is not streaming it shouldn't be a problem; or you can keep them in a global list and clean them up after a while.
  2. Add a filter in Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417 ) which allows broadcast pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "pushdown" (you can do this by adding a new rule), and then rewrite your RDD/Relation provider so it can exploit the fact that the variable is broadcast.
  3. Use coalesce(X) after reading to decrease the number of tasks; it can work sometimes, depending on how the RelationProvider/RDD is implemented (see the sketch after this list).
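A rough sketch of option 3, reusing df and collList from the question; the partition count is arbitrary and whether the warning actually disappears depends on how the source is implemented:

  // Coalescing right after the scan collapses the stage into fewer tasks,
  // so the large isin list is serialized into fewer task closures.
  // 32 is an arbitrary example; tune it to your data volume.
  val retTable = df
    .filter(col("col1").isin(collList: _*))
    .coalesce(32)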
