
Sharing data across executors in Apache Spark

My Spark project (written in Java) needs to access the results of SELECT queries on different tables from the executors.

One solution to this problem is:

  1. Create a tempView.
  2. Select the required columns.
  3. Convert the DataFrame to a Map using forEach.
  4. Pass that Map to the executors as a broadcast variable.
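
The four steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name, the view name `lookup_view`, and the columns `id` and `name` are hypothetical, both columns are cast to strings for simplicity, and the rows are brought to the driver with `collectAsList()` rather than `forEach`, because the Map has to be built on the driver before it can be broadcast.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BroadcastMapSketch {

    // Builds a driver-side Map from two columns of a registered view and broadcasts it.
    // viewName, keyCol and valueCol are hypothetical, caller-supplied names.
    static Broadcast<Map<String, String>> broadcastLookup(
            SparkSession spark, String viewName, String keyCol, String valueCol) {
        // Step 2: select only the required columns (cast to STRING to keep the Map simple).
        Dataset<Row> selected = spark.sql(
                "SELECT CAST(" + keyCol + " AS STRING), CAST(" + valueCol + " AS STRING) FROM " + viewName);

        // Step 3: collect the rows to the driver and copy them into a Map.
        // This only works when the selected data fits in driver memory.
        List<Row> rows = selected.collectAsList();
        Map<String, String> lookup = new HashMap<>();
        for (Row row : rows) {
            lookup.put(row.getString(0), row.getString(1));
        }

        // Step 4: ship one read-only copy of the Map to every executor.
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        return jsc.broadcast(lookup);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("broadcast-map-sketch")
                .master("local[*]") // local mode, only for trying the sketch out
                .getOrCreate();

        // Step 1: register a small example DataFrame as a temp view.
        spark.range(1, 4)
                .selectExpr("id", "concat('name_', CAST(id AS STRING)) AS name")
                .createOrReplaceTempView("lookup_view");

        Broadcast<Map<String, String>> lookup = broadcastLookup(spark, "lookup_view", "id", "name");
        System.out.println(lookup.value());

        spark.stop();
    }
}
```

Inside a transformation, executor code would read the data through `lookup.value()`.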

However, I have found that:

  1. There are many complex queries whose results can't be stored directly in a Map.
  2. The tables are very large, so building a large Map and passing it to the executors as a broadcast variable doesn't sound efficient.

Instead, can we load the tables in memory using load so that they can be shared across executors?

Is the method void org.apache.spark.sql.Dataset.createOrReplaceTempView(String viewName)

or void org.apache.spark.sql.Dataset.createGlobalTempView(String viewName) throws AnalysisException

useful for this purpose?

Spark version: 2.3.0

You can broadcast a DataFrame. See documentation
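
As a hedged illustration of broadcasting a DataFrame, the sketch below uses `org.apache.spark.sql.functions.broadcast`, which marks a DataFrame as a candidate for a broadcast join. The class name, column names, and example data are hypothetical. The hint is intended for a DataFrame small enough to fit in each executor's memory, so it may not suit the very large tables mentioned in the question.

```java
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BroadcastJoinSketch {

    // Joins a large DataFrame with a small one on keyCol,
    // hinting Spark to send the small side to every executor.
    static Dataset<Row> joinWithBroadcast(Dataset<Row> large, Dataset<Row> small, String keyCol) {
        return large.join(broadcast(small), keyCol);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("broadcast-join-sketch")
                .master("local[*]") // local mode, only for trying the sketch out
                .getOrCreate();

        // Hypothetical data: a larger fact table and a small lookup table.
        Dataset<Row> large = spark.range(0, 1000).selectExpr("id % 3 AS key", "id AS value");
        Dataset<Row> small = spark.range(0, 3)
                .selectExpr("id AS key", "concat('label_', CAST(id AS STRING)) AS label");

        Dataset<Row> joined = joinWithBroadcast(large, small, "key");
        joined.explain(); // prints the physical plan, which shows the join strategy chosen
        joined.show(5);

        spark.stop();
    }
}
```

Unlike the Map approach, this keeps the data as a DataFrame, so the result of a complex query can be joined directly without being converted to a Java collection first.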
