
In Spark Streaming, can we store data (a HashMap) in executor memory?

I want to maintain a long-lived cache (a HashMap) in Spark executor memory, so that all tasks running on that executor (at different times) can look values up in it and also update it.

Is this possible in Spark streaming?

I'm not sure there is a way to store custom data structures permanently on executors. My suggestion is to use an external caching system (such as Redis, Memcached, or in some cases even ZooKeeper). You can connect to that system from methods like foreachPartition or mapPartitions while processing an RDD/DataFrame, which reduces the number of connections to one per partition.
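As a rough illustration of the per-partition connection pattern, here is a sketch in Scala using the Jedis Redis client. The host, port, and the shape of the stream (a DStream of key/value string pairs) are assumptions for the example, not part of the original question:

```scala
import redis.clients.jedis.Jedis
import org.apache.spark.streaming.dstream.DStream

// Sketch: open one Redis connection per partition via foreachPartition,
// instead of one per record. `stream` is assumed to be a
// DStream[(String, String)] built elsewhere in your application.
def updateCache(stream: DStream[(String, String)]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val jedis = new Jedis("redis-host", 6379) // hypothetical host/port
      try {
        records.foreach { case (key, value) =>
          val cached = Option(jedis.get(key)) // lookup shared state
          // ...use `cached` as needed, then update the shared cache:
          jedis.set(key, value)
        }
      } finally {
        jedis.close() // always release the connection for this partition
      }
    }
  }
}
```

Because the Jedis object is created inside foreachPartition, it lives on the executor and is never serialized from the driver, which avoids the usual "task not serializable" pitfall with connection objects.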

This works well because both Redis and Memcached are in-memory stores, so there is no overhead from spilling data to disk.

The two other ways to share state across executors are Accumulators and Broadcast variables. All executors can write to an Accumulator, but only the driver can read its value. A Broadcast variable is written once on the driver and then distributed to executors as a read-only data structure. Neither fits your use case, so the external cache described above is the only viable approach I can see.
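A minimal Scala sketch of both mechanisms, showing why neither gives a read-write executor-side cache (the SparkContext and the sample data are placeholders for illustration):

```scala
import org.apache.spark.SparkContext

// Assume an existing SparkContext from your application.
val sc: SparkContext = ???

// Accumulator: executors can only add to it; the value is readable
// on the driver, not from within tasks.
val hits = sc.longAccumulator("cache-hits")
sc.parallelize(1 to 100).foreach(_ => hits.add(1))
println(hits.value) // driver-side read only

// Broadcast variable: written once on the driver, then read-only
// on every executor; tasks cannot update it.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val resolved = sc.parallelize(Seq("a", "b"))
  .map(k => lookup.value.getOrElse(k, 0))
  .collect()
```

So an Accumulator is write-only from the executors' point of view, and a Broadcast variable is read-only there; a cache needs both reads and writes on the executor, which is exactly the gap the external store fills.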

