Spark structured streaming best VMs

Has anyone found which VM types work best for Databricks clusters running Spark Structured Streaming?

I tested the Fv2 series (F32_v2), but most of the jobs ran into memory spill. Given that, would it make more sense to switch to memory-optimized VMs or to add more compute-optimized VMs?

We are also looking at how to improve the code, but as a general rule, have you found that some VM types work better for streaming jobs than others (for example, L-series vs. E-series vs. F-series)?

Thank you in advance

It depends on your use case. If you need more parallel processing, for example because the message queue you pull data from has many partitions, you can choose compute-optimized nodes so that more cores run in parallel and pull from the queue. If your workload is memory intensive, you can go for memory-optimized VMs.
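As a rough way to reason about that trade-off, here is a minimal Python sketch that compares memory per core and the worker count needed to give each source partition its own core. The VM figures are what I recall from Azure's published size tables (check the current documentation before relying on them), and the helper names `workers_for_partitions` and `meets_memory_target` are made up for illustration. It does not model executor overhead, state store size, or shuffle behaviour.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    name: str
    vcpus: int
    memory_gib: int

    @property
    def gib_per_vcpu(self) -> float:
        # Memory available per core, before any Spark or OS overhead.
        return self.memory_gib / self.vcpus


# Assumed specs; verify against the Azure VM size documentation.
CANDIDATES = [
    VmSize("Standard_F32s_v2", vcpus=32, memory_gib=64),   # compute optimized
    VmSize("Standard_E32s_v3", vcpus=32, memory_gib=256),  # memory optimized
    VmSize("Standard_L32s_v2", vcpus=32, memory_gib=256),  # storage optimized
]


def workers_for_partitions(vm: VmSize, source_partitions: int) -> int:
    """Workers needed so that every source partition can get its own core."""
    return math.ceil(source_partitions / vm.vcpus)


def meets_memory_target(vm: VmSize, min_gib_per_core: float) -> bool:
    """True if the VM offers at least the requested memory per core."""
    return vm.gib_per_vcpu >= min_gib_per_core


if __name__ == "__main__":
    partitions = 100        # hypothetical partition count on the queue
    target_gib = 4.0        # hypothetical per-core memory target
    for vm in CANDIDATES:
        print(
            vm.name,
            f"{vm.gib_per_vcpu:.0f} GiB/core,",
            f"{workers_for_partitions(vm, partitions)} workers,",
            "meets target" if meets_memory_target(vm, target_gib) else "below target",
        )
```

With these assumed figures, the F-series node has a quarter of the memory per core of the E-series or L-series node at the same core count, which fits the spill reported in the question. Adding more F-series workers raises total memory but not memory per core.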

This page describes benchmarking tests run on Databricks and may give you a rough idea: https://www.databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html

GitHub repo with .dbc files for benchmarking: https://github.com/databricks/benchmarks
