How can I reduce compute costs and waste in my Foundry transforms?

We have a lot of pretty complicated data pipelines and the amount of compute being consumed has been steadily rising every month. How can I figure out where compute is being wasted and make things more efficient?

So, this will be a somewhat involved answer, but hopefully I can point people to a useful set of resources to help them manage waste.

Let's start in the obvious place. Compute profiles: Engineers commonly increase executor memory to solve an executor OOM, but the cause of that OOM is often data skew. Try to mitigate the skew first and increase memory second. Memory is relatively cheap, but when you increase it you do so on every executor, which gets expensive across a large number of executors. Usually only a single executor is OOMing, and 90% of the time it is due to skew.
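To make the skew point concrete, here is a minimal PySpark sketch of one common mitigation, key salting, instead of raising memory. The column names, dataset shapes, and salt factor are all hypothetical and would need tuning for your data:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 16  # hypothetical; tune to roughly match your executor count

def salted_join(skewed_df, small_df, key="customer_id"):
    """Spread a hot join key across SALT_BUCKETS partitions instead of
    raising executor memory to absorb the one oversized partition."""
    # Add a random salt to the skewed side so one hot key becomes 16 keys.
    left = skewed_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
    # Replicate the small side once per salt value so every salted key matches.
    right = small_df.withColumn(
        "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
    )
    return left.join(right, on=[key, "salt"], how="inner").drop("salt")
```

The hot key's rows now land on many executors instead of one, so the job fits in the default memory profile rather than forcing a fleet-wide memory bump.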

Local Spark: You can use the compute profile KUBERNETES_NO_EXECUTORS on small transforms (a rule of thumb might be <50 MB of input and output data), which means your transform will run entirely on the driver (reminder on drivers vs executors). Two fewer modules are spun up, reducing the resources consumed by roughly 66%. A job this small often does not need executors, and using them just causes shuffles and other wasted compute. When you're dealing with small data, use local Spark: your jobs will spin up faster and cost less.
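In a Python transform this profile is applied with the @configure decorator. A minimal sketch, assuming the transforms-python API and that the profile is enabled on your enrollment (the dataset paths are hypothetical):

```python
from pyspark.sql import functions as F
from transforms.api import configure, transform_df, Input, Output

@configure(profile=["KUBERNETES_NO_EXECUTORS"])  # run Spark locally on the driver
@transform_df(
    Output("/Project/folder/clean_lookup"),      # hypothetical dataset paths
    source=Input("/Project/folder/raw_lookup"),
)
def compute(source):
    # A tiny lookup-table cleanup; with so little data, executors would
    # only add spin-up time and shuffle overhead.
    return source.dropDuplicates().filter(F.col("is_active"))
```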

Views: Docs on views have not been added to the public docs yet, but you can find them in your platform docs at documentation/product/views/overview. Views are a really useful way to reduce compute usage by eliminating the need for a transform altogether. Anywhere you have an identity transform that only moves a dataset between projects, or a transform that exists only to union several other datasets, that transform can be replaced by a view. A view holds the information about its backing datasets and files rather than containing any files itself, so it requires no processing of its own.

Incremental pipelines: Where data does not need to change after it is processed, you may be able to use an incremental pipeline. You then process only the new data as it arrives, without reprocessing the entire history. This is probably the most powerful tool for reducing compute consumption in large, intensive pipelines with high data throughput.
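A minimal sketch of what this looks like in a Python transform, assuming the transforms-python incremental decorator and hypothetical dataset paths:

```python
from transforms.api import incremental, transform, Input, Output

@incremental()  # only unprocessed input rows are read on each build
@transform(
    out=Output("/Project/folder/events_enriched"),  # hypothetical paths
    source=Input("/Project/folder/events_raw"),
)
def compute(source, out):
    # In incremental mode, dataframe() defaults to the 'added' read mode,
    # so this sees only rows appended since the last successful build.
    new_rows = source.dataframe()
    # write_dataframe appends to the existing output instead of rewriting it.
    out.write_dataframe(new_rows.dropDuplicates())
```

Note that this only pays off when the transform's logic is genuinely append-only; anything that needs to revisit historical rows (late-arriving updates, full-table aggregations) will force snapshot builds and erase the savings.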
