
Spark write performance: CSV vs snappy-ORC

If I need to write a DataFrame to disk, which format will perform better: CSV or ORC with Snappy?

On one hand, the CSV format avoids compression overhead; on the other hand, Snappy reduces the total number of bytes the write task has to produce. Please correct my assumptions here as well.

Note that my question is about write performance, not the storage point of view.
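For what it's worth, a minimal way to time the two writes directly might look like the sketch below. It assumes an existing SparkSession `spark` and DataFrame `df` (neither is given in the question), and uses placeholder output paths; caching the input first keeps recomputation from skewing the numbers.

```scala
// Hypothetical micro-benchmark sketch, not an authoritative measurement.
df.cache().count() // materialize df so the timed runs measure only the write

def timeWrite(label: String)(write: => Unit): Unit = {
  val start = System.nanoTime()
  write
  println(f"$label%-12s ${(System.nanoTime() - start) / 1e6}%.0f ms")
}

timeWrite("csv") {
  df.write.mode("overwrite").csv("/tmp/bench/csv") // placeholder path
}

timeWrite("orc+snappy") {
  df.write.mode("overwrite")
    .option("compression", "snappy") // Snappy is already Spark's default ORC codec
    .orc("/tmp/bench/orc")           // placeholder path
}
```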

Compression is about saving space, not performance, so the fact that you're using Snappy isn't really the relevant detail; you could just as well use LZ4 or ZSTD instead, for example.
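To illustrate that point: in Spark the codec is just a writer option, so swapping it changes nothing about the format itself. A sketch, again assuming a DataFrame `df` and placeholder paths (zstd support for ORC depends on your Spark/ORC versions):

```scala
// Same ORC format, different codecs; only the option value changes.
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/out/orc_zlib")
df.write.mode("overwrite").option("compression", "zstd").orc("/tmp/out/orc_zstd") // needs a recent Spark/ORC
df.write.mode("overwrite").option("compression", "none").orc("/tmp/out/orc_none") // uncompressed ORC
```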

ORC is a column-oriented data format that performs better for analytics than CSV and, under certain conditions, will outperform Spark's default format, Parquet.
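As a rough illustration of why the columnar layout helps analytics, comparing the physical plans shows the ORC scan reading only the selected column and pushing the filter down, while the CSV scan still has to read the full text of every row. The column name `user_id` and the paths here are hypothetical:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical single-column query against both copies of the data.
val orcDf = spark.read.orc("/tmp/bench/orc")
orcDf.select("user_id").where(col("user_id") > 100).explain()
// -> ReadSchema lists only user_id, with PushedFilters: [GreaterThan(user_id,100)]

val csvDf = spark.read.option("header", "true").option("inferSchema", "true").csv("/tmp/bench/csv")
csvDf.select("user_id").where(col("user_id") > 100).explain()
// -> the full text of every row must still be read from disk before pruning
```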
