Efficient way to load csv file in spark/scala

I am trying to load a CSV file in Scala using Spark. I see that this can be done with either of the following two syntaxes:

  sqlContext.read.format("csv").options(option).load(path)
  sqlContext.read.options(option).csv(path)

What is the difference between these two, and which one gives better performance? Thanks

There's no difference.

So why do both exist?

  • The .format(fmt).load(path) method is a flexible, pluggable API that allows adding more formats without having to recompile Spark: you can register aliases for custom Data Source implementations and have Spark use them. "csv" used to be such a custom implementation (outside of the packaged Spark binaries), but it is now part of the project.
  • There are shorthand methods for "built-in" data sources (like csv, parquet, json ...) which make the code a bit simpler. Because they are ordinary methods, a misspelled name is caught at compile time, whereas a misspelled format string only fails at runtime.
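To illustrate the compile-time point, here is a minimal sketch. It assumes Spark 2.x or later with a local SparkSession (the question's sqlContext.read returns the same kind of reader); the object name and the file path are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object FormatNameCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("format-name-check")
      .getOrCreate()

    // Hypothetical input path; replace with a real file.
    val path = "/tmp/people.csv"

    // A misspelled format name is just a string, so this compiles and
    // only fails when Spark tries to resolve the data source.
    val typo = Try(spark.read.format("cvs").load(path))
    println(s"format(cvs) failed at runtime: ${typo.isFailure}")

    // The same mistake with the shorthand is rejected by the compiler:
    // spark.read.cvs(path)   // does not compile: cvs is not a member of DataFrameReader

    spark.stop()
  }
}
```

The string-based form is still what you need for a data source that is not built in, since there is no shorthand method for it.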

Ultimately, both create the same CSV Data Source and use it to load the data, so there is no performance difference either.
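If you want to check this yourself, the following sketch reads the same file both ways and compares the results. It assumes Spark 2.x or later running locally; the sample data, the option values, and the object name are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CsvReadEquivalence {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-read-equivalence")
      .getOrCreate()

    // Write a small sample file so the sketch is self-contained.
    val file = Files.createTempFile("people", ".csv")
    Files.write(file, "name,age\nalice,30\nbob,25\n".getBytes("UTF-8"))
    val path = file.toString

    // Illustrative options; any valid CSV options can be used here.
    val options = Map("header" -> "true", "inferSchema" -> "true")

    val viaFormat    = spark.read.format("csv").options(options).load(path)
    val viaShorthand = spark.read.options(options).csv(path)

    // Compare schemas, then check that neither side has rows the other lacks.
    println(s"same schema: ${viaFormat.schema == viaShorthand.schema}")
    println(s"same rows:   ${viaFormat.except(viaShorthand).count() == 0 &&
                              viaShorthand.except(viaFormat).count() == 0}")

    // Print the physical plans so they can be compared by eye.
    viaFormat.explain()
    viaShorthand.explain()

    spark.stop()
  }
}
```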

Bottom line: for any built-in format, you should opt for the "shorthand" method, e.g. csv(path).

