
Read Parquet files from Scala without using Spark

Is it possible to read parquet files from Scala without using Apache Spark?

I found a project that allows us to read and write Avro files using plain Scala:

https://github.com/sksamuel/avro4s

However, I can't find a way to read and write Parquet files from a plain Scala program without using Spark.

It's straightforward enough to do using the parquet-mr project, which is the project Alexey Raga is referring to in his answer.

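Some sample code:

import org.apache.avro.generic.GenericRecord
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

// path is an org.apache.hadoop.fs.Path pointing at the Parquet file
val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
// iter is of type Iterator[GenericRecord]; read returns null at end of file
val iter = Iterator.continually(reader.read).takeWhile(_ != null)
// if you want a list then...
val list = iter.toList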
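
The read loop above relies on the reader returning null once the file is exhausted, and the resulting iterator is lazy, so the reader should only be closed after the iterator has been fully consumed. A minimal sketch of that idiom follows, using only the Scala standard library. `StubReader` and `readAll` are hypothetical names invented for illustration; `StubReader` merely stands in for a real `ParquetReader` and is not part of parquet-mr.

```scala
object ReadLoopSketch {
  // Hypothetical stand-in for ParquetReader: read() returns null once exhausted.
  final class StubReader[A >: Null](records: List[A]) extends AutoCloseable {
    private var remaining = records
    var closed = false

    def read(): A = remaining match {
      case head :: tail =>
        remaining = tail
        head
      case Nil => null
    }

    override def close(): Unit = { closed = true }
  }

  // Drain a null-terminated reader into a List, then close it.
  // toList forces the lazy iterator before the finally block runs.
  def readAll[A >: Null](reader: StubReader[A]): List[A] =
    try Iterator.continually(reader.read()).takeWhile(_ != null).toList
    finally reader.close()
}
```

With a real `ParquetReader` the same shape should apply: materialise the records (or finish processing them) inside the `try`, and call `close()` in the `finally`.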

This will give you standard Avro GenericRecords, but if you want to turn them into a Scala case class, you can use my Avro4s library, which you linked to in your question, to do the marshalling for you. Assuming you are using version 1.30 or higher:

import com.sksamuel.avro4s.RecordFormat

case class Bibble(name: String, location: String)
val format = RecordFormat[Bibble]
// then for a given record
val bibble = format.from(record)

We can obviously combine that with the original iterator in one step:

val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
val format = RecordFormat[Bibble]
// iter is now an Iterator[Bibble]
val iter = Iterator.continually(reader.read).takeWhile(_ != null).map(format.from)
// and list is now a List[Bibble]
val list = iter.toList

There is also a relatively new project called eel, a lightweight (non-distributed processing) toolkit for using some "big data" technologies in the small.

Yes, you don't have to use Spark to read or write Parquet. Just use the Parquet library directly from your Scala code (that's what Spark is doing anyway): http://search.maven.org/#search%7Cga%7C1%7Cparquet
