
Read Parquet files from Scala without using Spark

Is it possible to read parquet files from Scala without using Apache Spark?

I found a project which allows us to read and write Avro files using plain Scala:

https://github.com/sksamuel/avro4s

However, I can't find a way to read and write parquet files using a plain Scala program without using Spark.

It's straightforward enough to do using the parquet-mr project, which is the project Alexey Raga is referring to in his answer.

Some sample code:

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
// iter is of type Iterator[GenericRecord]
val iter = Iterator.continually(reader.read).takeWhile(_ != null)
// if you want a list then...
val list = iter.toList

This will return standard Avro GenericRecords, but if you want to turn those into a Scala case class, then you can use my avro4s library, which you linked to in your question, to do the marshalling for you. Assuming you are using version 1.30 or higher:

import com.sksamuel.avro4s.RecordFormat

case class Bibble(name: String, location: String)
val format = RecordFormat[Bibble]
// then for a given record
val bibble = format.from(record)

We can obviously combine that with the original iterator in one step:

val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
val format = RecordFormat[Bibble]
// iter is now an Iterator[Bibble]
val iter = Iterator.continually(reader.read).takeWhile(_ != null).map(format.from)
// and list is now a List[Bibble]
val list = iter.toList

There is also a relatively new project called eel, which is a lightweight (non-distributed processing) toolkit for using some "big data" technologies in the small.

Yes, you don't have to use Spark to read/write Parquet. Just use the parquet lib directly from your Scala code (and that's what Spark is doing anyway): http://search.maven.org/#search%7Cga%7C1%7Cparquet
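Since the question asks about writing as well as reading, here is a minimal sketch of the write side using parquet-mr's AvroParquetWriter. The output path, schema, and field values below are illustrative assumptions, not from the question; the schema mirrors the Bibble case class used in the other answer.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Illustrative schema: two required string fields, matching Bibble(name, location)
val schema: Schema = SchemaBuilder.record("Bibble").fields()
  .requiredString("name")
  .requiredString("location")
  .endRecord()

// Build a writer for a local file (path is an assumption for this example)
val writer = AvroParquetWriter.builder[GenericRecord](new Path("bibbles.parquet"))
  .withSchema(schema)
  .withCompressionCodec(CompressionCodecName.SNAPPY)
  .build()

// Write a record and close; closing flushes the Parquet footer
val record = new GenericData.Record(schema)
record.put("name", "fred")
record.put("location", "somewhere")
writer.write(record)
writer.close()
```

You will need parquet-avro (and its Hadoop client dependencies) on the classpath, but no Spark.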
