
Converting CSV to ORC with Spark

I've seen this blog post by Hortonworks for support for ORC in Spark 1.2 through datasources. 我看过Hortonworks的这篇博客文章 ,它通过数据源支持Spark 1.2中的ORC。

It covers version 1.2, and it addresses creating ORC files from objects, not converting from CSV to ORC. I have also seen ways to do these conversions in Hive, as intended.

Could someone please provide a simple example of how to load a plain CSV file in Spark 1.6+, save it as ORC, and then load it back as a DataFrame in Spark?

I'm going to omit the CSV-reading part, because that question has been answered many times before and plenty of tutorials are available on the web for that purpose, so it would be overkill to write it again. Check here if you want!
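For completeness, a minimal sketch of that reading step, assuming Spark 1.6 with the spark-csv package (com.databricks:spark-csv) on the classpath; the file path, header, and schema options are placeholders to adapt:

```scala
// Read a CSV file into a DataFrame using the external spark-csv data source
// (bundled into Spark core only from 2.0 onwards).
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // let Spark guess the column types
  .load("/path/to/input.csv")    // placeholder path
```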

ORC support:

Concerning ORC, it is supported with the HiveContext.

HiveContext is an instance of the Spark SQL execution engine that integrates with data stored in Hive. SQLContext provides a subset of the Spark SQL support that does not depend on Hive, but ORC, window functions, and other features depend on HiveContext, which reads the configuration from hive-site.xml on the classpath.

You can define a HiveContext as follows:

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

If you are working with the spark-shell, you can directly use sqlContext for this purpose without creating a hiveContext, since by default sqlContext is created as a HiveContext.

Specifying stored as orc at the end of the SQL statement below ensures that the Hive table is stored in the ORC format. e.g.:

val df: DataFrame = ???
df.registerTempTable("csv_table") // expose the DataFrame to SQL (name must differ from the Hive table)
val results = hiveContext.sql("create table orc_table (date STRING, price FLOAT, user INT) stored as orc")
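To actually populate that ORC-backed table from the DataFrame, one option is an INSERT ... SELECT through the HiveContext. This is a sketch: the temp-table name csv_table is an assumption, so adjust it to whatever name you registered the DataFrame under:

```scala
// Copy the DataFrame's rows into the ORC-backed Hive table.
// Assumes the DataFrame was registered as a temp table named csv_table
// and that its columns line up with orc_table's schema.
hiveContext.sql("INSERT INTO TABLE orc_table SELECT * FROM csv_table")
```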

Saving as an ORC file

Let's persist the DataFrame we created before as ORC files.

df.write.format("orc").save("data_orc")

To store the results in a Hive directory rather than the user directory, use this path instead: /apps/hive/warehouse/data_orc (the Hive warehouse path from hive-default.xml).
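To complete the round trip asked for in the question, the saved ORC files can be loaded back into a DataFrame. A sketch, assuming the data_orc path used in the save call above:

```scala
// Load the ORC files back as a DataFrame through the HiveContext.
val orcDf = hiveContext.read.format("orc").load("data_orc")
orcDf.printSchema() // the schema is read from the ORC metadata
orcDf.show()
```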
