
Read Avro with Spark in Java

Can somebody share an example of reading Avro using Java in Spark? I found Scala examples but had no luck with Java. Here is the code snippet, part of a larger program, that runs into compilation issues with the method ctx.newAPIHadoopFile:

JavaSparkContext ctx = new JavaSparkContext(sparkConf);
Configuration hadoopConf = new Configuration();
JavaRDD<SampleAvro> lines = ctx.newAPIHadoopFile(path, AvroInputFormat.class, AvroKey.class, NullWritable.class, new Configuration());

Regards

You can use the spark-avro connector library by Databricks. The recommended way to read or write Avro data from Spark SQL is by using Spark's DataFrame APIs.

The connector enables both reading and writing Avro data from Spark SQL:

import org.apache.spark.sql.*;

SQLContext sqlContext = new SQLContext(sc);

// Creates a DataFrame from a specified file
DataFrame df = sqlContext.read().format("com.databricks.spark.avro")
    .load("src/test/resources/episodes.avro");

// Saves the subset of the Avro records read in
// (in Java, filter takes a condition string; $"..." is Scala-only syntax)
df.filter("age > 5").write()
    .format("com.databricks.spark.avro")
    .save("/tmp/output");
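Because the connector plugs into Spark SQL, you can also register the DataFrame as a temporary table and query it with plain SQL. A minimal sketch using the Spark 1.x API; the "title" column is an assumed field of episodes.avro, used here only for illustration:

// Register the Avro-backed DataFrame as a temporary table (Spark 1.3+ API)
df.registerTempTable("episodes");

// Query it with SQL; "title" is an assumed column name for illustration
DataFrame titles = sqlContext.sql("SELECT title FROM episodes");
titles.show();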

Note that this connector has different versions for Spark 1.2, 1.3, and 1.4+:

Spark version    connector version
1.2              0.2.0
1.3              1.0.0
1.4+             2.0.1

Using Maven:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.10</artifactId>
    <version>{AVRO_CONNECTOR_VERSION}</version>
</dependency>

See further info at: Spark SQL Avro Library

Here, assuming K is your key class and V is your value class:

....

// Uses the new Hadoop MapReduce API (org.apache.hadoop.mapreduce)
Job job = Job.getInstance();

job.setInputFormatClass(AvroKeyValueInputFormat.class);

FileInputFormat.addInputPaths(job, <inputPaths>);
AvroJob.setInputKeySchema(job, <keySchema>);
AvroJob.setInputValueSchema(job, <valueSchema>);

// Java class literals cannot be parameterized, so the raw classes are passed
JavaPairRDD<AvroKey<K>, AvroValue<V>> avroRDD =
    sc.newAPIHadoopRDD(job.getConfiguration(),
        AvroKeyValueInputFormat.class,
        AvroKey.class,
        AvroValue.class);
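To actually consume the records, you can map over the resulting pair RDD and unwrap the datum carried by each AvroValue. A minimal sketch, assuming the value type V is GenericRecord and that the records carry a "name" field; both are illustrative assumptions, not part of the answer above:

import org.apache.avro.generic.GenericRecord;
import org.apache.spark.api.java.JavaRDD;

// Unwrap the Avro datum from each (key, value) pair and extract a field
JavaRDD<String> names = avroRDD.map(pair -> {
    GenericRecord record = pair._2().datum(); // assumes V = GenericRecord
    return record.get("name").toString();     // "name" is an assumed field
});

// Trigger the computation and print a few values
names.take(10).forEach(System.out::println);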
