
Read Avro with Spark in Java

Can somebody share an example of reading Avro using Java in Spark? I found Scala examples but had no luck with Java. Here is a snippet from my code; it runs into compilation issues with the method ctx.newAPIHadoopFile.

JavaSparkContext ctx = new JavaSparkContext(sparkConf);
Configuration hadoopConf = new Configuration();
JavaRDD<SampleAvro> lines = ctx.newAPIHadoopFile(path, AvroInputFormat.class, AvroKey.class, NullWritable.class, new Configuration());


You can use the spark-avro connector library by Databricks.
The recommended way to read or write Avro data from Spark SQL is by using Spark's DataFrame APIs.

The connector enables both reading and writing Avro data from Spark SQL:

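import org.apache.spark.sql.*;

SQLContext sqlContext = new SQLContext(sc);

// Creates a DataFrame from a specified file (<inputPath> is a placeholder)
DataFrame df = sqlContext.read().format("com.databricks.spark.avro")
    .load(<inputPath>);

// Saves the subset of the Avro records read in (<outputPath> is a placeholder).
// In Java the filter condition is a plain SQL expression string; $"..." is Scala-only syntax.
df.filter("age > 5").write()
    .format("com.databricks.spark.avro")
    .save(<outputPath>);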

Note that this connector has different versions for Spark 1.2, 1.3, and 1.4+:

Spark version   Connector version
1.2             0.2.0
1.3             1.0.0
1.4+            2.0.1
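
As a rough illustration, the version table above can be encoded as a small lookup, for example to check a build configuration. This is only a sketch: the class name ConnectorVersions and the method connectorFor are hypothetical and not part of spark-avro, and it assumes that "1.4+" covers only later 1.x releases, which the table does not state explicitly.

```java
import java.util.Optional;

// Hypothetical helper, not part of spark-avro: maps a Spark version string
// such as "1.3.1" to the connector version listed in the table above.
class ConnectorVersions {

    static Optional<String> connectorFor(String sparkVersion) {
        String[] parts = sparkVersion.trim().split("\\.");
        if (parts.length < 2) {
            return Optional.empty();
        }
        int major;
        int minor;
        try {
            major = Integer.parseInt(parts[0]);
            minor = Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
        // Assumption: the table only speaks about Spark 1.x releases.
        if (major != 1) {
            return Optional.empty();
        }
        if (minor == 2) {
            return Optional.of("0.2.0");
        }
        if (minor == 3) {
            return Optional.of("1.0.0");
        }
        if (minor >= 4) {
            return Optional.of("2.0.1");
        }
        return Optional.empty(); // Spark 1.0 and 1.1 are not listed
    }

    public static void main(String[] args) {
        System.out.println(connectorFor("1.3.1").orElse("unknown")); // prints 1.0.0
    }
}
```

Versions outside the table (for example 1.1 or 2.0) yield an empty Optional rather than a guess.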

Using Maven:


See further info at: Spark SQL Avro Library

Here, assuming K is your key type and V is your value type:


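import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyValueInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Job job = Job.getInstance();
job.setInputFormatClass(AvroKeyValueInputFormat.class); // class literals take no type arguments

FileInputFormat.addInputPaths(job, <inputPaths>);
AvroJob.setInputKeySchema(job, <keySchema>);
AvroJob.setInputValueSchema(job, <valueSchema>);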

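// ctx is the JavaSparkContext from the question. The class literals are raw,
// so the RDD is raw-typed here and an unchecked warning is to be expected.
JavaPairRDD<AvroKey, AvroValue> avroRDD =
  ctx.newAPIHadoopRDD(job.getConfiguration(),
    AvroKeyValueInputFormat.class, AvroKey.class, AvroValue.class);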
