I'm looking to create a bunch of ORC files from Avro messages consumed from Kafka.
I saw some sample Spark code for this (below). I'm running it in a standalone process and wondering what options I should look into, since I ultimately want to push these files to cloud storage such as S3. Is there a recommended way of doing this?
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.collection.JavaConversions;
import com.google.common.collect.ImmutableMap;

SparkConf sparkConf = new SparkConf()
        .setAppName("Converter Service")
        .setMaster("local[*]");

SparkSession sparkSession = SparkSession.builder()
        .config(sparkConf)
        .enableHiveSupport()
        .getOrCreate();

// read input data
Dataset<Row> events = sparkSession.read()
        .format("json")
        .schema(inputConfig.getSchema()) // StructType describing the input schema
        .load(inputFile.getPath());

// write data out; save() returns void, so the result can't be assigned to a DataFrameWriter
events
        .selectExpr(
                // useful if you want to change the schema before writing it to ORC,
                // e.g. ["`col1` as `FirstName`", "`col2` as `LastName`"]
                JavaConversions.asScalaBuffer(outputSchema.getColumns()))
        .write()
        .options(ImmutableMap.of("compression", "zlib"))
        .format("orc")
        .save(outputUri.getPath());
Use the Databricks spark-avro reader to create DataFrames; Spark supports ORC natively, so writing the files out is trivial.
You'll find the spark-avro library on Maven Central (and since Spark 2.4 it ships with Spark itself as the built-in "avro" format).
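For example, with sbt (the version shown is an assumption; match it to your Spark and Scala build, and for Spark older than 2.4 use the com.databricks:spark-avro artifact instead):

libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.5" % "provided"

You can also pull it in at submit time with spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.5.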
In Scala, it would look something like this:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// import the session's implicits after the session exists,
// e.g. for conversions such as Seq(...).toDS
import spark.implicits._

// requires the spark-avro package on the classpath (see the dependency note above)
val df = spark.read.format("avro").load("/tmp/episodes.avro")

df.write.orc("name.orc") // you can write to S3 here instead, e.g. an s3a:// path
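Since you mentioned S3: here is a minimal sketch of writing the same DataFrame there through the s3a connector. It assumes hadoop-aws (and its AWS SDK dependency) is on the classpath, and the bucket and path below are placeholders.

// credentials can also come from instance profiles or the default provider chain
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

df.write
  .option("compression", "zlib") // same codec as the Java example above
  .orc("s3a://my-bucket/events/") // "my-bucket/events" is a placeholder path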