
Spark - Java - Create Parquet/Avro Without Using DataFrames of Spark SQL

I want to get the output of a Spark application as Parquet or Avro files. (We use only core Spark, and the people working on the project do not want to change it to Spark SQL.)

When I searched for these two file types, I couldn't find any example that does not use DataFrames, or Spark SQL in general. Can I achieve this without using Spark SQL?

My data is tabular; it has columns, but all of the data is used in processing, not any single column. Its columns are decided at runtime, so there are no generic columns like "name, ID, address". It looks like this:

No f1       f2       f3       ...
1, 123.456, 123.457, 123.458, ...
2, 123.789, 123.790, 123.791, ...
...
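Because the columns are only known at runtime, the Avro schema for rows like these has to be generated programmatically rather than written by hand. A minimal sketch, assuming all columns hold doubles, that builds the schema JSON from the runtime column names using only the JDK (with the Avro library on the classpath, the resulting string would then be passed to `Schema.Parser`, shown here only as a comment):

```java
import java.util.List;
import java.util.stream.Collectors;

public class RuntimeSchema {

    // Builds an Avro record-schema JSON string for double-valued
    // columns whose names are only known at runtime.
    static String schemaJson(String recordName, List<String> columns) {
        String fields = columns.stream()
                .map(c -> "{\"name\":\"" + c + "\",\"type\":\"double\"}")
                .collect(Collectors.joining(","));
        return "{\"type\":\"record\",\"name\":\"" + recordName + "\","
                + "\"fields\":[" + fields + "]}";
    }

    public static void main(String[] args) {
        String json = schemaJson("Row", List.of("f1", "f2", "f3"));
        System.out.println(json);
        // With avro on the classpath this would become:
        // Schema schema = new Schema.Parser().parse(json);
    }
}
```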

You can't save an RDD as Parquet without converting it to a DataFrame. An RDD has no schema, but Parquet is a columnar format that requires one, so we need to convert the RDD to a DataFrame first.

You can use the createDataFrame API.
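A minimal sketch of that route, assuming a `SparkSession` named `spark` and a `JavaRDD<Row>` named `rowRdd` already exist (these names, the column list, and the output path are illustrative). Since the columns are only known at runtime, the `StructType` is assembled from the runtime column names before calling `createDataFrame`:

```java
// Illustrative sketch: build a schema from runtime column names,
// convert the RDD to a DataFrame, and write Parquet.
List<StructField> fields = new ArrayList<>();
for (String column : runtimeColumnNames) {               // e.g. ["f1", "f2", "f3"]
    fields.add(DataTypes.createStructField(column, DataTypes.DoubleType, false));
}
StructType schema = DataTypes.createStructType(fields);

Dataset<Row> df = spark.createDataFrame(rowRdd, schema); // rowRdd: JavaRDD<Row>
df.write().parquet("/path/to/output");                   // Avro instead would need the
                                                         // external spark-avro package:
                                                         // .format("avro").save(...)
```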

I tried this and it works like a champ...

// Required imports (hadoop-common, avro, parquet-avro, spark-core on the classpath):
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.spark.api.java.JavaRDD;

public class ParquetHelper {

    private static ParquetWriter<GenericData.Record> writer;
    private static Schema schema;

    public ParquetHelper(Schema schema, String pathName) throws IOException {
        Path path = new Path(pathName);
        writer = AvroParquetWriter
                .<GenericData.Record>builder(path)
                .withRowGroupSize(ParquetWriter.DEFAULT_BLOCK_SIZE)
                .withPageSize(ParquetWriter.DEFAULT_PAGE_SIZE)
                .withSchema(schema)
                .withConf(new Configuration())
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .withValidation(true)
                .withDictionaryEncoding(false)
                .build();
        ParquetHelper.schema = schema;
    }

    /*
     * Writes the records of the RDD to a single Parquet file.
     * The records are collected to the driver first, because
     * ParquetWriter is not serializable and therefore cannot be
     * used inside a foreach running on the executors.
     */
    public static void writeToParquet(JavaRDD<GenericData.Record> empRddRecords) throws IOException {
        for (GenericData.Record record : empRddRecords.collect()) {
            if (record != null && new RecordValidator().validate(record, schema).isEmpty()) {
                writer.write(record);
            } // TODO collect bad records here
        }
        writer.close();
    }
}
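From the driver, the helper above would be used roughly like this (the schema JSON, path, and RDD name are illustrative):

```java
Schema schema = new Schema.Parser().parse(schemaJsonString);
new ParquetHelper(schema, "/path/to/output/data.parquet");
ParquetHelper.writeToParquet(recordRdd); // recordRdd: JavaRDD<GenericData.Record>
```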

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

 
© 2020-2024 STACKOOM.COM