
How can I use the AvroParquetWriter and write to S3 via the AmazonS3 api?

I am currently using the code below to write parquet via Avro. This code writes it to a file system but I want to write to S3.

try {
    StopWatch sw = StopWatch.createStarted();
    Schema avroSchema = AvroSchemaBuilder.build("pojo", message.getTransformedMessage().get(0));
    final String parquetFile = "parquet/data.parquet";
    final Path path = new Path(parquetFile);

    ParquetWriter<GenericData.Record> writer = AvroParquetWriter.<GenericData.Record>builder(path)
        .withSchema(avroSchema)
        .withConf(new org.apache.hadoop.conf.Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .withWriteMode(Mode.OVERWRITE)//probably not good for prod. (overwrites files).
        .build();

    for (Map<String, Object> row : message.getTransformedMessage()) {
      StopWatch stopWatch = StopWatch.createStarted();
      final GenericRecord record = new GenericData.Record(avroSchema);
      row.forEach((k, v) -> {
        record.put(k, v);
      });
      writer.write(record);
    }
    //todo:  Write to S3.  We should probably write via the AWS objects.  This does not show that.
    //https://stackoverflow.com/questions/47355038/how-to-generate-parquet-file-using-pure-java-including-date-decimal-types-an
    writer.close();
    System.out.println("Total Time: " + sw);

  } catch (Exception e) {
    //do something here.  retryable?  non-retryable?  Wrap this exception in one of these?
    transformedParquetMessage.getOriginalMessage().getMetaData().addException(e);
  }

This writes to a file fine, but how do I get it to stream into the AmazonS3 api? I have found some code on the web using the hadoop-aws jar, but that requires some Windows exe files to work and, of course, we want to avoid that. Currently I am using only:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.9.2</version>
</dependency>
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.8.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.2.1</version>
</dependency>

So the question is, is there a way to intercept the output stream on the AvroParquetWriter so I can stream it to S3? The main reason I want to do this is for retries. S3 automagically retries up to 3 times. This would help us out a lot.

This does depend on the hadoop-aws jar, so if you're not willing to use that I'm not sure I can help you. I am, however, running on a Mac and do not have any Windows exe files, so I'm not sure where you say those are coming from. The AvroParquetWriter already depends on Hadoop, so even if this extra dependency is unacceptable to you it may not be a big deal to others:

You can use an AvroParquetWriter to stream directly to S3 by passing it a Hadoop Path that is created with a URI parameter and setting the proper configs.

val uri = new URI("s3a://<bucket>/<key>")
val path = new Path(uri)

val config = new Configuration()
config.set("fs.s3a.access.key", key)
config.set("fs.s3a.secret.key", secret)
config.set("fs.s3a.session.token", sessionToken)
config.set("fs.s3a.aws.credentials.provider", credentialsProvider)

val writer = AvroParquetWriter.builder[GenericRecord](path).withConf(config).withSchema(schema).build()
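
For reference, a Java version of the same idea might look roughly like this (a sketch: the schema, bucket, and key are placeholders, the s3a configuration keys are the standard hadoop-aws ones, and in practice you would usually prefer the default credentials provider chain over static keys):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class S3ParquetWriteExample {
  public static void main(String[] args) throws Exception {
    // Placeholder schema; substitute the one built from your message.
    Schema schema = SchemaBuilder.record("pojo").fields()
        .requiredString("name")
        .endRecord();

    // An s3a:// path; bucket and key are placeholders.
    Path path = new Path("s3a://my-bucket/parquet/data.parquet");

    // Configure the s3a filesystem with credentials.
    Configuration conf = new Configuration();
    conf.set("fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"));
    conf.set("fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY"));

    // ParquetWriter is Closeable, so try-with-resources closes (and flushes) it.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(path)
        .withConf(conf)
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("name", "example");
      writer.write(record);
    }
  }
}

Note that with s3a the object generally only becomes visible in S3 once the writer is closed, since the upload happens on close.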

I used the following dependencies (sbt format):

"org.apache.avro" % "avro" % "1.8.1"
"org.apache.hadoop" % "hadoop-common" % "2.9.0"
"org.apache.hadoop" % "hadoop-aws" % "2.9.0"
"org.apache.parquet" % "parquet-avro" % "1.8.1"

Hopefully I am not misunderstanding the question, but it seems what you are doing here is converting Avro to Parquet, and you'd like to upload the Parquet file to S3.

After you close your ParquetWriter, you should call a method that looks like this (granted, this doesn't intercept the stream writing from Avro to Parquet; it just streams the Parquet file that is no longer being written to):

// Build the client with static credentials.
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .withCredentials(new AWSStaticCredentialsProvider(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")))
        .build();

// S3Path is a simple bucket/key holder (not an AWS SDK class).
S3Path outputPath = new S3Path();
outputPath.setBucket("YOUR_BUCKET");
outputPath.setKey("YOUR_FOLDER_PATH");

try {
    // Stream the finished parquet file; metadata is left null here, so the
    // SDK will buffer the stream in memory to compute the content length.
    InputStream parquetStream = new FileInputStream(new File(parquetFile));
    s3Client.putObject(outputPath.getBucket(), outputPath.getKey(), parquetStream, null);
} catch (FileNotFoundException e) {
    e.printStackTrace();
}

using the AWS SDK

<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.11.749</version>
</dependency>

Of course, the method would reside in a different utils class, and that class's constructor should initialize the AmazonS3 s3Client with the credentials, so all you'd need to do is invoke it and access its s3Client member to put objects.
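
A minimal sketch of that shape (the class name S3Uploader is made up for illustration; only the AWS SDK calls are real, and the region is a placeholder):

import java.io.File;

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Hypothetical utility class that builds the client once in the constructor.
public class S3Uploader {

  private final AmazonS3 s3Client;

  public S3Uploader(String accessKey, String secretKey) {
    this.s3Client = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.US_EAST_1) // placeholder region
        .withCredentials(new AWSStaticCredentialsProvider(
            new BasicAWSCredentials(accessKey, secretKey)))
        .build();
  }

  public void upload(String bucket, String key, String localPath) {
    // The File overload lets the SDK determine the content length itself,
    // unlike the InputStream overload used above.
    s3Client.putObject(bucket, key, new File(localPath));
  }
}

Usage would then be something like new S3Uploader(accessKey, secretKey).upload("YOUR_BUCKET", "YOUR_FOLDER_PATH", parquetFile).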

hope this helps
