
Writing parquet files to S3 using AWS Java Lambda

I'm writing an AWS Lambda function that reads protobuf objects from Kinesis, and I would like to write them to S3 as a Parquet file.

I saw there's an implementation of ParquetWriter for protobuf called ProtoParquetWriter, which is good. My problem is that ProtoParquetWriter expects a Path in its constructor.

What's the right way to do that without saving the content as a local Parquet file, assuming I'm not using the file system at all?

If you want to write to S3, you can set the Path to Path("s3a://<bucketName>/<s3Key>"). And don't forget to set the S3 credentials in the Hadoop configuration:

    Configuration conf = new Configuration();
    conf.set("fs.s3a.access.key", "<s3AccessKey>");
    conf.set("fs.s3a.secret.key", "<s3SecretKey>");
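For the credentials to take effect, that `Configuration` has to reach the writer. The `ProtoParquetWriter` constructor used in the sample below doesn't accept one, so a configuration sketch using the builder API would look like the following (this assumes a parquet-protobuf version whose `ProtoParquetWriter` exposes a builder with `withConf`; check your version before relying on it):

```java
// Sketch only: assumes ProtoParquetWriter.builder(...) is available
// (newer parquet-protobuf releases) and hadoop-aws is on the classpath.
Configuration conf = new Configuration();
conf.set("fs.s3a.access.key", "<s3AccessKey>");
conf.set("fs.s3a.secret.key", "<s3SecretKey>");

Path file = new Path("s3a://<bucketName>/<s3Key>");
try (ParquetWriter<TransactionProtos.Transaction> writer =
        ProtoParquetWriter.<TransactionProtos.Transaction>builder(file)
                .withMessage(TransactionProtos.Transaction.class)
                .withConf(conf) // credentials travel with the writer
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
    // writer.write(...) as in the sample below
}
```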

Assuming you have a List<Transaction> (the element can be any complex object), here is sample code to read/write protobuf Parquet to and from S3:

    // Assumes parquet-protobuf, hadoop-aws, slf4j and Guava on the classpath
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;
    import org.apache.parquet.proto.ProtoParquetReader;
    import org.apache.parquet.proto.ProtoParquetWriter;
    import org.slf4j.Logger;

    import com.google.common.base.Stopwatch;

    public class SimpleS3ParquetUtilities implements S3Utilities {

        final Logger logger;
        static final String PATH_SCHEMA = "s3a";
        final CompressionCodecName compressionCodecName;

        public SimpleS3ParquetUtilities(Logger logger) {
            this(logger, CompressionCodecName.UNCOMPRESSED);
        }

        public SimpleS3ParquetUtilities(Logger logger, CompressionCodecName compressionCodecName) {
            this.logger = logger;
            this.compressionCodecName = compressionCodecName;
        }

        @Override
        public String writeTransactions(String bucket, String objectKey, List<Transaction> transactions)
                throws Exception {
            // Path(scheme, authority, path) requires the key to start with '/'
            if (objectKey.charAt(0) != '/')
                objectKey = "/" + objectKey;
            Path file = new Path(PATH_SCHEMA, bucket, objectKey);
            Stopwatch sw = Stopwatch.createStarted();
            // convert the list into protobuf messages
            List<TransactionProtos.Transaction> protoTransactions = Convertor.toProtoBuf(transactions);
            try (ProtoParquetWriter<TransactionProtos.Transaction> writer = new ProtoParquetWriter<TransactionProtos.Transaction>(
                    file, TransactionProtos.Transaction.class, this.compressionCodecName,
                    ProtoParquetWriter.DEFAULT_BLOCK_SIZE, ProtoParquetWriter.DEFAULT_PAGE_SIZE)) {
                for (TransactionProtos.Transaction transaction : protoTransactions) {
                    writer.write(transaction);
                }
            }
            logger.info("Parquet write elapsed:[{}{}] Time:{}ms items:{}", bucket, objectKey,
                    sw.elapsed(TimeUnit.MILLISECONDS), transactions.size());
            return "";
        }

        @Override
        public List<Transaction> readTransactions(String bucket, String pathWithFileName)
                throws Exception {
            if (pathWithFileName.charAt(0) != '/')
                pathWithFileName = "/" + pathWithFileName;
            Path file = new Path(PATH_SCHEMA, bucket, pathWithFileName);
            Stopwatch sw = Stopwatch.createStarted();
            try (ParquetReader<TransactionProtos.Transaction.Builder> reader = ProtoParquetReader.<TransactionProtos.Transaction.Builder>builder(
                    file).build()) {
                List<TransactionProtos.Transaction> transactions = new ArrayList<TransactionProtos.Transaction>();
                // ProtoParquetReader yields builders; read until the stream is exhausted
                TransactionProtos.Transaction.Builder builder = reader.read();
                while (builder != null) {
                    transactions.add(builder.build());
                    builder = reader.read();
                }
                logger.info("Parquet read elapsed:[{}{}] Time:{}ms items:{}", bucket, pathWithFileName,
                        sw.elapsed(TimeUnit.MILLISECONDS), transactions.size());
                return Convertor.fromProtoBuf(transactions);
            }
        }
    }
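A side note on the three-argument `Path(scheme, authority, path)` constructor used above: it assembles the same `s3a://bucket/key` URI shown earlier, with the bucket as the authority and the key as the path, which is why the key must begin with `/`. Plain `java.net.URI` (bucket and key names here are made up for illustration) shows the same decomposition:

```java
import java.net.URI;

public class S3aPathDemo {
    public static void main(String[] args) throws Exception {
        String bucket = "my-bucket";           // hypothetical bucket name
        String key = "/events/part-0.parquet"; // must start with '/'
        // Same split Hadoop's new Path("s3a", bucket, key) performs
        URI uri = new URI("s3a", bucket, key, null);
        System.out.println(uri);                // s3a://my-bucket/events/part-0.parquet
        System.out.println(uri.getScheme());    // s3a
        System.out.println(uri.getAuthority()); // my-bucket
        System.out.println(uri.getPath());      // /events/part-0.parquet
    }
}
```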
