
How do I configure S3 access for org.apache.parquet.avro.AvroParquetReader?

I struggled with this for a while and wanted to share my solution. AvroParquetReader is a fine tool for reading Parquet, but its default credential resolution for S3 fails with:

java.io.InterruptedIOException: doesBucketExist on MY_BUCKET: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: Unable to load credentials from service endpoint

I want to use a credentials provider like com.amazonaws.auth.profile.ProfileCredentialsProvider, which works when I access my S3 bucket directly, but it is not clear from AvroParquetReader's class definition or documentation how to configure one.

This code worked for me. It allowed AvroParquetReader to access S3 using ProfileCredentialsProvider.

import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.hadoop.fs.Path;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;

...

final String path = "s3a://" + bucketName + "/" + pathName;
final Configuration configuration = new Configuration();
// Tell the S3A filesystem to authenticate with ProfileCredentialsProvider
// (reads the profile from ~/.aws/credentials).
configuration.setClass("fs.s3a.aws.credentials.provider", ProfileCredentialsProvider.class,
        AWSCredentialsProvider.class);
ParquetReader<GenericRecord> parquetReader =
        AvroParquetReader.<GenericRecord>builder(new Path(path)).withConf(configuration).build();
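
For completeness, here is a minimal sketch of how the reader can then be consumed: read() returns the next record, or null once the file is exhausted, and the reader should be closed when done.

// Iterate the records in the Parquet file and print each one.
GenericRecord record;
while ((record = parquetReader.read()) != null) {
    System.out.println(record);
}
parquetReader.close();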

For anyone else experiencing problems with this: I found that @jd_free's answer didn't work for me. The only thing I needed to change to make it work was the configuration passed to AvroParquetReader, specifically the kind of AWSCredentialsProvider used:

Configuration configuration = new Configuration();
configuration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider");
configuration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
configuration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

The problem was the credentials provider and the way it was passed to the configuration. For more information on the different credential providers you can use, see the Hadoop S3A documentation. It explains which providers suit different scenarios, including how to pick credentials up from environment variables.
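
As a sketch of that last point (assuming you want credentials taken from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables rather than hard-coded keys), fs.s3a.aws.credentials.provider also accepts a comma-separated list of providers, tried in order:

Configuration configuration = new Configuration();
// Try environment variables first, then fall back to keys set directly
// in the configuration via fs.s3a.access.key / fs.s3a.secret.key.
configuration.set("fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.EnvironmentVariableCredentialsProvider,"
        + "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider");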
