
How to convert a CSV file in an S3 bucket to an RDD

I'm pretty new to this topic, so any help will be much appreciated.

I'm trying to read a CSV file stored in an S3 bucket and convert its data to an RDD so I can work with it directly, without having to create a local copy of the file.

So far I've been able to load the file using AmazonS3ClientBuilder, but all I end up with is the file content in an S3ObjectInputStream, and I don't know how to work with it.

import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.AmazonS3ClientBuilder

val bucketName = "bucket-name"

// Static credentials for the S3 client
val credentials = new BasicAWSCredentials(
  "accessKey",
  "secretKey"
)

// Build an S3 client for the bucket's region
val s3client = AmazonS3ClientBuilder
  .standard()
  .withCredentials(new AWSStaticCredentialsProvider(credentials))
  .withRegion(Regions.US_EAST_2)
  .build()

// Download the object; its content is exposed as an S3ObjectInputStream
val s3object = s3client.getObject(bucketName, "file-name.csv")
val inputStream = s3object.getObjectContent()
....

I have also tried using a BufferedSource, but once I have it, I don't know how to convert it into a DataFrame or an RDD.

import scala.io.Source

// Wraps the S3 stream in a BufferedSource
val myData = Source.fromInputStream(inputStream)
....

You can do it with the S3A file system provided by the hadoop-aws module:

  1. Add this dependency: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
  2. Either define <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3a.S3AFileSystem</value></property> in core-site.xml, or add .config("fs.s3.impl", classOf[S3AFileSystem].getName) to the SparkSession builder.
  3. Access S3 using spark.read.csv("s3://bucket/key"). If you want the RDD that was asked for, use spark.read.csv("s3://bucket/key").rdd (see the sketch after this list).
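
Putting those steps together, here is a minimal sketch, assuming Spark 2.x or later with a hadoop-aws version that matches your Hadoop distribution. The bucket, key and credential values are placeholders, and fs.s3a.access.key / fs.s3a.secret.key are the standard S3A credential properties (in practice you would usually rely on environment variables or instance profiles instead):

import org.apache.hadoop.fs.s3a.S3AFileSystem
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-csv-from-s3")
  .master("local[*]")
  // Map the s3:// scheme to the S3A implementation from hadoop-aws
  .config("fs.s3.impl", classOf[S3AFileSystem].getName)
  // Placeholder credentials; prefer env vars or instance profiles in real code
  .config("fs.s3a.access.key", "accessKey")
  .config("fs.s3a.secret.key", "secretKey")
  .getOrCreate()

// DataFrame with the CSV contents; .rdd gives an RDD[Row] if that is what you need
val df = spark.read
  .option("header", "true") // set to "false" if the file has no header row
  .csv("s3://bucket-name/file-name.csv")

val rdd = df.rdd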

In the end I was able to get the result I was looking for; take a look at https://gist.github.com/snowindy/d438cb5256f9331f5eec
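
For reference, here is a minimal sketch of the same general idea (not the gist's code verbatim): read the object's content through the AWS SDK stream and parallelize the lines into an RDD. It reuses the s3client built in the question above; the bucket and key names are placeholders, and the whole file is read into driver memory, so this only suits reasonably small files:

import scala.io.Source
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-stream-to-rdd").master("local[*]").getOrCreate()

// Download the object and read all lines from its content stream
val s3object = s3client.getObject("bucket-name", "file-name.csv")
val lines = Source.fromInputStream(s3object.getObjectContent()).getLines().toList
s3object.close() // release the underlying HTTP connection once the stream is consumed

// Turn the in-memory lines into an RDD
val rdd = spark.sparkContext.parallelize(lines)

// Naive CSV split per line (does not handle quoted fields)
val rows = rdd.map(_.split(",").map(_.trim))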
