
How to read a text file in S3 bucket from inside an AWS EMR without using spark

I need to open a regular text file located in an S3 bucket (NOT a Parquet or CSV file) from an EMR cluster. I can open CSV or Parquet files directly with spark.read.parquet("s3://mybucket/some_parq_file")

But I need to read a plain text file from the EMR cluster using java.io.File or scala.io.Source. I get a java.io.FileNotFoundException when I try:

import scala.io.Source
val hdr = "s3://mybucket/txtfile.txt"
for (line <- Source.fromFile(hdr).getLines) {
    println(line)
}
  1. You can supply a bootstrap script (an .sh file) that runs when the EMR cluster comes up; that script can access the S3 file (I have used this multiple times).
  2. You can submit EMR steps, which execute a jar file, and the jar can access S3.
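The bootstrap-script route from option 1 can be sketched as a short shell script. The bucket and file names below are placeholder assumptions; the `aws` CLI comes preinstalled on EMR nodes, and the instance profile supplies credentials:

```shell
#!/bin/bash
# Hypothetical EMR bootstrap action: copy a text file from S3 onto the node.
# Bucket/key names are placeholders -- substitute your own.
set -euo pipefail

SRC="s3://mybucket/txtfile.txt"   # assumed source object
DEST="/tmp/txtfile.txt"           # local destination on each node

aws s3 cp "$SRC" "$DEST"
```

You would register this script as a bootstrap action when creating the cluster, so it runs on every node before any steps start.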

I guess that most AWS setups already have credentials configured on your EMR cluster via the default credential provider chain and the default region provider chain. The same applies to AWS Lambda. So to access my S3 buckets from the EMR cluster, I simply had to use AmazonS3ClientBuilder:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

val bucket = "s3_bucket"
val file_in_s3 = "somefile.txt"
val dest = "/tmp/local_file.txt"

// Picks up credentials and region from the default provider chains
val s3 = AmazonS3ClientBuilder.defaultClient()
val stream = s3.getObject(bucket, file_in_s3).getObjectContent

// Copy the S3 object's content stream to a local file
Files.copy(stream, new File(dest).toPath, StandardCopyOption.REPLACE_EXISTING)
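If you only need the lines and not a local copy, you can read the `getObjectContent` stream directly with `scala.io.Source` instead of copying to disk. A minimal sketch of such a helper (my own addition, not from the answer above; it works on any `InputStream`, so the S3 call itself is unchanged):

```scala
import java.io.InputStream
import scala.io.Source

// Read all lines from an InputStream (e.g. the S3 object's content
// stream) and close the underlying stream when done.
def readLines(stream: InputStream): List[String] = {
  val source = Source.fromInputStream(stream)
  try source.getLines().toList
  finally source.close()
}

// With the S3 client above you would call:
//   val lines = readLines(s3.getObject(bucket, file_in_s3).getObjectContent)
```

This avoids the temporary file entirely, which is convenient when the file is small and you just want its contents in memory.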
