
How to read a text file in S3 bucket from inside an AWS EMR without using spark

I need to open a regular text file located in an S3 bucket (NOT a Parquet or CSV file) from an EMR cluster. I can open CSV or Parquet files directly with spark.read.parquet("s3://mybucket/some_parq_file")

But I need to read a plain text file from the EMR cluster using java.io.File or scala.io.Source. I get a java.io.FileNotFoundException when I try:

import scala.io.Source
val hdr = "s3://mybucket/txtfile.txt"
for (line <- Source.fromFile(hdr).getLines) {
    println(line)
}
  1. You can supply a bootstrap script (an .sh file) that runs when the EMR cluster comes up; that script can access the S3 file (I have used this multiple times).
  2. You can submit EMR steps, which execute a jar file, and the jar can access S3.
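The bootstrap-script route from option 1 can be sketched as a short shell script. The bucket and file names below are placeholder assumptions; the `aws` CLI comes preinstalled on EMR nodes, and the instance profile supplies credentials:

```shell
#!/bin/bash
# Hypothetical EMR bootstrap action: copy a text file from S3 onto the node.
# Bucket/key names are placeholders -- substitute your own.
set -euo pipefail

SRC="s3://mybucket/txtfile.txt"   # assumed source object
DEST="/tmp/txtfile.txt"           # local destination on each node

aws s3 cp "$SRC" "$DEST"
```

You would register this script as a bootstrap action when creating the cluster, so it runs on every node before any steps start.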

I guess that most AWS setups already have credentials configured on your EMR cluster via the default credential provider chain and the default region provider chain. The same applies to AWS Lambda. So to access my S3 buckets from the EMR cluster, I simply had to use AmazonS3ClientBuilder:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

val bucket = "s3_bucket"
val file_in_s3 = "somefile.txt"
val dest = "/tmp/local_file.txt"

// Picks up credentials and region from the default provider chains
val s3 = AmazonS3ClientBuilder.defaultClient()
val stream = s3.getObject(bucket, file_in_s3).getObjectContent

// Copy the S3 object's content stream to a local file
Files.copy(stream, new File(dest).toPath, StandardCopyOption.REPLACE_EXISTING)
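If you only need the lines and not a local copy, you can read the `getObjectContent` stream directly with `scala.io.Source` instead of copying to disk. A minimal sketch of such a helper (my own addition, not from the answer above; it works on any `InputStream`, so the S3 call itself is unchanged):

```scala
import java.io.InputStream
import scala.io.Source

// Read all lines from an InputStream (e.g. the S3 object's content
// stream) and close the underlying stream when done.
def readLines(stream: InputStream): List[String] = {
  val source = Source.fromInputStream(stream)
  try source.getLines().toList
  finally source.close()
}

// With the S3 client above you would call:
//   val lines = readLines(s3.getObject(bucket, file_in_s3).getObjectContent)
```

This avoids the temporary file entirely, which is convenient when the file is small and you just want its contents in memory.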
