
Spark: read CSV file from S3 using Scala

I am writing a Spark job and trying to read a text file using Scala. The following works fine on my local machine:

  import scala.io.Source

  // Read the local CSV line by line and load key/value pairs into the map.
  val myFile = "myLocalPath/myFile.csv"
  for (line <- Source.fromFile(myFile).getLines()) {
    val data = line.split(",")
    myHashMap.put(data(0), data(1).toDouble)
  }

Then I tried to make it work on AWS with the code below, but it didn't seem to read the entire file properly. What is the proper way to read such a text file from S3? Thanks a lot!

import java.io.{BufferedReader, InputStreamReader}
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.GetObjectRequest

val credentials = new BasicAWSCredentials("myKey", "mySecretKey")
val s3Client = new AmazonS3Client(credentials)
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myFile.csv"))

val reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()))

// Note: the Java idiom `while ((line = reader.readLine()) != null)` does not
// work in Scala -- assignment evaluates to Unit, so the null test never sees
// the line. Read first, then test:
var line = reader.readLine()
while (line != null) {
  val data = line.split(",")
  myHashMap.put(data(0), data(1).toDouble)
  println(line)
  line = reader.readLine()
}

I think I got it to work like below:

    val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"))

    val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
    for (line <- myData) {
        val data = line.split(",")
        myMap.put(data(0), data(1).toDouble)
    }

    println(" my map : " + myMap.toString())

Read in the CSV file with sc.textFile("s3://myBucket/myFile.csv"). That will give you an RDD[String]. Get that into a map:

val data = sc.textFile("s3://myBucket/myFile.csv")

val myHashMap = data.collect
                    .map(line => {
                      val substrings = line.split(",")  // split each CSV line into key and value
                      (substrings(0), substrings(1).toDouble)})
                    .toMap

You can then use sc.broadcast to broadcast your map, so that it is readily available on all your worker nodes.
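
A minimal sketch of that broadcast step (assuming sc is your SparkContext; someRdd is a hypothetical RDD of keys to look up):

    // Ship the map to every executor once, instead of once per task.
    val broadcastMap = sc.broadcast(myHashMap)

    // Inside a transformation, read the map through .value.
    val lookedUp = someRdd.map(key => broadcastMap.value.getOrElse(key, 0.0))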

(Note that you can of course also use the Databricks "spark-csv" package to read in the CSV file if you prefer.)
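
For completeness, a sketch of that spark-csv route (Spark 1.x-era API; the package version in the comment is only illustrative):

    // Launch with e.g.: spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")      // this file has no header row
      .option("inferSchema", "true")  // infer column types
      .load("s3://myBucket/myFile.csv")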

This can be achieved even without importing Amazon's S3 libraries, using SparkContext.textFile. Use the code below:

// Embed the credentials directly in the S3 URL -- no AWS SDK imports needed.
val s3Login = "s3://AccessKey:Securitykey@Externalbucket"
val filePath = s3Login + "/Myfolder/myscv.csv"

for (line <- sc.textFile(filePath).collect()) {
  val data = line.split(",")
  val value1 = data(0)
  val value2 = data(1).toDouble
}

In the above code, sc.textFile reads the file from S3 into an RDD of lines. Inside the loop, each line is split on "," into the data array, and the individual values are accessed by index.
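
If you would rather not embed the keys in the URL (secret keys containing / break that form), an alternative sketch is to set them on the Hadoop configuration instead; the fs.s3n property names below assume your cluster resolves S3 paths through the s3n filesystem:

    // Supply credentials via the Hadoop configuration rather than the URL.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "AccessKey")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "Securitykey")

    val lines = sc.textFile("s3n://Externalbucket/Myfolder/myscv.csv")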
