Spark: read csv file from s3 using scala

I am writing a Spark job and trying to read a text file using Scala. The following works fine on my local machine:

  import scala.io.Source

  // read the local file line by line and load key/value pairs into the map
  val myFile = "myLocalPath/myFile.csv"
  for (line <- Source.fromFile(myFile).getLines()) {
    val data = line.split(",")
    myHashMap.put(data(0), data(1).toDouble)
  }

Then I tried to make it work on AWS. I did the following, but it didn't seem to read the entire file properly. What should be the proper way to read such a text file on s3? Thanks a lot!

import java.io.{BufferedReader, InputStreamReader}
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.GetObjectRequest

val credentials = new BasicAWSCredentials("myKey", "mySecretKey");
val s3Client = new AmazonS3Client(credentials);
val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myFile.csv"));

val reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));

var line = ""
while ((line = reader.readLine()) != null) {
      val data = line.split(",")
      myHashMap.put(data(0), data(1).toDouble)
      println(line);
}

I think I got it to work like below:

    val s3Object = s3Client.getObject(new GetObjectRequest("myBucket", "myPath/myFile.csv"));

    // stream the object contents line by line instead of buffering manually
    val myData = Source.fromInputStream(s3Object.getObjectContent()).getLines()
    for (line <- myData) {
        val data = line.split(",")
        myMap.put(data(0), data(1).toDouble)
    }

    println(" my map : " + myMap.toString())

Read in the csv-file with sc.textFile("s3://myBucket/myFile.csv"). That will give you an RDD[String]. Get that into a map:

val data = sc.textFile("s3://myBucket/myFile.csv")
val myHashMap = data.collect
                    .map(line => {
                      val substrings = line.split(",")
                      (substrings(0), substrings(1).toDouble)})
                    .toMap

You can then use sc.broadcast to broadcast your map, so that it is readily available on all your worker nodes, as sketched below.
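A minimal sketch of the broadcast step, assuming the myHashMap built above; the second file path here is hypothetical:

    // ship the lookup map to every executor once
    val broadcastMap = sc.broadcast(myHashMap)

    // inside transformations running on the workers, read it via .value
    val enriched = sc.textFile("s3://myBucket/otherFile.csv")
      .map(line => {
        val key = line.split(",")(0)
        (key, broadcastMap.value.getOrElse(key, 0.0))
      })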

(Note that you can of course also use the Databricks "spark-csv" package to read in the csv-file if you prefer.)
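For example, a sketch with the spark-csv package (Spark 1.x API; assumes a SQLContext named sqlContext and that the package is on the classpath, e.g. via --packages com.databricks:spark-csv_2.10:1.5.0):

    // read the csv into a DataFrame instead of a plain RDD[String]
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "true")
      .load("s3://myBucket/myFile.csv")
    df.show()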

This can be achieved even without importing the Amazon S3 libraries, by using SparkContext's textFile. Use the below code:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

// embed the access and secret keys directly in the s3 URL
val s3Login = "s3://AccessKey:Securitykey@Externalbucket"
val filePath = s3Login + "/Myfolder/myscv.csv"

// collect the RDD to the driver and split each line on the comma
for (line <- sc.textFile(filePath).collect())
{
    var data = line.split(",")
    var value1 = data(0)
    var value2 = data(1).toDouble
}

In the above code, sc.textFile reads the file into an RDD and the for loop iterates over its collected lines. Inside the loop each line is split on , into the data array, whose values you can then access by index.
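If you prefer not to put the keys in the URL, the credentials can also be set on the Hadoop configuration. This is only a sketch; the property names shown assume the s3n filesystem (s3a uses fs.s3a.access.key / fs.s3a.secret.key instead):

    // assumption: s3n scheme; adjust the property names for your filesystem scheme
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "AccessKey")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "Securitykey")
    val lines = sc.textFile("s3n://Externalbucket/Myfolder/myscv.csv")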
