How to write contents of RDD onto S3?

Question

I have an RDD containing vertex IDs and their x, y coordinates, and I want to write its contents to a text file. On my local machine I do that with this function:

  def printVertices(iterable: Iterable[Vertex], filename: String): Unit = {
    val pw = new PrintWriter(new File(filename))
    for (point <- iterable) {
      pw.write(point.id + ", " + point.coordinate.x + ", " + point.coordinate.y + "\n")
    }
    pw.close()
  }

printVertices(dt.points.collect, s"$output/points$id.txt")

In the code above, dt.points is an RDD. If I call RDD.saveAsTextFile, it writes the whole RDD, so I want to use my own method and write the file to S3 instead.

Answer 1

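To write an RDD as a text file to S3, use an s3a:// URI as the output path. Note that the s3a scheme is resolved by Hadoop's FileSystem API (which Spark uses, for example in saveAsTextFile); java.io.File does not understand it, so printVertices has to open its output through Hadoop's FileSystem for a call like this to work: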

printVertices(dt.points.collect, s"s3a://$bucketName/$output/points$id.txt")

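You also need the following JARs on the classpath if you are using Spark 2.2+. The hadoop-aws version should match the Hadoop version your Spark distribution was built against:

hadoop-aws-2.7.3.jar
aws-java-sdk-1.7.4.jar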
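Here is a minimal sketch of what a Hadoop-based printVertices could look like. It is not the asker's original code: Coordinate and Vertex are hypothetical stand-ins for the types in the question (which are not shown), and VertexWriter and formatVertex are names made up for illustration. It assumes the Hadoop client classes are on the classpath, which is the case inside a Spark application.

```scala
import java.io.{BufferedWriter, OutputStreamWriter}
import java.net.URI
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical stand-ins for the question's types, which are not shown.
final case class Coordinate(x: Double, y: Double)
final case class Vertex(id: Long, coordinate: Coordinate)

object VertexWriter {
  // Same "id, x, y" line format as the question's printVertices.
  def formatVertex(v: Vertex): String =
    s"${v.id}, ${v.coordinate.x}, ${v.coordinate.y}"

  // Writes one line per vertex to `uri`. The file is opened through Hadoop's
  // FileSystem, so the URI scheme (file:, hdfs:, s3a:) selects the backend.
  // An existing file at that path is overwritten.
  def printVertices(points: Iterable[Vertex], uri: String, conf: Configuration): Unit = {
    val fs = FileSystem.get(new URI(uri), conf)
    val out = new BufferedWriter(
      new OutputStreamWriter(fs.create(new Path(uri)), StandardCharsets.UTF_8))
    try points.foreach(p => out.write(formatVertex(p) + "\n"))
    finally out.close()
  }
}
```

With a SparkContext named sc (an assumption here), the call would look like VertexWriter.printVertices(dt.points.collect, s"s3a://$bucketName/$output/points$id.txt", sc.hadoopConfiguration). The collect still pulls every vertex to the driver, as in the question, and the s3a connector needs AWS credentials to be available in that configuration or in the environment.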

Answer 2

You could use the seratch/AWScala library. According to its docs, usage looks like this:

import awscala._, s3._
implicit val s3 = S3.at(Region.Tokyo)

val buckets: Seq[Bucket] = s3.buckets
val bucket: Bucket = s3.createBucket("unique-name-xxx")
val summaries: Seq[S3ObjectSummary] = bucket.objectSummaries

bucket.put("sample.txt", new java.io.File("sample.txt"))

So in your case, you first need to get the bucket:

val bucket: Bucket = s3.bucket("your bucket unique name").get

and then put the file (already written locally) into the bucket:

bucket.put(s"$output/points$id.txt", new java.io.File(s"$output/points$id.txt"))
