
How to write streaming data to S3?

I want to write an RDD[String] to Amazon S3 in Spark Streaming using Scala. These are basically JSON strings. Not sure how to do it more efficiently. I found this post, in which the library spark-s3 is used. The idea is to create a SparkContext and then an SQLContext. After this, the author of the post does something like this:

myDstream.foreachRDD { rdd =>
      rdd.toDF().write
                .format("com.knoldus.spark.s3")
                .option("accessKey","s3_access_key")
                .option("secretKey","s3_secret_key")
                .option("bucket","bucket_name")
                .option("fileType","json")
                .save("sample.json")
}

What other options are there besides spark-s3? Is it possible to append streaming data to a file on S3?

Files on S3 cannot be appended. In S3, an "append" means replacing the existing object with a new object that contains the additional data.

You should take a look at the mode method of DataFrameWriter in the Spark documentation:

public DataFrameWriter mode(SaveMode saveMode)

Specifies the behavior when data or a table already exists. Options include:

- SaveMode.Overwrite: overwrite the existing data.
- SaveMode.Append: append the data.
- SaveMode.Ignore: ignore the operation (i.e. no-op).
- SaveMode.ErrorIfExists: default option, throw an exception at runtime.

You can try something like this with the Append save mode:

rdd.toDF.write
        .format("json")
        .mode(SaveMode.Append)
        .save("s3://iiiii/ttttt.json")  // save() is the DataFrameWriter method; saveAsTextFile only exists on RDDs

Spark Append:

Append mode means that when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
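
Putting this together with the streaming case from the question, a minimal sketch might look like the following (assuming myDstream is the DStream[String] of JSON strings from the question; the bucket/prefix is a placeholder). Since S3 objects cannot be appended in place, each micro-batch simply adds new part files under the same prefix:

import org.apache.spark.sql.{SQLContext, SaveMode}

myDstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Reuse the singleton SQLContext across batches
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    // Parse the JSON strings so the DataFrame gets a proper schema
    val df = sqlContext.read.json(rdd)
    df.write
      .mode(SaveMode.Append)                       // each batch adds new part files
      .format("json")
      .save("s3://my-bucket/streaming-output/")    // placeholder bucket/prefix
  }
}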

Basically, you can choose which output format you want by passing it to the format method:

public DataFrameWriter format(java.lang.String source)

Specifies the underlying output data source. Built-in options include "parquet", "json", etc.

e.g. as Parquet:

df.write().format("parquet").save("yourfile.parquet")

or as JSON:

df.write().format("json").save("yourfile.json")


Edit: added details about S3 credentials:

There are two different ways to set the credentials, as can be seen in SparkHadoopUtil.scala: with environment variables (System.getenv("AWS_ACCESS_KEY_ID")) or with spark.hadoop.foo properties:

SparkHadoopUtil.scala:

if (key.startsWith("spark.hadoop.")) {
  hadoopConf.set(key.substring("spark.hadoop.".length), value)
}

So you need to get the Hadoop configuration via javaSparkContext.hadoopConfiguration() or scalaSparkContext.hadoopConfiguration and set:

hadoopConfiguration.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConfiguration.set("fs.s3.awsSecretAccessKey", mySecretKey)
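
For example, a minimal sketch of both options (key names as above; the app name is a placeholder, and the key values are read from the environment here):

import org.apache.spark.{SparkConf, SparkContext}

// Option 1: spark.hadoop.* properties -- SparkHadoopUtil strips the
// "spark.hadoop." prefix and copies the rest into the Hadoop configuration.
val conf = new SparkConf()
  .setAppName("streaming-to-s3")   // placeholder app name
  .set("spark.hadoop.fs.s3.awsAccessKeyId", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
  .set("spark.hadoop.fs.s3.awsSecretAccessKey", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
val sc = new SparkContext(conf)

// Option 2: set the keys directly on the Hadoop configuration of an existing context.
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))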
