
Can I write a plain text HDFS (or local) file from a Spark program, not from an RDD?

I have a Spark program (in Scala) and a SparkContext. I am writing some files with RDD's saveAsTextFile. On my local machine I can use a local file path and it works with the local file system. On my cluster it works with HDFS.

I also want to write other arbitrary files as the result of processing. I'm writing them as regular files on my local machine, but want them to go into HDFS on the cluster.

SparkContext seems to have a few file-related methods, but they all seem to be inputs, not outputs.

How do I do this?

Thanks to marios and kostya, but there are a few steps to writing a text file into HDFS from Spark:

import java.io.BufferedOutputStream
import org.apache.hadoop.fs.{FileSystem, Path}

// The Hadoop configuration is accessible from the SparkContext
val fs = FileSystem.get(sparkContext.hadoopConfiguration)

// The output file can be created from the file system
val output = fs.create(new Path(filename))

// But a BufferedOutputStream must be used to output an actual text file
val os = new BufferedOutputStream(output)

os.write("Hello World".getBytes("UTF-8"))

os.close()

Note that FSDataOutputStream, which has been suggested, extends Java's DataOutputStream and is a binary output stream, not a text output stream. Its writeUTF method appears to write plain text, but it actually writes a binary format: a two-byte length prefix followed by the string in modified UTF-8.
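
To see the difference concretely, here is a minimal sketch (the file paths are just placeholders, and sparkContext is assumed to be in scope as above): writing the same string with writeUTF and with write produces two different files, and only the second is plain text.

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sparkContext.hadoopConfiguration)

// writeUTF: FSDataOutputStream extends java.io.DataOutputStream, so this writes
// a two-byte length prefix (0x00 0x0B) followed by "Hello World" in modified UTF-8
val binaryOut = fs.create(new Path("/tmp/not-plain-text.txt"))
binaryOut.writeUTF("Hello World")
binaryOut.close()

// write: just the raw UTF-8 bytes, i.e. an actual plain text file
val textOut = fs.create(new Path("/tmp/plain-text.txt"))
textOut.write("Hello World\n".getBytes("UTF-8"))
textOut.close()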

Here's what worked best for me (using Spark 2.0):

import java.io.BufferedOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val path = new Path("hdfs://namenode:8020/some/folder/myfile.txt")
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.setInt("dfs.blocksize", 16 * 1024 * 1024) // 16MB HDFS block size
val fs = path.getFileSystem(conf)
if (fs.exists(path))
    fs.delete(path, true)
val out = new BufferedOutputStream(fs.create(path))
val txt = "Some text to output"
out.write(txt.getBytes("UTF-8"))
out.flush()
out.close()
fs.close()
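
If you want the stream to be closed even when the write throws, the same thing can be wrapped in try/finally. This is just a sketch, reusing the path and spark session from the snippet above:

import java.io.BufferedOutputStream
import org.apache.hadoop.fs.Path

val path = new Path("hdfs://namenode:8020/some/folder/myfile.txt")
val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
val out = new BufferedOutputStream(fs.create(path, true)) // true = overwrite if it exists
try {
  out.write("Some text to output".getBytes("UTF-8"))
} finally {
  out.close() // close() flushes the buffer; the cached FileSystem is left open
}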

Using the HDFS API (hadoop-hdfs.jar) you can create an InputStream/OutputStream for an HDFS path and read from/write to the file using regular java.io classes. For example:

URI uri = URI.create("hdfs://host:port/file path");
Configuration conf = new Configuration();
FileSystem file = FileSystem.get(uri, conf);
FSDataInputStream in = file.open(new Path(uri));

This code will work with local files as well (change hdfs:// to file://).
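
For instance, a small Scala sketch of a text round trip using plain java.io wrappers might look like this (the namenode address and file name are placeholders):

import java.io.{BufferedReader, InputStreamReader, OutputStreamWriter, PrintWriter}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val uri  = URI.create("hdfs://namenode:8020/tmp/example.txt")
val fs   = FileSystem.get(uri, new Configuration())
val path = new Path(uri)

// Write: wrap the FSDataOutputStream in a PrintWriter
val writer = new PrintWriter(new OutputStreamWriter(fs.create(path), "UTF-8"))
writer.println("line one")
writer.println("line two")
writer.close()

// Read: wrap the FSDataInputStream in a BufferedReader
val reader = new BufferedReader(new InputStreamReader(fs.open(path), "UTF-8"))
println(reader.readLine()) // prints "line one"
reader.close()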

One simple way to write files to HDFS is to use SequenceFiles. Here you use the native Hadoop APIs and not the ones provided by Spark.

Here is a simple snippet (in Scala):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._ 

val conf = new Configuration() // Hadoop configuration
val sfwriter = SequenceFile.createWriter(conf,
              SequenceFile.Writer.file(new Path("hdfs://nn1.example.com/file1")),
              SequenceFile.Writer.keyClass(classOf[LongWritable]),
              SequenceFile.Writer.valueClass(classOf[Text]))
val lw = new LongWritable()
val txt = new Text()
lw.set(12)
txt.set("hello")
sfwriter.append(lw, txt)
sfwriter.close()
...

In case you don't have a key you can use classOf[NullWritable] in its place:

SequenceFile.Writer.keyClass(classOf[NullWritable])
sfwriter.append(NullWritable.get(), new Text("12345"))
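
To verify what was written, the file can be read back with SequenceFile.Reader. A sketch matching the LongWritable/Text writer above (reusing conf and the imports from that snippet):

val reader = new SequenceFile.Reader(conf,
             SequenceFile.Reader.file(new Path("hdfs://nn1.example.com/file1")))
val key = new LongWritable()
val value = new Text()
while (reader.next(key, value)) {
  println(s"$key -> $value") // e.g. "12 -> hello"
}
reader.close()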
