
Multiple file storage in HDFS using Apache Spark

I am working on a project that uses HDFS for storage and Apache Spark for computation. I have a directory in HDFS that contains several text files, all at the same depth. I want to process all of these files with Spark and store the results back in HDFS, with one output file for each input file.

For example, suppose I have a directory with 1000 text files in it, all at the same depth. I am reading all these files using wildcards.


Then I process them using Spark, get a corresponding RDD, and save it back to HDFS.


But this gives me the results of all the input files in one single file. I want to take each file, process it individually, and store its output separately.

What should my next approach be to achieve this?
Thanks in advance!

You can do this by using wholeTextFiles(). Note: the approach below processes the files one by one.

val data = sc.wholeTextFiles("hdfs://master:port/vijay/mywordcount/")

val files = data.map { case (filename, content) => filename }

def doSomething(file: String) = {
  println(file)

  // your logic for processing a single file goes here
  val logData = sc.textFile(file)
  val numAs = logData.filter(line => line.contains("a")).count()
  println("Lines with a: %s".format(numAs))

  // saving the processed RDD of this single file to HDFS goes here
}

// collect the file names on the driver, then process each file in turn
files.collect.foreach( filename => {
  doSomething(filename)
})

where:

  • hdfs://master:port/vijay/mywordcount/ - your HDFS directory
  • data - org.apache.spark.rdd.RDD[(String, String)], pairs of (filename, content)
  • files - org.apache.spark.rdd.RDD[String], the filenames
  • doSomething(filename) - your logic
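
The placeholder comment in doSomething leaves the save step open. One possible way to fill it in, as a minimal sketch, is to write each file's result under its own directory. The names PerFileSave, outputDirFor and processAndSave, and the outputRoot parameter, are hypothetical and not part of the answer. coalesce(1) assumes each per-file result is small enough to be written by a single task.

```scala
import org.apache.spark.SparkContext

object PerFileSave {
  // Hypothetical helper: one output directory per input file, named after the file.
  def outputDirFor(inputFile: String, outputRoot: String): String =
    outputRoot.stripSuffix("/") + "/" + inputFile.split('/').last

  // Same filter as doSomething, but the matching lines are written out instead of counted.
  def processAndSave(sc: SparkContext, inputFile: String, outputRoot: String): Unit = {
    val linesWithA = sc.textFile(inputFile).filter(line => line.contains("a"))
    // coalesce(1) yields a single part file (assumption: the result is small).
    linesWithA.coalesce(1).saveAsTextFile(outputDirFor(inputFile, outputRoot))
  }
}
```

It could be called from the driver loop as files.collect.foreach(f => PerFileSave.processAndSave(sc, f, outputRoot)). Note that saveAsTextFile creates a directory containing a part file, and by default it fails if that directory already exists.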

Update: multiple output files 更新:多个输出文件

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

/* hadoop */

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

/* java */
import java.io.Serializable;

import org.apache.log4j.Logger
import org.apache.log4j.Level

/* Custom TextOutput Format */
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // drop the key from the written records, so only the value (word\tcount) is stored
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // name each part file after its key, i.e. the input file name
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "-" + name   // for output hdfs://Output_dir/inputFilename-part-****
  //key.asInstanceOf[String] + "/" + name   // for output hdfs://Output_dir/inputFilename/part-**** [inputFilename as the directory of its part files]
}

/* Spark Context */
object Spark {
  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}

/* WordCount Processing */

object Process extends Serializable {
  def apply(filename: String): org.apache.spark.rdd.RDD[(String, String)] = {
    println("i am called.....")
    val simple_path = filename.split('/').last   // file name without its directory
    val lines = Spark.sc.textFile(filename)
    val counts = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)   // (word,count)
    val fname_word_counts = counts.map( x => (simple_path, x._1 + "\t" + x._2))   // (filename,word\tcount)
    fname_word_counts
  }
}

object SimpleApp {

  def main(args: Array[String]): Unit = {

    // input and output paths
    val INPUT_PATH = "hdfs://master:8020/vijay/mywordcount/"
    val OUTPUT_PATH = "hdfs://master:8020/vijay/mywordcount/output/"

    // context
    val context = Spark.sc
    val data = context.wholeTextFiles(INPUT_PATH)

    // final output RDD
    var output: org.apache.spark.rdd.RDD[(String, String)] = context.emptyRDD

    // files to process
    val files = data.map { case (filename, content) => filename }

    // Apply the word count processing to each file returned by wholeTextFiles.
    files.collect.foreach( filename => {
      output = output.union(Process(filename))
    })

    //output.saveAsTextFile(OUTPUT_PATH)   // this would save the output as (filename,word\tcount)
    output.saveAsHadoopFile(OUTPUT_PATH, classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])  // custom output format

    // close context
    context.stop()
  }
}

environment:

  • Scala compiler version 2.10.2
  • spark-1.2.0-bin-hadoop2.3
  • Hadoop 2.3.0-cdh5.0.3

sample output:

[ramisetty@node-1 stack]$ hadoop fs -ls /vijay/mywordcount/output
Found 5 items
-rw-r--r--   3 ramisetty supergroup          0 2015-06-09 03:49 /vijay/mywordcount/output/_SUCCESS
-rw-r--r--   3 ramisetty supergroup         40 2015-06-09 03:49 /vijay/mywordcount/output/file1.txt-part-00000
-rw-r--r--   3 ramisetty supergroup          8 2015-06-09 03:49 /vijay/mywordcount/output/file1.txt-part-00001
-rw-r--r--   3 ramisetty supergroup         44 2015-06-09 03:49 /vijay/mywordcount/output/file2.txt-part-00002
-rw-r--r--   3 ramisetty supergroup          8 2015-06-09 03:49 /vijay/mywordcount/output/file2.txt-part-00003
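
The names in this listing follow from generateFileNameForKeyValue, which joins the key (the input file's simple name) and the part name with a dash. The plain-Scala sketch below reproduces only that string logic, without Spark or Hadoop. OutputNaming and its methods are hypothetical names, and the input path ending in file1.txt is assumed from the listing and INPUT_PATH rather than stated in the answer.

```scala
object OutputNaming {
  // Simple name of an input path, as computed in Process.apply.
  def simpleName(path: String): String = path.split('/').last

  // Flat layout: <outputDir>/<inputFile>-part-NNNNN
  def flatName(key: String, partName: String): String = key + "-" + partName

  // Nested layout (the commented-out alternative): <outputDir>/<inputFile>/part-NNNNN
  def nestedName(key: String, partName: String): String = key + "/" + partName
}

object OutputNamingDemo extends App {
  val key = OutputNaming.simpleName("hdfs://master:8020/vijay/mywordcount/file1.txt")
  println(OutputNaming.flatName(key, "part-00000"))    // file1.txt-part-00000
  println(OutputNaming.nestedName(key, "part-00000"))  // file1.txt/part-00000
}
```

The part number comes from the partition that wrote the record, which is why one input file can appear under several part numbers in the listing.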
