
Multiple file storage in HDFS using Apache Spark

I am doing a project that involves using HDFS for storage and Apache Spark for computation. I have a directory in HDFS which contains several text files, all at the same depth. I want to process all these files using Spark and store their corresponding results back to HDFS, with one output file for each input file.

For example - suppose I have a directory with 1000 text files in it, all at the same depth. I am reading all these files using wildcards:

sc.wholeTextFiles("hdfs://localhost:9000/home/akshat/files/*.txt")

Then I process them using Spark, get a corresponding RDD, and save that by using:

result.saveAsTextFile("hdfs://localhost:9000/home/akshat/final")

But this gives me the results of all the input files in one single output, whereas I want to take each file, process it individually, and store the output of each one individually.

What should be my next approach to achieve this?
Thanks in advance!

You can do this by using wholeTextFiles(). Note: the approach below processes the files one by one.

val data = sc.wholeTextFiles("hdfs://master:port/vijay/mywordcount/")

val files = data.map { case (filename, content) => filename}


def doSomething(file: String) = { 

 println (file);

 // your logic of processing a single file comes here

 val logData = sc.textFile(file);
 val numAs = logData.filter(line => line.contains("a")).count();
 println("Lines with a: %s".format(numAs));

 // save rdd of single file processed data to hdfs  comes here

}

files.collect.foreach( filename => {
    doSomething(filename)
})

where:

  • hdfs://master:port/vijay/mywordcount/ --- your HDFS directory
  • data - org.apache.spark.rdd.RDD[(String, String)]
  • files - org.apache.spark.rdd.RDD[String] - filenames
  • doSomething(filename) - your logic
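
To fill in the "save rdd of single file processed data to hdfs comes here" comment above, a minimal sketch of doSomething that writes each file's result to its own output location could look like the following (the output prefix hdfs://master:port/vijay/output/ is an assumption, adjust it to your cluster):

def doSomething(file: String) = {
  val simpleName = file.split('/').last                      // e.g. file1.txt
  val logData = sc.textFile(file)
  val numAs = logData.filter(line => line.contains("a")).count()
  // write this file's result to its own directory under the assumed output prefix
  sc.parallelize(Seq(simpleName + "\tlines with a: " + numAs))
    .saveAsTextFile("hdfs://master:port/vijay/output/" + simpleName)
}

Since saveAsTextFile creates one directory per call (e.g. .../output/file1.txt/part-00000), each input file ends up with its own output location.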

Update: multiple output files

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

/* hadoop */

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

/* java */
import java.io.Serializable;

import org.apache.log4j.Logger
import org.apache.log4j.Level

/* Custom TextOutput Format */
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    return key.asInstanceOf[String] +"-"+ name;   // for output hdfs://Output_dir/inputFilename-part-****
  //return key.asInstanceOf[String] +"/"+ name;   // for output hdfs://Output_dir/inputFilename/part-**** [inputFilename - as directory of its partFiles ]
}

/* Spark Context */
object Spark {
  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
}

/* WordCount Processing */

object Process extends Serializable{
  def apply(filename: String): org.apache.spark.rdd.RDD[(String, String)]= {
    println("i am called.....")
    val simple_path = filename.split('/').last;
    val lines = Spark.sc.textFile(filename);
    val counts     = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _); //(word,count)
    val fname_word_counts = counts.map( x => (simple_path,x._1+"\t"+ x._2));   // (filename,word\tcount)
    fname_word_counts
  }
}

object SimpleApp  {

        def main(args: Array[String]) {

            //Logger.getLogger("org").setLevel(Level.OFF)
            //Logger.getLogger("akka").setLevel(Level.OFF)

            // input and output paths
            val INPUT_PATH = "hdfs://master:8020/vijay/mywordcount/"
            val OUTPUT_PATH = "hdfs://master:8020/vijay/mywordcount/output/"

            // context
            val context = Spark.sc
            val data = context.wholeTextFiles(INPUT_PATH)

            // final output RDD
            var output : org.apache.spark.rdd.RDD[(String, String)] = context.emptyRDD

            // files to process
            val files = data.map { case (filename, content) => filename}


            // Apply wordcount Processing on each File received in wholeTextFiles.
            files.collect.foreach( filename => {
                            output = output.union(Process(filename));
            })

           //output.saveAsTextFile(OUTPUT_PATH);   // this will save output as (filename,word\tcount)
           output.saveAsHadoopFile(OUTPUT_PATH, classOf[String], classOf[String],classOf[RDDMultipleTextOutputFormat])  // custom output Format.

           //close context
           context.stop();

         }
}

environment:

  • Scala compiler version 2.10.2
  • spark-1.2.0-bin-hadoop2.3
  • Hadoop 2.3.0-cdh5.0.3
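
For reference, with this environment the application can be built with sbt and launched with spark-submit; the jar name below is a placeholder from a hypothetical sbt project, not part of the original post, and the master is already set in the code via setMaster:

sbt package
spark-submit --class SimpleApp target/scala-2.10/simple-app_2.10-1.0.jar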

sample output:

[ramisetty@node-1 stack]$ hadoop fs -ls /vijay/mywordcount/output
Found 5 items
-rw-r--r--   3 ramisetty supergroup          0 2015-06-09 03:49 /vijay/mywordcount/output/_SUCCESS
-rw-r--r--   3 ramisetty supergroup         40 2015-06-09 03:49 /vijay/mywordcount/output/file1.txt-part-00000
-rw-r--r--   3 ramisetty supergroup          8 2015-06-09 03:49 /vijay/mywordcount/output/file1.txt-part-00001
-rw-r--r--   3 ramisetty supergroup         44 2015-06-09 03:49 /vijay/mywordcount/output/file2.txt-part-00002
-rw-r--r--   3 ramisetty supergroup          8 2015-06-09 03:49 /vijay/mywordcount/output/file2.txt-part-00003
