
How to iterate multiple text files in HDFS in Spark-Scala using a loop?

I'm working on a cluster. I need to run the same Spark operation on each text file contained in HDFS, but I want to do it without submitting a separate spark-submit command for each file from the shell, because there are 90 files. How can I do that?

My code for a single file is structured as follows:

import org.apache.spark.{SparkConf, SparkContext}

object SparkGraphGen {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("yarn")
      .setAppName("dataset")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    val peopleRDD = sc.textFile("file1.csv")
    ...
    do stuff
    ...
    sc.stop()
  }
}

Update:

  1. How about a foreach loop (an HDFS-based variant of this loop is sketched right after this list):

     val sc = new SparkContext(conf)
     //val files = new File("Data\\files\\").listFiles.map(_.getAbsolutePath).toList
     val files = new File("Data\\files\\").listFiles.map(_.getName).toList
     files.foreach { file =>
       //val lines = sc.textFile(file)
       val lines = sc.textFile("Data\\files\\" + file)
       println("total lines in file " + file + " " + lines.count())
       //do more stuff... for each file
       lines.saveAsTextFile("Data\\output\\" + file + "_output")
     }
     sc.stop()

    Output:

     total lines in file C:\Users\rpatel\workspaces\Spark\Data\files\file1.txt 4
     total lines in file C:\Users\rpatel\workspaces\Spark\Data\files\file2.txt 4

  2. You can also write the same for loop as a shell script:

     #!/bin/bash
     for file in $(hadoop fs -ls /hdfs/path/to/files/ | awk '{print $NF}')
     do
       # run spark for each file
       spark-submit <options> $file /path/output/$file
     done
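Since the files in question live in HDFS rather than on the local filesystem, the same foreach pattern can be driven by the Hadoop FileSystem API instead of java.io.File. Below is a minimal sketch, not the original answer's code: the directory /hdfs/path/to/files/ is reused from the shell example, and the output location and per-file logic are placeholders.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("yarn").setAppName("dataset"))

    // List the files sitting in the HDFS directory via the Hadoop FileSystem API
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val files = fs.listStatus(new Path("/hdfs/path/to/files/")) // placeholder directory
      .filter(_.isFile)
      .map(_.getPath.toString)

    // Reuse one SparkContext for all files instead of one spark-submit per file
    files.foreach { file =>
      val lines = sc.textFile(file)
      println("total lines in file " + file + " " + lines.count())
      // do more stuff... for each file
      lines.saveAsTextFile("/hdfs/path/output/" + new Path(file).getName + "_output") // placeholder output path
    }
    sc.stop()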

Or process all the files in one shot...

You can put all the files in one directory and pass just the directory path to the Spark context; Spark will process every file in that directory:

    val peopleRDD = sc.textFile("/path/to/csv_files/")
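If you still want a per-file result while reading the whole directory in one job, a hedged alternative (not from the original answer) is sc.wholeTextFiles, which returns (path, content) pairs so each record corresponds to one file. A minimal sketch, using the same placeholder directory and assuming each file is small enough to be held as a single record:

    // Each record is (file path, whole file content)
    val filesRDD = sc.wholeTextFiles("/path/to/csv_files/")
    // Example per-file operation: line counts, collected back to the driver as a small summary
    val lineCounts = filesRDD.map { case (path, content) => (path, content.split("\n").length) }.collect()
    lineCounts.foreach { case (path, n) => println("total lines in file " + path + " " + n) }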

You can also combine RDDs, like:

    val file1RDD = sc.textFile("file1.csv") 
    val file2RDD = sc.textFile("file2.csv")
    val allFileRDD = file1RDD ++ file2RDD // ++ nRDD
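With many inputs, the same combination can be built programmatically instead of naming every RDD by hand. A minimal sketch, where filePaths is a placeholder list standing in for your input paths:

    // Build one RDD per path, then union them all into a single RDD
    val filePaths = Seq("file1.csv", "file2.csv" /* , ... */)
    val allFileRDD = sc.union(filePaths.map(path => sc.textFile(path)))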

But with 90 files, I would put them all in one directory and use the directory path to process everything in one job...
