How does spark sc.textFile work in detail?
I want to find out how sc.textFile works in detail.
I have found the textFile source code in SparkContext.scala, but it contains a lot of information about the scheduler, stages, and task submission. What I want to know is how sc.textFile reads files from HDFS and how sc.textFile uses wildcards to match multiple files.
Where can I find the source code?
Apache Spark uses the Hadoop client library for reading the file, so you have to read the hadoop-client source code to find out more:
https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java
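On the Hadoop side, wildcard (glob) expansion is done when FileInputFormat lists the input paths; under the hood it goes through FileSystem.globStatus. A minimal sketch of that same call, assuming a reachable file system and a hypothetical path, just to illustrate the mechanism:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
// Expand the glob the same way FileInputFormat does when listing inputs;
// "/data/logs/2015-*.txt" is a hypothetical pattern.
val matches = fs.globStatus(new Path("/data/logs/2015-*.txt"))
matches.foreach(status => println(status.getPath))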
textFile is a method of the org.apache.spark.SparkContext class that reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings.
sc.textFile(path, minPartitions)
> @param path path to the text file on a supported file system
> @param minPartitions suggested minimum number of partitions for the resulting RDD
> @return RDD of lines of the text file
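For example, a short usage sketch (the paths are hypothetical); note that the path argument also accepts directories, comma-separated lists, and glob patterns:

// Read a single file with a suggested minimum of 4 partitions.
val lines = sc.textFile("hdfs:///data/input.txt", 4)

// A glob pattern matches multiple files in one call.
val logs = sc.textFile("hdfs:///data/logs/2015-*.txt")

println(lines.count())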
It internally uses a HadoopRDD (an RDD that provides core functionality for reading data stored in Hadoop).
A HadoopRDD looks like this:
HadoopRDD(
  sc,                      // SparkContext
  confBroadcast,           // a general Hadoop Configuration, or a subclass of it
  Some(setInputPathsFunc), // optional closure used to initialize any JobConf that HadoopRDD creates
  inputFormatClass,
  keyClass,
  valueClass,
  minPartitions)
In the textFile method, a HadoopRDD is created with some hardcoded values:
HadoopRDD(
  sc,                      // SparkContext
  confBroadcast,           // a general Hadoop Configuration, or a subclass of it
  Some(setInputPathsFunc), // optional closure used to initialize any JobConf that HadoopRDD creates
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  minPartitions)
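For reference, the textFile method in SparkContext.scala (paraphrased from the Spark source; it goes through the hadoopFile helper, which builds the HadoopRDD above) looks roughly like this:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
  // hadoopFile constructs the HadoopRDD shown above, producing (LongWritable, Text)
  // pairs: the byte offset of each line and the line itself.
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions)
    .map(pair => pair._2.toString) // keep only the line text, dropping the offset key
    .setName(path)
}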
Because of these hardcoded values we can only read text files, so if we want to read any other type of file we use a HadoopRDD with a different InputFormat (for example via sc.hadoopFile), as shown below.
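As a hedged sketch (the file path is hypothetical), reading tab-separated key/value files with Hadoop's KeyValueTextInputFormat instead of TextInputFormat might look like:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.KeyValueTextInputFormat

// Each line is split on the first tab into a (key, value) pair of Text objects.
val kv = sc.hadoopFile(
  "hdfs:///data/pairs.txt",         // hypothetical path
  classOf[KeyValueTextInputFormat],
  classOf[Text],
  classOf[Text],
  2)                                // suggested minimum number of partitions
// Copy the reused Text objects into plain Strings before caching or collecting.
val pairs = kv.map { case (k, v) => (k.toString, v.toString) }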
The compute function in core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala does the actual reading. Here is some code from that function:
var reader: RecordReader[K, V] = null
val inputFormat = getInputFormat(jobConf)
// Tag the JobConf with task-specific info (stage, split index, attempt number).
HadoopRDD.addLocalConfiguration(new SimpleDateFormat("yyyyMMddHHmm").format(createTime),
  context.stageId, theSplit.index, context.attemptNumber, jobConf)
// Ask the InputFormat for a RecordReader over this task's input split.
reader = inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)
// Register an on-task-completion callback to close the input stream.
context.addTaskCompletionListener{ context => closeIfNeeded() }
// The reader allocates reusable key/value objects (e.g. LongWritable and Text
// for TextInputFormat) that are filled in on each call to reader.next().
val key: K = reader.createKey()
val value: V = reader.createValue()
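After this setup, compute wraps the reader in an iterator; paraphrased from the same file (the actual code lives in a NextIterator subclass, with EOF handling omitted here), the per-record step looks roughly like this:

// Paraphrased: each call advances the reader, reusing the same key/value
// objects; false from reader.next() means the split is exhausted.
override def getNext(): (K, V) = {
  finished = !reader.next(key, value)
  (key, value)
}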