
Spark textFile vs wholeTextFiles

I understand the basic theory: textFile generates a partition for each file, while wholeTextFiles generates an RDD of pairs, where the key is the path of each file and the value is the content of that file.

Now, from a technical point of view, what's the difference between:

val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions

and

val textFile = sc.wholeTextFiles("my/path/*.csv",8)
textFile.getNumPartitions

In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?

The main difference, as you mentioned, is that textFile will return an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile.
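To make the shape difference concrete, here is a plain-Python sketch (no Spark needed) of what the two calls conceptually produce; the file names and contents below are made up for illustration:

```python
import os
import tempfile

# Create two tiny CSV files in a temporary directory (hypothetical data).
d = tempfile.mkdtemp()
for name, body in [("a.csv", "1,x\n2,y\n"), ("b.csv", "3,z\n")]:
    with open(os.path.join(d, name), "w") as f:
        f.write(body)

paths = sorted(os.path.join(d, n) for n in os.listdir(d))

# textFile-style result: one flat collection of lines; file identity is lost.
lines = [line for p in paths for line in open(p).read().splitlines()]
print(lines)  # ['1,x', '2,y', '3,z']

# wholeTextFiles-style result: one (path, content) pair per file.
pairs = [(p, open(p).read()) for p in paths]
print([(os.path.basename(p), c) for p, c in pairs])
# [('a.csv', '1,x\n2,y\n'), ('b.csv', '3,z\n')]
```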

When reading uncompressed files with textFile, it will split the data into chunks of 32 MB. This is advantageous from a memory perspective. It also means that the ordering of the lines is lost; if the order should be preserved, then wholeTextFiles should be used.

wholeTextFiles will read the complete content of a file at once; it won't be partially spilled to disk or partially garbage collected. Each file will be handled by one core, and the data for each file will be on a single machine, making it harder to distribute the load.

textFile generating partition for each file, while wholeTextFiles generates an RDD of pair values

That's not accurate:

  1. textFile loads one or more files, with each line as a record in the resulting RDD. A single file might be split into several partitions if the file is large enough (depending on the number of partitions requested, Spark's default number of partitions, and the underlying file system). When loading multiple files at once, this operation "loses" the relation between a record and the file that contained it, i.e. there is no way to know which file contained which line. The order of the records in the RDD follows the alphabetical order of the files and the order of records within each file (order is not "lost").

  2. wholeTextFiles preserves the relation between the data and the files that contained it by loading the data into a PairRDD with one record per input file. The record will have the form (fileName, fileContent). This means that loading large files is risky (it might cause bad performance or an OutOfMemoryError, since each file will necessarily be stored on a single node). Partitioning is done based on user input or Spark's configuration, with multiple files potentially loaded into a single partition.
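The per-file relation described in point 2 is what makes it possible to tag each line with its originating file, e.g. with a flatMap over the pairs. A plain-Python sketch of that transformation (the paths and contents below are made up):

```python
# Hypothetical (fileName, fileContent) pairs, as wholeTextFiles would return.
pairs = [
    ("hdfs://data/a.csv", "1,x\n2,y\n"),
    ("hdfs://data/b.csv", "3,z\n"),
]

# Equivalent of: pairs.flatMap { case (f, c) => c.split("\n").map(l => (f, l)) }
tagged = [(f, line)
          for f, content in pairs
          for line in content.splitlines()]
print(tagged)
# [('hdfs://data/a.csv', '1,x'), ('hdfs://data/a.csv', '2,y'),
#  ('hdfs://data/b.csv', '3,z')]
```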

Generally speaking, textFile serves the common use case of just loading a lot of data (regardless of how it's broken down into files). wholeTextFiles should only be used if you actually need to know the originating file name of each record, and if you know all files are small enough.

As of Spark 2.1.1, the following is the code for textFile:

def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

Internally it uses hadoopFile to read local files, HDFS files, or S3, using URI schemes like file://, hdfs://, and s3a://.

Whereas for wholeTextFiles the signature is as below:

def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope

If we observe, the signatures of both methods are the same, but textFile is useful for reading files, whereas wholeTextFiles is used to read directories of small files. It can also be used on larger files, but performance may suffer.
So when you want to deal with large files, textFile is the better option, whereas if you want to deal with a directory of smaller files, wholeTextFiles is better.

  1. textFile() reads a text file and returns an RDD of Strings. For example, sc.textFile("/mydata.txt") will create an RDD in which each individual line is an element.

  2. wholeTextFiles() reads a directory of text files and returns a PairRDD. For example, if there are a few files in a directory, the wholeTextFiles() method will create a pair RDD with the file name and path as the key and the whole file content as the string value.

See the example below for clarity:

textFile = sc.textFile("ml-100k/u1.data")
textFile.getNumPartitions()

Output - 2
i.e. 2 partitions

textFile = sc.wholeTextFiles("ml-100k/u1.data")
textFile.getNumPartitions()

Output - 1
i.e. only one partition.

So, in short, wholeTextFiles:

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned as a key-value pair, where the key is the path of each file and the value is the content of each file.
