How does the number of partitions affect `wholeTextFiles` and `textFile`?
In Spark, I understand how to use `wholeTextFiles` and `textFile`, but I'm not sure which to use when. Here is what I know so far:
In some cases one should use `wholeTextFiles`; otherwise use `textFile`.
I would think that by default, `wholeTextFiles` and `textFile` partition by file content and by lines, respectively. But both of them allow you to change the parameter `minPartitions`.
So, how does changing the number of partitions affect how these are processed?
For example, say I have one very large file with 100 lines. What would be the difference between processing it with `wholeTextFiles` and 100 partitions, and processing it with `textFile` (which partitions it line by line) using 100 partitions?
What is the difference between these?
For reference, `wholeTextFiles` uses `WholeTextFileInputFormat`, which extends `CombineFileInputFormat`.
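For comparison, `textFile` reads through Hadoop's `TextInputFormat`, which splits a file by byte ranges rather than by line count, with `minPartitions` passed along as a hint for the number of splits. The following is a minimal Python sketch of that split-size arithmetic as I understand it from Hadoop's `FileInputFormat`. The function names, the 128 MB block size, and the single-file setup are illustrative assumptions, not part of the Spark API.

```python
def split_size(total_bytes, min_partitions, block_size=128 * 1024 * 1024, min_size=1):
    """Approximate split size: max(min_size, min(goal_size, block_size))."""
    goal_size = total_bytes // max(min_partitions, 1)
    return max(min_size, min(goal_size, block_size))


def estimate_splits(total_bytes, min_partitions, block_size=128 * 1024 * 1024):
    """Estimate how many byte-range splits a single file of total_bytes yields."""
    size = split_size(total_bytes, min_partitions, block_size)
    remaining = total_bytes
    count = 0
    # Hadoop lets the last split run up to 10% over the target size.
    while remaining / size > 1.1:
        remaining -= size
        count += 1
    if remaining > 0:
        count += 1
    return count


# Hypothetical 1000-byte file holding 100 lines, requested with 100 partitions.
print(estimate_splits(1000, 100))  # 100
# Hypothetical 1 GB file with minPartitions=2: the block size caps the split size.
print(estimate_splits(1024 * 1024 * 1024, 2))  # 8
```

Under this model the splits are byte ranges, and a line is read by the split in which it starts, so 100 partitions over a 100-line file need not hold exactly one line each. Because the value is only a minimum, a large file can also end up with more partitions than requested.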
A couple of notes on `wholeTextFiles`.
Each record in the RDD returned by `wholeTextFiles` has the file name and the entire contents of the file. This means that a file cannot be split (at all).

Because it extends `CombineFileInputFormat`, it will try to combine groups of smaller files into one partition. If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set `minPartitions=2`, then I will likely get two partitions back instead.
Now if I were to set `minPartitions=3`, I will still get back two partitions, because the contract for `wholeTextFiles` is that each record in the RDD contains an entire file.
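The two-file scenario above can be sketched with a toy model in plain Python. It assumes that the maximum split size is the total size divided by `minPartitions` and that whole files are packed greedily until that size is reached. The function `pack_whole_files` and the file sizes are hypothetical, and the real `CombineFileInputFormat` also takes node and rack locality into account, so treat this as an illustration rather than the actual implementation.

```python
import math


def pack_whole_files(file_sizes, min_partitions):
    """Group whole files into partitions; a file is never split."""
    total = sum(file_sizes)
    # Assumed rule: target size per partition is total size / minPartitions.
    max_split = math.ceil(total / max(min_partitions, 1))
    partitions, current, current_size = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        # Close the partition once it reaches the target size.
        if current_size >= max_split:
            partitions.append(current)
            current, current_size = [], 0
    if current:
        partitions.append(current)
    return partitions


two_small_files = [10, 10]  # hypothetical file sizes in bytes
print(len(pack_whole_files(two_small_files, 1)))  # 1
print(len(pack_whole_files(two_small_files, 2)))  # 2
print(len(pack_whole_files(two_small_files, 3)))  # 2
```

Since each file is an indivisible record, asking for more partitions than there are files cannot produce more non-empty partitions in this model.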