
How does the number of partitions affect `wholeTextFiles` and `textFile`?

In Spark, I understand how to use `wholeTextFiles` and `textFile`, but I'm not sure which to use when. Here is what I know so far:

  • When dealing with files that should not be split by line, use `wholeTextFiles`; otherwise use `textFile`.

I would think that by default, `wholeTextFiles` partitions by file and `textFile` partitions by line. But both of them allow you to change the `minPartitions` parameter.

So, how does changing the partitions affect how these are processed?

For example, say I have one very large file with 100 lines. What would be the difference between processing it with `wholeTextFiles` using 100 partitions, and processing it with `textFile` (which partitions it line by line) using a `minPartitions` of 100?

What is the difference between these?

For reference, `wholeTextFiles` uses `WholeTextFileInputFormat`, which extends `CombineFileInputFormat`.

A couple of notes on `wholeTextFiles`:

  • Each record in the RDD returned by `wholeTextFiles` contains the file name and the entire contents of that file. This means a file cannot be split at all.
  • Because it extends `CombineFileInputFormat`, it will try to combine groups of smaller files into one partition.

If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set `minPartitions=2`, then I will likely get two partitions back instead.

Now, if I were to set `minPartitions=3`, I would still get back two partitions, because the contract of `wholeTextFiles` is that each record in the RDD contains an entire file.
