
How does the number of partitions affect `wholeTextFiles` and `textFile`?

In Spark, I understand how to use `wholeTextFiles` and `textFile`, but I'm not sure which to use when. Here is what I know so far:

  • When dealing with files that should not be split by line, use `wholeTextFiles`; otherwise use `textFile`.

I would think that by default, `wholeTextFiles` partitions by file and `textFile` partitions by line. But both of them allow you to change the `minPartitions` parameter.

So, how does changing the partitions affect how these are processed?

For example, say I have one very large file with 100 lines. What would be the difference between processing it with `wholeTextFiles` using 100 partitions, and processing it with `textFile` (which partitions it line by line) using a `minPartitions` of 100?

What is the difference between these?

For reference, `wholeTextFiles` uses `WholeTextFileInputFormat`, which extends `CombineFileInputFormat`.

A couple of notes on `wholeTextFiles`:

  • Each record in the RDD returned by `wholeTextFiles` contains the file name and the entire contents of that file. This means a file cannot be split at all.
  • Because it extends `CombineFileInputFormat`, it will try to combine groups of smaller files into one partition.

If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set `minPartitions=2`, then I will likely get two partitions back instead.

Now, if I were to set `minPartitions=3`, I would still get back two partitions, because the contract of `wholeTextFiles` is that each record in the RDD contains an entire file.
