
Spark - Skewed Input dataframe

I am working with a heavily nested, non-splittable JSON input dataset housed in S3. The files vary a lot in size - the smallest is around 10 KB while others are as large as 300 MB.

When reading the files with the code below, simply repartitioning to the desired number of partitions leads to straggling tasks - most tasks finish within seconds, but one can last a couple of hours and then runs into memory issues (missing heartbeats, heap space errors, etc.). I repartition in an attempt to randomize the partition-to-file mapping, since Spark may be reading the files in sequence, and files within the same directory tend to have the same nature - all large, all small, etc.

df = spark.read.json('s3://my/parent/directory')
df = df.repartition(396)  # repartition returns a new DataFrame, so the result must be assigned

# Settings (a few):
default parallelism = 396
total number of cores = 400

What I tried:

  1. I figured that the input partitioning scheme (the S3 folder hierarchy, not Spark partitions) might be causing this skewed-partition problem, since some S3 folders (technically 'prefixes') have just one file while others have thousands. So I transformed the input into a flattened directory structure using a hash code, where each folder has just one file (a minimal sketch of this key flattening appears after this list):

Earlier:

/parent1

             /file1
             /file2
             .
             .
             /file1000

/parent2/
           /file1

Now:

   hashcode=FEFRE#$#$$#FE/parent1/file1
   hashcode=#$#$#Cdvfvf@#/parent1/file1

But it didn't have any effect.

  2. I have tried with really large clusters too, thinking that even if there is input skew, that much memory should be able to handle the larger files. But I still run into the straggling tasks.
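For reference, here is a minimal sketch of the kind of hash-based key flattening described in step 1; the hash function, digest length, and key layout are illustrative assumptions only, not the exact scheme used:

import hashlib

def flattened_key(original_key: str) -> str:
    # Prefix the original S3 key with a short hash so objects spread evenly
    # across many prefixes instead of piling up in a few deep folders.
    digest = hashlib.md5(original_key.encode("utf-8")).hexdigest()[:8]
    return f"hashcode={digest}/{original_key}"

print(flattened_key("parent1/file1"))  # -> hashcode=<8-char digest>/parent1/file1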

When I check the number of files assigned to each partition (each file becomes a single row in the dataframe due to its nested, unsplittable nature), I see between 2 and 32 files per partition. Is this because Spark packs files into partitions based on spark.sql.files.maxPartitionBytes - assigning only two files to a partition when the files are huge, and many more files to a single partition when the files are small?
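One way to look into this is to check the file-packing settings and count the distinct input files that land in each read partition. A minimal sketch, assuming the same spark session and input path as above (the two settings shown are the standard Spark options that control how input files are grouped into read partitions):

from pyspark.sql import functions as F

# Settings that control how Spark packs input files into read partitions
spark.conf.get("spark.sql.files.maxPartitionBytes")  # 128 MB by default
spark.conf.get("spark.sql.files.openCostInBytes")    # 4 MB by default

# Count distinct input files per read partition, before any repartition
raw = spark.read.json('s3://my/parent/directory')
files_per_partition = (
    raw.select(F.input_file_name().alias("file"),
               F.spark_partition_id().alias("pid"))
       .groupBy("pid")
       .agg(F.countDistinct("file").alias("n_files"))
       .orderBy("pid")
)
files_per_partition.show(400, truncate=False)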

Any recommendations for making the job work properly and distributing the tasks uniformly? The sizes of the input files cannot be changed, given the nature of the input.

Great job flattening the files to increase read speed. Prefixes, as you seem to understand, are related to buckets, and bucket read speed is related to the number of files under each prefix and their size. The approach you took will read faster than your original strategy. It will not, however, help you with skew in the data itself.

One thing you might consider is that your raw data and working data do not need to be the same set of files. There is a strategy of landing data and then pre-processing it for performance.

That is to say, keep the raw data in the format you have now, then make a copy of the data in a more convenient format for regular queries. (Parquet is the best choice for working with S3.)

  1. Land data in a 'landing zone'.
  2. As needed, process the data stored in the landing zone into a convenient, splittable format for querying (a 'pre-processed' folder); see the sketch after this list.
  3. Once your raw data is processed, move it to a 'processed' folder. (Use your existing flat folder structure.) Keeping this processed raw data is important should you need to rebuild the table or make changes to the table format.
  4. Create a view that is a union of the data in the 'landing zone' and the 'pre-processed' folder. This gives you a performant table with up-to-date data.
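A minimal sketch of steps 2 and 4, assuming hypothetical S3 paths and a placeholder view name (none of these come from the original post):

# Step 2 (sketch): convert landed JSON into a splittable, query-friendly format.
# The paths below are hypothetical placeholders.
landing_path = "s3://my-bucket/landing/"
preprocessed_path = "s3://my-bucket/preprocessed/"

spark.read.json(landing_path) \
     .repartition(396) \
     .write \
     .mode("append") \
     .parquet(preprocessed_path)

# Step 4 (sketch): a view that unions newly landed data with the pre-processed
# copy, so queries always see up-to-date rows.
landing_df = spark.read.json(landing_path)
processed_df = spark.read.parquet(preprocessed_path)
landing_df.unionByName(processed_df, allowMissingColumns=True) \
          .createOrReplaceTempView("events")  # placeholder name; allowMissingColumns needs Spark 3.1+

In this sketch, step 3 (moving already-converted files out of the landing zone) would happen outside Spark, for example with an S3 copy or lifecycle job, so that the union view does not see the same rows twice.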

If you are using the latest S3 you should get consistent reads, which let you ensure you are querying all of the data. In the past, S3 was eventually consistent, meaning you might miss some data while it was in transit; this issue is supposedly fixed in recent versions of S3. Run this 'processing' as often as needed and you should have a performant table to run large queries on.

S3 was designed as long-term, cheap storage. It's not made to perform quickly, but they've been trying to make it better over time.
