Spark - Skewed Input dataframe
I am working with a heavily nested, non-splittable JSON input dataset housed in S3. The files vary a lot in size - the smallest is 10 KB while others are 300 MB.
When reading the files with the code below, a simple repartition to the desired number of partitions leads to straggler tasks - most tasks finish within seconds, but one lasts a couple of hours and then runs into memory issues (missed heartbeats, heap space errors, etc.). I repartition in an attempt to randomize the partition-to-file mapping, since Spark may be reading the files in sequence, and files within the same directory tend to share the same nature - all large, all small, etc.
df = spark.read.json('s3://my/parent/directory')
df = df.repartition(396)  # repartition returns a new DataFrame; without assigning it back the call has no effect
# Settings (a few):
default parallelism = 396
total number of cores = 400
/parent1
    /file1
    /file2
    ...
    /file1000
/parent2
    /file1
I also tried randomizing the S3 prefixes, e.g.:
hashcode=FEFRE#$#$$#FE/parent1/file1
hashcode=#$#$#Cdvfvf@#/parent1/file1
But it didn't have any effect.
When I check the number of files assigned to each partition (each file becomes a single row in the dataframe due to its nested, unsplittable nature), I see between 2 and 32 files per partition. Is this because Spark packs files into partitions based on spark.sql.files.maxPartitionBytes - assigning only two files to a partition when the files are huge, and many more files to a single partition when the files are small?
Any recommendations to make the job work properly and distribute the tasks uniformly? The sizes of the input files cannot be changed, given the nature of the input.
Great job flattening the files to increase read speed. Prefixes, as you seem to understand, are related to buckets, and bucket read speed is related to the number of files under each prefix and their size. The approach you took will end up reading faster than your original strategy, but it will not help you with the skew of the data itself.
One thing you might consider is that your raw data and working data do not need to be the same set of files. There is a strategy of landing data and then pre-processing it for performance.
That is to say, keep the raw data in the format you have now, then make a copy of the data in a more convenient format for regular queries. (Parquet is the best choice for working with S3.)
Then create a view that is a union of the data in the 'landing zone' and the 'pre-processed' folder. This gives you a performant table with up-to-date data. If you are using the latest S3, you should get consistent reads, which lets you ensure you are querying all of the data. In the past, S3 was only eventually consistent, meaning you might miss some data while it was in transit; this issue is supposedly fixed in the current version of S3. Run this 'processing' as often as needed and you should have a performant table to run large queries on.
S3 was designed as cheap long-term storage. It's not made to perform quickly, but they've been trying to make it better over time.