Would S3 Select speed up Spark analyses on Parquet files?
You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much.
Let's say we have a data lake of people with first_name, last_name, and country columns.
If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 will transfer all the data for all the columns to the EC2 cluster to run the computation. This is really inefficient because we don't need any of the last_name and country data to run this query.
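As a toy illustration of that row-oriented access pattern (hypothetical sample data, plain Python rather than Spark): to answer the distinct-first_name query over a CSV object, every byte of every row has to travel, even though two of the three columns are never used.

```python
import csv
import io

# Toy CSV "object" standing in for a file in the data lake
# (hypothetical sample data, not from the original post).
raw = "first_name,last_name,country\nAlice,Smith,US\nBob,Jones,UK\nAlice,Brown,US\n"

# A plain GET has to fetch the whole object, all columns included.
bytes_transferred = len(raw.encode("utf-8"))

reader = csv.DictReader(io.StringIO(raw))
distinct_first_names = {row["first_name"] for row in reader}

print(len(distinct_first_names))  # 2
print(bytes_transferred)          # every column travelled, though only one was needed
```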
If the data is stored as CSV files and you run the query with S3 Select, then S3 will only transfer the data in the first_name column to run the query.
spark
.read
.format("s3select")
.schema(...)
.options(...)
.load("s3://bucket/filename")
.select("first_name")
.distinct()
.count()
If the data is stored in a Parquet data lake and peopleDF.select("first_name").distinct().count() is run, then S3 will only transfer the data in the first_name column to the EC2 cluster. Parquet is a columnar file format and this is one of its main advantages.
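The column-pruning idea can be sketched with a toy columnar layout in plain Python (hypothetical sample data; real Parquet adds row groups, encodings, and a footer on top of this, but the access pattern is the same):

```python
# Toy "columnar" layout: each column is stored (and fetched) independently,
# the way Parquet column chunks let a reader skip unneeded columns.
# Hypothetical sample data, not from the original post.
columns = {
    "first_name": ["Alice", "Bob", "Alice"],
    "last_name":  ["Smith", "Jones", "Brown"],
    "country":    ["US", "UK", "US"],
}

def column_bytes(name):
    """Bytes a reader would pull for one column chunk."""
    return sum(len(v.encode("utf-8")) for v in columns[name])

# Only the first_name column chunk needs to travel to the cluster:
needed = column_bytes("first_name")
total = sum(column_bytes(c) for c in columns)
distinct_count = len(set(columns["first_name"]))

print(distinct_count)  # 2
print(needed < total)  # True: last_name and country were skipped entirely
```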
So based on my understanding, S3 Select wouldn't help speed up an analysis on a Parquet data lake because columnar file formats offer the S3 Select optimization out of the box.
I am not sure, because a coworker is certain I am wrong and because S3 Select supports the Parquet file format. Can you please confirm that columnar file formats provide the main optimization offered by S3 Select?
This is an interesting question. I don't have any real numbers, though I did write the S3 Select binding code in the hadoop-aws module. Amazon EMR has some numbers, as does Databricks.
For CSV IO, yes: S3 Select will speed things up given aggressive filtering of the source data, e.g. many GB of data read but not much coming back. Why? Although the read is slower, you save on the limited bandwidth to your VM.
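For reference, this is roughly what that pushed-down projection looks like at the API level, assembled here as parameters for boto3's select_object_content call (the bucket and key names are hypothetical, and the call itself is left commented out since it needs real credentials and data):

```python
# Parameters for s3.select_object_content (boto3).
# Bucket and key names are hypothetical placeholders.
select_params = {
    "Bucket": "my-bucket",
    "Key": "people/part-0000.csv",
    "ExpressionType": "SQL",
    # The projection runs inside S3, so only first_name bytes come back:
    "Expression": "SELECT s.first_name FROM S3Object s",
    "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
    "OutputSerialization": {"CSV": {}},
}

# With real credentials and an object in place you would run:
#   import boto3
#   resp = boto3.client("s3").select_object_content(**select_params)
#   # then stream the "Records" events from resp["Payload"]

print(select_params["Expression"])  # SELECT s.first_name FROM S3Object s
```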
For Parquet though, the workers split up a large file into parts and schedule the work across them (assuming a splittable compression format like snappy is used), so more than one worker can work on the same file. And they only read a fraction of the data (so the bandwidth benefit is smaller), but they do seek around in that file (so you need to optimise the seek policy, or else pay the cost of aborting and reopening HTTP connections).
I'm not convinced that Parquet reads in the S3 cluster can beat a Spark cluster for performance, if there's enough capacity in the cluster and you've tuned your S3 client settings (for s3a this means: seek policy, thread pool size, HTTP pool size).
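As a hedged sketch, those three s3a knobs can be set on a Spark job like this (the job name is a placeholder and the values are illustrative starting points, not recommendations):

```shell
# Tune the s3a client for columnar (seek-heavy) reads:
spark-submit \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random \
  --conf spark.hadoop.fs.s3a.threads.max=64 \
  --conf spark.hadoop.fs.s3a.connection.maximum=96 \
  your_job.py
```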
Like I said though: I'm not sure. Numbers are welcome.
Came across this Spark package for S3 Select on Parquet: [1]

[1] https://github.com/minio/spark-select