Would S3 Select speed up Spark analyses on Parquet files?

You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much.

Let's say we have a data lake of people with first_name, last_name, and country columns.

If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 will transfer all the data for all the columns to the EC2 cluster to run the computation. This is really inefficient because we don't need all the last_name and country data to run this query.
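
To make that baseline concrete, here is a minimal sketch of the plain CSV read (bucket path and layout are made up for illustration); every column of every row has to travel from S3 to the executors before Spark can throw away last_name and country:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Plain CSV read: no S3 Select, so S3 streams the entire objects to the cluster
val peopleDF = spark
  .read
  .option("header", "true")
  .csv("s3://some-bucket/people/")   // hypothetical path

peopleDF.select("first_name").distinct().count()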

If the data is stored as CSV files and you run the query with S3 Select, then S3 will only transfer the data in the first_name column to run the query.

spark
  .read
  .format("s3select")
  .schema(...)
  .options(...)
  .load("s3://bucket/filename")
  .select("first_name")
  .distinct()
  .count()
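
Conceptually, the projection gets pushed down in S3 Select's SQL dialect, i.e. S3 is asked to evaluate something along the lines of SELECT s.first_name FROM S3Object s, so only that column's bytes leave S3. The exact query generated depends on the connector.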

If the data is stored in a Parquet data lake and peopleDF.select("first_name").distinct().count() is run, then S3 will only transfer the data in the first_name column to the EC2 cluster. Parquet is a columnar file format, and this is one of its main advantages.
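
Here is a sketch of the same query against the Parquet copy of the data (path again hypothetical, reusing the spark session from the sketch above); Spark's built-in Parquet reader does the column pruning itself:

// Parquet lays each column out in separate column chunks, so the reader issues
// ranged GETs for just the first_name chunks; last_name and country bytes
// never leave S3.
val peopleParquetDF = spark.read.parquet("s3://some-bucket/people_parquet/")

peopleParquetDF.select("first_name").distinct().count()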

So based on my understanding, S3 Select wouldn't help speed up an analysis on a Parquet data lake, because columnar file formats offer the S3 Select optimization out of the box.

I am not sure, because a coworker is certain I am wrong and because S3 Select supports the Parquet file format. Can you please confirm that columnar file formats provide the main optimization offered by S3 Select?

This is an interesting question. I don't have any real numbers, though I have done the S3 Select binding code in the hadoop-aws module. Amazon EMR has some values, as does Databricks.

For CSV IO, yes: S3 Select will speed things up given aggressive filtering of the source data, e.g. many GB of data scanned but not much coming back. Why? Although the read itself is slower, you save on the limited bandwidth to your VM.
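
As a rough sketch of that "aggressive filtering" case, reusing the hypothetical s3select source from the question (the exact format name and options depend on the platform), the filter runs inside S3, so a multi-GB CSV scan hands only the matching rows back to the cluster:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val peopleSchema = StructType(Seq(
  StructField("first_name", StringType),
  StructField("last_name", StringType),
  StructField("country", StringType)))

spark
  .read
  .format("s3select")                     // as written in the question; EMR and Databricks use their own source names
  .schema(peopleSchema)
  .load("s3://some-bucket/people.csv")    // hypothetical path
  .filter(col("country") === "Iceland")   // highly selective: most bytes stay in S3
  .count()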

For Parquet, though, the workers split a large file up into parts and schedule the work across them (assuming a splittable compression format like snappy is used), so more than one worker can work on the same file. They also only read a fraction of the data (so the bandwidth benefit is smaller), but they do seek around in that file (so you need to optimise the seek policy, or pay the cost of aborting and reopening HTTP connections).

I'm not convinced that Parquet reads in the S3 cluster can beat a Spark cluster if there's enough capacity in the cluster and you've tuned your S3 client settings (for s3a this means: seek policy, thread pool size, HTTP pool size) for performance too.
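
For reference, a sketch of the s3a knobs that sentence refers to; the values here are purely illustrative, not recommendations:

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.experimental.input.fadvise", "random") // seek policy; "random" suits Parquet's ranged reads
hadoopConf.set("fs.s3a.threads.max", "64")                    // thread pool size
hadoopConf.set("fs.s3a.connection.maximum", "96")             // HTTP connection pool size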

Like I said though: I'm not sure. Numbers are welcome.

Came across this Spark package for S3 Select on Parquet [1].

[1] https://github.com/minio/spark-select
