
Can I read multiple files into a Spark Dataframe from S3, passing over nonexistent ones?

I would like to read multiple parquet files into a dataframe from S3. Currently, I'm using the following method to do this:

files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']
df = session.read.parquet(*files)

This works if all of the files exist on S3, but I would like to ask for a list of files to be loaded into a dataframe without breaking when some of the files in the list don't exist. In other words, I would like for sparkSql to load as many of the files as it finds into the dataframe, and return this result without complaining. Is this possible?
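One way I could imagine handling this is to filter the list down to the paths that actually exist before calling read.parquet, using the Hadoop FileSystem API through the JVM gateway. A minimal sketch, assuming session is the same SparkSession and the same example paths as above:

files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']

sc = session.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()

def path_exists(p):
    # Resolve the filesystem for this URI (s3a) and check whether the path exists.
    path = sc._jvm.org.apache.hadoop.fs.Path(p)
    return path.getFileSystem(hadoop_conf).exists(path)

# Keep only the paths that exist, then read them as before.
existing = [f for f in files if path_exists(f)]
df = session.read.parquet(*existing)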

Yes, it's possible if you change the method of specifying input to a Hadoop glob pattern, for example:

files = 's3a://dev/2017/01/{02,03}/data.parquet'
df = session.read.parquet(files)

You can read more on patterns in the Hadoop javadoc.

But, in my opinion, this isn't an elegant way of working with data partitioned by time (by day in your case). If you are able to rename the directories like this:

  • s3a://dev/2017/01/03/data.parquet --> s3a://dev/day=2017-01-03/data.parquet
  • s3a://dev/2017/01/02/data.parquet --> s3a://dev/day=2017-01-02/data.parquet

then you can take advantage of the Spark partitioning schema and read the data with:

from pyspark.sql.functions import col

session.read.parquet('s3a://dev/') \
    .where(col('day').between('2017-01-02', '2017-01-03'))

This way will omit empty/non-existing directories as well. An additional column day will appear in your dataframe (it will be a string in Spark < 2.1.0 and a datetime in Spark >= 2.1.0), so you will know in which directory each record exists.
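For completeness, the day=YYYY-MM-DD layout above is exactly what Spark produces when writing with partitionBy. A minimal sketch, assuming df already contains a day column and that s3a://dev/ is the output root:

# Writing partitioned by `day` creates day=2017-01-02/, day=2017-01-03/, ... directories.
df.write.partitionBy('day').parquet('s3a://dev/')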

Can I observe that as glob-pattern matching includes a full recursive tree-walk and pattern match of the paths, it is an absolute performance killer against object stores, especially S3. There's a special shortcut in Spark to recognise when your path doesn't have any glob characters in it, in which case it makes a more efficient choice.

Similarly, a very deep partitioning tree, as in that year/month/day layout, means many directories scanned, at a cost of hundreds of milliseconds (or worse) per directory.

The layout suggested by Mariusz should be much more efficient, as it is a flatter directory tree; switching to it should have a bigger impact on performance on object stores than on real filesystems.

A solution using union

files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']

# Read each file separately and union the results into a single dataframe.
for i, file in enumerate(files):
    act_df = spark.read.parquet(file)
    if i == 0:
        df = act_df
    else:
        df = df.union(act_df)

An advantage is that this works regardless of any naming pattern in the paths.
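If some of the paths in the list might not exist, the same loop can skip them by catching the AnalysisException that read.parquet typically raises for a missing path. A minimal sketch:

from functools import reduce
from pyspark.sql.utils import AnalysisException

files = ['s3a://dev/2017/01/03/data.parquet',
         's3a://dev/2017/01/02/data.parquet']

dfs = []
for file in files:
    try:
        dfs.append(spark.read.parquet(file))
    except AnalysisException:
        # Path does not exist (or is not readable as parquet) -- skip it.
        pass

# Union whatever was successfully read into a single dataframe.
df = reduce(lambda a, b: a.union(b), dfs)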

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

import boto3


sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)


inputDyf = glueContext.create_dynamic_frame.from_options(
    connection_type="parquet",
    connection_options={'paths': ["s3://dev-test-laxman-new-bucket/"]})

I am able to read multiple (2) parquet files from s3://dev-test-laxman-new-bucket/ and write them out as CSV files.

As you can see in the screenshot (omitted here), I have 2 parquet files in my bucket.
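The write step isn't shown above; a minimal sketch of writing the DynamicFrame back out as CSV with Glue (the output path below is a placeholder, not from the original post):

glueContext.write_dynamic_frame.from_options(
    frame=inputDyf,
    connection_type="s3",
    connection_options={"path": "s3://dev-test-laxman-new-bucket/output/"},
    format="csv")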

Hope it will be helpful to others.
