
Spark: How to read multiple s3 files using subset date

I am using Spark 2.1 on EMR, and my files are stored by date:

s3://test/2016/07/01/file.gz
s3://test/2016/07/02/file.gz
...
...
s3://test/2017/05/15/file.gz

I would like to read only the last month of data. I tried these two solutions, but neither matched my needs:

How to read multiple gzipped files from S3 into a single RDD

pyspark select subset of files using regex/glob from s3

Here is my script:

import datetime
from datetime import timedelta

from_dt = '2017/01/01'
to_dt = '2017/01/31'
d1 = datetime.datetime.strptime(from_dt, '%Y/%m/%d')  # start date
d2 = datetime.datetime.strptime(to_dt, '%Y/%m/%d')    # end date

delta = d2 - d1         # timedelta spanning the whole range

# build one 'YYYY/MM/DD' string per day in the range
date_range = []
for i in range(delta.days + 1):
    day = d1 + timedelta(days=i)
    date_range.append(day.strftime('%Y/%m/%d'))

# turn the Python list into a Hadoop-style {a,b,c} glob
d = str(date_range).replace('[','{').replace(']','}').replace('\'',"")

print d
'{2017/01/01, 2017/01/02, 2017/01/03, 2017/01/04, 2017/01/05, 2017/01/06, 2017/01/07, 2017/01/08, 2017/01/09, 2017/01/10, 2017/01/11, 2017/01/12, 2017/01/13, 2017/01/14, 2017/01/15, 2017/01/16, 2017/01/17, 2017/01/18, 2017/01/19, 2017/01/20, 2017/01/21, 2017/01/22, 2017/01/23, 2017/01/24, 2017/01/25, 2017/01/26, 2017/01/27, 2017/01/28, 2017/01/29, 2017/01/30, 2017/01/31}'

DF1 = spark.read.csv("s3://test/"+d+"/*", sep='|', header='true')

DF1.count()
output: 7000

When I do the same thing but put the path in manually, I don't get the same result:

DF2 = spark.read.csv("s3://test/2017/01/*/*", sep='|', header='true')

DF2.count()
output: 230000
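
One way to take the glob string out of the equation while debugging this kind of mismatch is to skip the string building entirely and hand Spark the expanded list of paths. A rough sketch of that idea, assuming (as the PySpark 2.x docs suggest) that spark.read.csv accepts a list of paths and that every daily prefix actually exists in the bucket:

# Sketch: pass an explicit list of per-day paths instead of one brace-glob string.
# Note: Spark raises an error for paths that match nothing, so this assumes
# every day in date_range has data under s3://test/.
paths = ["s3://test/" + day + "/*" for day in date_range]

DF3 = spark.read.csv(paths, sep='|', header='true')
DF3.count()  # should agree with the manually globbed DF2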

I found the error: the date range must be a single string with no spaces between the dates. str(date_range) separates the elements with ", ", so inside the glob every path after the first one starts with a space and matches nothing, which is why DF1 read only a fraction of the rows.

d = str(date_range).replace('[','{').replace(']','}').replace('\'',"").replace(' ',"")
print d
output: '{2017/01/01,2017/01/02,2017/01/03,2017/01/04,2017/01/05,2017/01/06,2017/01/07,2017/01/08,2017/01/09,2017/01/10,2017/01/11,2017/01/12,2017/01/13,2017/01/14,2017/01/15,2017/01/16,2017/01/17,2017/01/18,2017/01/19,2017/01/20,2017/01/21,2017/01/22,2017/01/23,2017/01/24,2017/01/25,2017/01/26,2017/01/27,2017/01/28,2017/01/29,2017/01/30,2017/01/31}'
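
A cleaner way to avoid the problem is to build the glob with ','.join instead of post-processing str(date_range); joining the list yourself never introduces spaces, so no .replace() clean-up is needed. A small sketch using the same date_range list:

# Build the Hadoop-style {a,b,c} glob directly; ','.join adds no spaces.
d = '{' + ','.join(date_range) + '}'
DF1 = spark.read.csv("s3://test/" + d + "/*", sep='|', header='true')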
