How do I generate and load multiple S3 file paths in Scala so that I can use:
sqlContext.read.json ("s3://..../*/*/*")
I know I can use wildcards to read multiple files, but is there a way to generate the paths? For example, my file structure looks like this: BucketName/year/month/day/files
s3://testBucket/2016/10/16/part00000
These files are all JSON. The issue is that I need to load the files for just a specific duration. For example, with a 16-day duration and a start day of Oct 16, I need to load the files for Oct 1 to Oct 16.
With a 28-day duration for the same start day, I would like to read from Sep 18.
Can someone tell me a way to do this?
You can take a look at this answer. You can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards. E.g.:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
Or you can use the AWS API to get the list of file locations and read those files using Spark.
You can also look at this answer on AWS S3 file search.
You can generate a comma-separated list of paths:
sqlContext.read.json("s3://testBucket/2016/10/16/,s3://testBucket/2016/10/15/,...")
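As a minimal sketch of how such a list of per-day paths could be generated from a start day and a duration, the following uses `java.time.LocalDate` to walk backwards from the start day. The object name `S3DatePaths` and the method `pathsFor` are hypothetical names chosen for illustration, and the code assumes month and day directories are zero-padded to two digits (as `10` and `16` in the example path suggest, but which the question does not confirm for single-digit values).

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of Spark or the AWS SDK.
object S3DatePaths {
  // Assumption: directories follow year/month/day with zero-padded month and day.
  private val dayFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  /** Returns one directory path per day, covering `days` days that end on
    * `start` (inclusive), ordered from the most recent day backwards.
    */
  def pathsFor(bucket: String, start: LocalDate, days: Int): Seq[String] =
    (0 until days).map { offset =>
      val day = start.minusDays(offset.toLong)
      s"s3://$bucket/${day.format(dayFormat)}/"
    }
}
```

For the 16-day case in the question, `S3DatePaths.pathsFor("testBucket", LocalDate.of(2016, 10, 16), 16)` yields the directories for Oct 16 back to Oct 1. The result could then be joined with `paths.mkString(",")` and passed as a single string as above, or, on Spark versions where the varargs overload is available, passed as `sqlContext.read.format("json").load(paths: _*)`. Note that counting inclusively, a 28-day duration ending on Oct 16 starts on Sep 19; if Sep 18 is intended, pass 29 or adjust the offset range.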