Access publicly available data from S3 bucket link

I am trying to access the data needed to reproduce the Redshift benchmarks on this page. In the "Run This Benchmark Yourself" section, the author says the data is available in the following S3 bucket, where each item in [] is replaced with the format and data size of interest:

s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]

Based on that, I tried to download the data from a URL like this:

http://s3.amazonaws.com/big-data-benchmark/pavlo/text/tiny/

But it does not work. Can someone provide guidance on how to get these datasets?

If I remove the "n" from s3n:// (s3n is a Hadoop filesystem scheme, while the AWS CLI expects s3://), I can list the directory:

    $ aws s3 ls s3://big-data-benchmark/pavlo/text/tiny/
    PRE crawl/
    PRE rankings/
    PRE uservisits/
    2013-05-03 10:13:42          0 crawl_$folder$
    2013-05-09 07:23:17          0 rankings_$folder$
    2013-05-09 07:22:36          0 uservisits_$folder$

From there I can get individual object paths, e.g.:

s3://big-data-benchmark/pavlo/text/tiny/crawl/part-00000

whose HTTPS URL would be:

https://s3.amazonaws.com/big-data-benchmark/pavlo/text/tiny/crawl/part-00000
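The mapping from an S3 path to its HTTPS URL is mechanical, so it can be scripted for many part files. Here is a minimal sketch in Python, assuming the path-style endpoint used above. `s3_uri_to_https` is a hypothetical helper name, not part of any AWS library, and the sketch does not check whether the bucket still allows anonymous reads.

```python
from urllib.parse import urlparse


def s3_uri_to_https(uri, endpoint="https://s3.amazonaws.com"):
    """Map an s3:// or s3n:// URI to a path-style HTTPS URL.

    Hypothetical helper for illustration; it only rewrites the string
    and does not contact S3.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3n"):
        raise ValueError(f"not an S3 URI: {uri}")
    # netloc holds the bucket name; path holds the object key,
    # including its leading slash.
    return f"{endpoint}/{parsed.netloc}{parsed.path}"


print(s3_uri_to_https("s3n://big-data-benchmark/pavlo/text/tiny/crawl/part-00000"))
```

Note that this works for object keys such as part-00000. A prefix ending in a slash, like the tiny/ URL in the question, is not itself a downloadable object.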

Good luck!


 