Access publicly available data from S3 bucket link

Question

I am trying to access the data needed to reproduce the Redshift benchmarks on this page. In the "Run This Benchmark Yourself" section, the author says the data is available at the following S3 location, where the items in [] are replaced with the format and data size of interest:

s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]

Based on the above, I tried to download the data with a link like this:

http://s3.amazonaws.com/big-data-benchmark/pavlo/text/tiny/

But it does not work. Can someone explain how to get these datasets?

Answer 1

If I remove the "n" from s3n://, I can list the directory:

    $ aws s3 ls s3://big-data-benchmark/pavlo/text/tiny/
    PRE crawl/
    PRE rankings/
    PRE uservisits/
    2013-05-03 10:13:42          0 crawl_$folder$
    2013-05-09 07:23:17          0 rankings_$folder$
    2013-05-09 07:22:36          0 uservisits_$folder$

From there I can get individual object paths, e.g.:

s3://big-data-benchmark/pavlo/text/tiny/crawl/part-00000

whose HTTPS URL would be:

https://s3.amazonaws.com/big-data-benchmark/pavlo/text/tiny/crawl/part-00000
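The mapping from an S3 path to its HTTPS URL can be sketched in Python as below. This is a minimal illustration, not an official AWS API: the function name `s3_uri_to_https` is hypothetical, and the endpoint is assumed to be the same `https://s3.amazonaws.com` path-style form used in the URL above.

```python
from urllib.parse import quote, urlsplit

# Assumption: the path-style endpoint taken from the example URL above.
S3_ENDPOINT = "https://s3.amazonaws.com"


def s3_uri_to_https(uri: str) -> str:
    """Convert an s3:// or s3n:// URI to a path-style HTTPS URL (hypothetical helper)."""
    parts = urlsplit(uri)
    if parts.scheme not in ("s3", "s3n"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    if not parts.netloc:
        raise ValueError(f"missing bucket name: {uri!r}")
    # The part after the scheme is the bucket; the remaining path is the object key.
    return f"{S3_ENDPOINT}/{parts.netloc}{quote(parts.path)}"


if __name__ == "__main__":
    print(s3_uri_to_https("s3n://big-data-benchmark/pavlo/text/tiny/crawl/part-00000"))
```

The resulting URL refers to a single object such as part-00000. A prefix like .../text/tiny/ is not itself a downloadable file, which is likely why the link in the question did not work.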

Good luck!

solution1
2 ACCEPTED 2015-09-29 01:04:37