
Dask S3 access on EC2 workers

I am trying to read a large number of CSV files from S3 with workers running on EC2 instances that have the right IAM roles (I can read from the same buckets from other scripts). When I try to read my own data from a private bucket with this command:

from dask.distributed import Client
import dask.dataframe as dd

client = Client('scheduler-on-ec2')

# blocksize=None yields one partition per file (gzipped files are not splittable)
df = dd.read_csv('s3://xyz/*csv.gz',
                 compression='gzip',
                 blocksize=None,
                 # storage_options={'key': '', 'secret': ''}
                 )
df.size.compute()

It looks as if the data is read locally (by the local Python interpreter, not the workers) and then sent to the workers (or the scheduler?) by the local interpreter; once the workers receive the chunks, they run the computation and return the results. The behavior is the same with or without passing the key and the secret via storage_options.

When I read from a public S3 bucket (the NYC taxi data) with storage_options={'anon': True}, everything looks okay.
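For comparison, this is roughly what the working public-bucket read looks like (a minimal sketch; the exact bucket layout and object key are illustrative):

import dask.dataframe as dd

# Anonymous access: a public bucket needs no credentials.
taxi = dd.read_csv('s3://nyc-tlc/trip data/yellow_tripdata_2015-01.csv',
                   storage_options={'anon': True})
taxi.size.compute()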

What do you think the problem is, and what should I reconfigure or change to get the workers to read directly from S3?

s3fs is installed correctly, and these are the supported filesystems according to dask:

>>> dask.bytes.core._filesystems
{'file': dask.bytes.local.LocalFileSystem,
 's3': dask.bytes.s3.DaskS3FileSystem}
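This registry only proves that s3fs is importable in the local interpreter, though. A quick sketch to confirm the same on every worker (assuming client is the connected Client from above):

def s3fs_version():
    import s3fs
    return s3fs.__version__

# Client.run executes the function on each worker and returns a dict
# mapping worker address to result.
client.run(s3fs_version)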

Update

After monitoring the network interfaces, it looks like something is uploaded from the interpreter to the scheduler. The more partitions there are in the dataframe (or bag), the more data is sent to the scheduler. I thought it could be the computation graph, but it is really big: for 12 files it is 2-3 MB, for 30 files it is 20 MB, and for larger inputs (150 files) it takes so long to send to the scheduler that I didn't wait for it to finish. What else is being sent to the scheduler that could take up this amount of data?
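One way to test the graph hypothesis is to measure the graph locally before calling compute. A rough sketch (the .dask attribute and cloudpickle serialization only approximate what the distributed scheduler actually transfers, and details vary by Dask version):

import cloudpickle
import dask.dataframe as dd

df = dd.read_csv('s3://xyz/*csv.gz', compression='gzip', blocksize=None)

print(df.npartitions)                          # one partition per matched file
print(len(dict(df.dask)))                      # number of tasks in the graph
print(len(cloudpickle.dumps(dict(df.dask))))   # approximate serialized size in bytes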

When you call dd.read_csv('s3://...'), the local machine reads a small sample of the data in order to guess column names, dtypes, etc. The workers, however, read the majority of the data directly.
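If that up-front sampling is a concern, one option is to supply the schema yourself so the client has less to infer locally; a sketch with hypothetical column names:

import dask.dataframe as dd

# Passing dtypes up front reduces the sampling the client needs to do
# to build metadata; the workers still fetch the actual bytes from S3.
# 'col_a' and 'col_b' are placeholders for your real columns.
df = dd.read_csv('s3://xyz/*csv.gz',
                 compression='gzip',
                 blocksize=None,
                 dtype={'col_a': 'float64', 'col_b': 'object'})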

When using the distributed scheduler, Dask does not load data on the local machine and then pump it out to the workers. As you suggest, that would be inefficient.

You might want to look at the web diagnostic pages to get more information about what is taking time.
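For reference, a minimal way to locate those diagnostic pages (assuming a default setup, where the scheduler serves its dashboard on port 8787; the dashboard_link attribute exists on recent versions of distributed):

from dask.distributed import Client

client = Client('scheduler-on-ec2:8786')
# Recent versions expose the dashboard URL directly; otherwise browse
# to http://<scheduler-host>:8787 by default.
print(client.dashboard_link)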
