
Amazon S3 Store Millions of Files

I am trying to find the most cost-effective way of doing this and would appreciate any help:

  • I have hundreds of millions of files. Each file is under 1 MB (usually around 100 KB).
  • In total this is over 5 TB of data as of now, and it will grow weekly.
  • I cannot merge/concatenate the files together. The files must be stored as is.
  • Query and download requirements are basic: around 1 million files will be selected and downloaded per month.
  • I am not worried about S3 storage, data retrieval, or data scan costs.

My question is: when I upload hundreds of millions of files, does this count as one PUT request per file (that is, one per object)? If so, the cost just to upload the data will be massive. If I upload a directory with a million files, is that one PUT request?

What if I zip the 100 million files on-premises, then upload the zip and use Lambda to unzip it? Would that count as one PUT request?

Any advice?

You say that you have "100s of millions of files", so I shall assume you have 400 million objects of about 100 KB each, making 40 TB of storage. Please adjust accordingly. I have shown my calculations so that people can help identify any errors.

Initial upload

PUT requests in Amazon S3 are charged at $0.005 per 1,000 requests. Therefore, 400 million PUTs would cost $2,000 (.005*400m/1000).

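This cost cannot be avoided if you wish to create them all as individual objects: each object written to S3 is one PUT request, whether it is uploaded directly, as part of a directory, or extracted from a zip by a Lambda function.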

Future uploads would cost the same, at $5 per million objects.
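Since the 400 million figure is only an assumption, here is a minimal sketch for re-running the upload estimate with a different object count. The function name is made up for illustration, and the rate is the one quoted above; actual prices vary by region and over time.

```python
# Rate quoted in the answer above (USD per 1,000 PUT requests).
# Treat it as an assumption: check current pricing for your region.
PUT_PRICE_PER_1000 = 0.005


def put_request_cost(object_count: int, price_per_1000: float = PUT_PRICE_PER_1000) -> float:
    """Return the request charge in USD for uploading object_count objects, one PUT each."""
    return object_count / 1000 * price_per_1000


if __name__ == "__main__":
    # Try a few plausible readings of "100s of millions of files".
    for count in (100_000_000, 400_000_000, 900_000_000):
        print(f"{count:>13,} objects: ${put_request_cost(count):,.2f}")
```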

Storage

Standard storage costs $0.023 per GB per month, so storing 400 million 100 KB objects would cost $920/month (.023*400m*100/1m).

Storage costs can be reduced by using lower-cost Storage Classes.

Access

GET requests are $0.0004 per 1,000 requests, so downloading 1 million objects each month would cost 40c/month (.0004*1m/1000).

If the data is being transferred to the Internet, Data Transfer costs of $0.09 per GB would apply. The Data Transfer cost of downloading 1 million 100 KB objects would be $9/month (.09*1m*100/1m).
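The recurring Storage and Access figures above can be reproduced with a short sketch like the one below. The function and constant names are hypothetical, the rates are the ones quoted in this answer, and sizes use decimal units (1 GB = 1,000,000 KB) to match the arithmetic shown.

```python
# Rates quoted in the answer above; assumptions, not current price-list values.
STORAGE_PRICE_PER_GB_MONTH = 0.023  # USD per GB per month, Standard storage
GET_PRICE_PER_1000 = 0.0004         # USD per 1,000 GET requests
TRANSFER_PRICE_PER_GB = 0.09        # USD per GB transferred out to the Internet
KB_PER_GB = 1_000_000               # decimal units, as in the calculations above


def monthly_cost(stored_objects: int, downloaded_objects: int, avg_size_kb: float = 100) -> dict:
    """Estimate recurring monthly charges in USD, broken down by component."""
    stored_gb = stored_objects * avg_size_kb / KB_PER_GB
    downloaded_gb = downloaded_objects * avg_size_kb / KB_PER_GB
    return {
        "storage": stored_gb * STORAGE_PRICE_PER_GB_MONTH,
        "get_requests": downloaded_objects / 1000 * GET_PRICE_PER_1000,
        "data_transfer": downloaded_gb * TRANSFER_PRICE_PER_GB,
    }


if __name__ == "__main__":
    for item, usd in monthly_cost(400_000_000, 1_000_000).items():
        print(f"{item:>13}: ${usd:,.2f}/month")
```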

Analysis

You seem to be most fearful of the initial cost of uploading 100s of millions of objects at a cost of $5 per million objects.

However, storage will also be expensive, at a cost of $2.30/month per million objects ($920/month for 400m objects). That ongoing cost is likely to dwarf the cost of the initial uploads.
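To see how quickly the ongoing storage charge overtakes the one-off upload charge, here is a rough sketch using the figures assumed above ($2,000 to upload 400m objects, $920/month to store them). It ignores growth in the data set, and the function name is made up for illustration.

```python
def months_until_storage_exceeds(upload_cost: float, monthly_storage_cost: float) -> int:
    """Return the first whole month in which cumulative storage cost exceeds the upload cost."""
    months = 0
    cumulative = 0.0
    while cumulative <= upload_cost:
        months += 1
        cumulative += monthly_storage_cost
    return months


if __name__ == "__main__":
    upload = 2000.0   # one-off PUT cost for 400m objects (assumed above)
    storage = 920.0   # monthly Standard storage cost for 40 TB (assumed above)
    month = months_until_storage_exceeds(upload, storage)
    print(f"Cumulative storage cost passes the upload cost in month {month}")
    print(f"Storage cost over one year: ${12 * storage:,.2f}")
```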

Some alternatives would be:

  • Store the data on-premises (disk storage is about $100 per 4 TB, so 40 TB for 400m files would require about $1,000 of disks, but you would want extra drives for redundancy), or
  • Store the data in a database: there are no 'PUT' costs for databases, but you would need to pay for running the database. This might work out at a lower cost, or
  • Combine the data in the files (which you say you do not wish to do), but in a way that can be easily split apart, for example by marking records with an identifier for easy extraction, or
  • Use a different storage service, such as Digital Ocean, which does not appear to have a 'PUT' cost.
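As an illustration of the "combine, but keep it easy to split apart" alternative, here is a minimal sketch that concatenates many small files into one blob and keeps an index of byte offsets, so each original file can be recovered byte-for-byte. The `pack` and `extract` helpers and the index layout are assumptions made for this example, not an established format. It runs entirely in memory; with the packed blob stored as a single S3 object, the slice in `extract` would correspond to a byte-range GET.

```python
from typing import Dict, Tuple
import io

Index = Dict[str, Tuple[int, int]]  # file name -> (offset, length) within the blob


def pack(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate file contents into one blob and record where each file sits."""
    blob = io.BytesIO()
    index: Index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), index


def extract(blob: bytes, index: Index, name: str) -> bytes:
    """Recover one original file, unchanged, from the packed blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


if __name__ == "__main__":
    originals = {"a.json": b'{"id": 1}', "b.json": b'{"id": 2}', "c.bin": bytes(range(10))}
    packed, idx = pack(originals)
    print(idx)
    print(extract(packed, idx, "b.json"))
```

Packing a thousand files per object this way would cut the number of PUT requests by a factor of a thousand, at the cost of having to store and look up the index somewhere.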

