Data partitioning in S3
We have our data in a relational database, in a single table with id and date columns, like this:

productid  date        value1  value2
1          2005-10-26  24      27
1          2005-10-27  22      28
2          2005-10-26  12      18
We are trying to load the data into S3 as Parquet and create metadata in Hive so we can query it using Athena and Redshift. Our most frequent queries will filter on product id, day, month, and year, so we are trying to lay out the data partitions in a way that gives better query performance.
From what I understood, I can create the partitions like this:
s3://my-bucket/my-dataset/dt=2017-07-01/
...
s3://my-bucket/my-dataset/dt=2017-07-09/
s3://my-bucket/my-dataset/dt=2017-07-10/
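For illustration, the per-day layout can be sketched by generating the dt= prefixes from a date range. This is a minimal sketch; `dt_partition_key` is a hypothetical helper, and the bucket/dataset names are the ones from the paths above:

```python
from datetime import date, timedelta

def dt_partition_key(bucket: str, dataset: str, day: date) -> str:
    """Hive-style single-column partition: s3://bucket/dataset/dt=YYYY-MM-DD/"""
    return f"s3://{bucket}/{dataset}/dt={day.isoformat()}/"

# One prefix per day for 2017-07-01 .. 2017-07-10
start = date(2017, 7, 1)
keys = [dt_partition_key("my-bucket", "my-dataset", start + timedelta(days=i))
        for i in range(10)]
print(keys[0])   # s3://my-bucket/my-dataset/dt=2017-07-01/
print(keys[-1])  # s3://my-bucket/my-dataset/dt=2017-07-10/
```

With this layout, a single dt partition column carries the full date, so day-level filters prune directly, but month- or year-level filters have to match on a range of dt values.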
or like this:
s3://mybucket/year=2017/month=06/day=01/
s3://mybucket/year=2017/month=06/day=02/
...
s3://mybucket/year=2017/month=08/day=31/
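The second layout splits the date into separate year/month/day partition columns, so queries can prune on year, month, or day independently. A minimal sketch of building those prefixes (`ymd_partition_key` is a hypothetical helper; the bucket name comes from the paths above):

```python
from datetime import date

def ymd_partition_key(bucket: str, day: date) -> str:
    """Hive-style nested partitions: s3://bucket/year=YYYY/month=MM/day=DD/"""
    return (f"s3://{bucket}/year={day.year}/"
            f"month={day.month:02d}/day={day.day:02d}/")

print(ymd_partition_key("mybucket", date(2017, 6, 1)))
# s3://mybucket/year=2017/month=06/day=01/
```

Since the most frequent queries also filter on product id, product could be added as a further partition column in the same key=value style, at the cost of more (and smaller) partitions.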
We partitioned like this:
s3://bucket/year/month/day/hour/minute/product/region/availabilityzone/
s3://bucketname/2018/03/01/11/30/nest/e1/e1a
The minute is rounded down to 30-minute buckets. If your traffic is high, you can use a finer resolution for the minutes; otherwise you can coarsen it to hourly or even daily partitions.
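The rounding described above can be sketched as follows. `event_partition_key` is a hypothetical helper; the product/region/availability-zone values follow the example path:

```python
from datetime import datetime

def event_partition_key(bucket: str, ts: datetime, product: str,
                        region: str, az: str, resolution_min: int = 30) -> str:
    """Build s3://bucket/year/month/day/hour/minute/product/region/az/,
    rounding the minute down to the given resolution (default 30)."""
    minute = (ts.minute // resolution_min) * resolution_min
    return (f"s3://{bucket}/{ts.year}/{ts.month:02d}/{ts.day:02d}/"
            f"{ts.hour:02d}/{minute:02d}/{product}/{region}/{az}/")

print(event_partition_key("bucketname", datetime(2018, 3, 1, 11, 42),
                          "nest", "e1", "e1a"))
# s3://bucketname/2018/03/01/11/30/nest/e1/e1a/
```

Changing `resolution_min` (e.g. to 60 for hourly buckets) is how the resolution is coarsened or refined without altering the key layout.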
It helped a lot, depending on what data we want to query (using Athena or Redshift Spectrum) and over what time range.
Hope it helps.