
Data partitioning in S3

We have our data in a relational database, in a single table with product id and date columns, like this:

productid    date    value1 value2
1         2005-10-26  24    27
1         2005-10-27  22    28
2         2005-10-26  12    18

We are trying to load the data into S3 as Parquet and create metadata in Hive so that we can query it using Athena and Redshift. Our most frequent queries will filter on product id, day, month and year, so we want to partition the data in a way that gives better query performance.

From what I understand, I can create the partitions like this:

s3://my-bucket/my-dataset/dt=2017-07-01/   
...
s3://my-bucket/my-dataset/dt=2017-07-09/   
s3://my-bucket/my-dataset/dt=2017-07-10/

or like this:

s3://mybucket/year=2017/month=06/day=01/
s3://mybucket/year=2017/month=06/day=02/
...
s3://mybucket/year=2017/month=08/day=31/

  1. Which layout will be faster to query, given that I have 7 years of data?
  2. How can I add partitioning by product id here, so that queries are faster?
  3. How can I create this folder structure with key=value names (s3://mybucket/year=2017/month=06/day=01/) using Spark with Scala? Any examples?

We partitioned like this:

s3://bucket/year/month/day/hour/minute/product/region/availabilityzone/

s3://bucketname/2018/03/01/11/30/nest/e1/e1a

The minute is rounded to 30 minutes. If your traffic is high, you can use a finer resolution for the minutes; if it is low, you can partition only by hour, or even by day.
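To make the layout concrete, here is a minimal sketch in plain Scala that builds such a prefix from a timestamp. The names `PartitionPrefix`, `bucketMinute` and `prefix` are hypothetical, and the sketch assumes that "rounded" means rounding down to the start of the 30-minute bucket; adjust it if you round to the nearest bucket instead.

```scala
import java.time.LocalDateTime

object PartitionPrefix {
  // Assumption: round the minute DOWN to the start of its bucket (0 or 30 by default).
  def bucketMinute(minute: Int, bucketSize: Int = 30): Int =
    (minute / bucketSize) * bucketSize

  // Build a prefix of the form
  // s3://bucket/year/month/day/hour/minute/product/region/availabilityzone/
  def prefix(bucket: String, ts: LocalDateTime, product: String,
             region: String, az: String): String = {
    val minute = bucketMinute(ts.getMinute)
    f"s3://$bucket/${ts.getYear}%04d/${ts.getMonthValue}%02d/${ts.getDayOfMonth}%02d/" +
      f"${ts.getHour}%02d/$minute%02d/$product/$region/$az/"
  }

  def main(args: Array[String]): Unit = {
    val ts = LocalDateTime.of(2018, 3, 1, 11, 47)
    // prints s3://bucketname/2018/03/01/11/30/nest/e1/e1a/
    println(prefix("bucketname", ts, "nest", "e1", "e1a"))
  }
}
```

Passing a smaller `bucketSize` (for example 10) gives the finer resolution mentioned above.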

This helped a lot, because our queries (using Athena or Redshift Spectrum) can be limited to the data we actually want and to the time range we are interested in.
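As for creating the key=value folders from Spark with Scala, `DataFrameWriter.partitionBy` writes that layout for you. The following is only a sketch under some assumptions: the bucket path is hypothetical, the small in-memory DataFrame stands in for the rows read from the relational table, and writing to `s3a://` assumes the Hadoop S3 connector and credentials are already configured on the cluster.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, dayofmonth, month, year}

object WritePartitionedParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitioned-parquet").getOrCreate()
    import spark.implicits._

    // Stand-in for the relational table from the question.
    val df = Seq(
      (1, "2005-10-26", 24, 27),
      (1, "2005-10-27", 22, 28),
      (2, "2005-10-26", 12, 18)
    ).toDF("productid", "date", "value1", "value2")
      .withColumn("date", col("date").cast("date"))

    // Derive the partition columns from the date column.
    val withParts = df
      .withColumn("year", year(col("date")))
      .withColumn("month", month(col("date")))
      .withColumn("day", dayofmonth(col("date")))

    // Produces directories such as .../year=2005/month=10/day=26/.
    // The values are integers, so June is written as month=6, not month=06.
    withParts.write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "day")
      .parquet("s3a://mybucket/my-dataset/") // hypothetical path

    spark.stop()
  }
}
```

Adding `"productid"` to the `partitionBy` call would create a `productid=...` level as well, but with many distinct products this can lead to a very large number of small files, so it is worth testing on your own data first.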

Hope it helps.

