
Optimizing the storage space, query speed, and JSON column data in a Postgres table

Consider the following table that records the changes in prices of different products belonging to different companies of different categories.

     Column    |  Type  | Modifiers
-----------------+--------+-----------
 category_id   | bigint | not null
 product_id    | bigint | not null
 industry_id   | bigint | not null
 time          | bigint | not null
 price         | bigint | not null
 product_info  | json   | not null

Indexes:
    "price_change_pk" PRIMARY KEY, btree (category_id, product_id, price, "time")

Foreign-key constraints:
    "orders_industry_id" FOREIGN KEY (industry_id) REFERENCES industry_info(industry_id)
    "orders_product_id" FOREIGN KEY (product_id) REFERENCES device_info(product_id)
    "orders_category_id" FOREIGN KEY (category_id) REFERENCES category_info(category_id)

To be clear, the column values will be:

category_id - a separate table maps this id (a unique bigint value) to the category name - hundreds of categories

(Electronics, Fashion, Health, Sports, Toys, Books)

industry_id - a separate table maps this id (a unique bigint value) to the industry name - several thousand industries per category

(Nokia, Apple, Microsoft, PeterEngland, Rubik, Nivia, Cosco)

product_id - a separate table maps this id (a unique bigint value) to the product name - millions of products per industry

time (unix time in milliseconds, as bigint) - the time at which the price was modified

price - several thousand distinct values - (200, 10000, 14999, 30599, 450)

product_info - a json that holds the extra details of the product (the number of key/value pairs may vary)

{"seller": "ABC Assured", "discount": 10, "model": "XYZ", "EMIoption": true, "EMIvalue": 12, "festival_offer": 28, "market_stat": "comingsoon"}

The table is queried in several ways to analyze the trend of product price changes, as a chart, over a day/week/month in hour/day/week/month ranges. The trend may be based on the number of products, or the number of unique products, modified.

For example: Google Sample Trend

Storing the JSON as-is (as a string) uses more storage. So I tried storing each key and value from the JSON in a separate table with an incrementing serial id, and using those ids instead.

Like

Keys (citext, bigint)
seller - 1
discount - 2
model - 3
EMIoption - 4
EMIvalue - 5
festival_offer - 6
...
...
currency - 25

Values (citext, bigint)
ABC Assured - 1
10 - 2
XYZ - 3
true - 4
12 - 5
28 - 6
comingsoon - 7
...
...
ZYX - 106
rupees - 107
american dollars - 108
canadian dollars - 109
Prime seller - 110

{"seller": "ABC Assured", "discount": 10, "model": "XYZ", "EMIoption": true, "EMIvalue": 12, "festival_offer": 28, "market_stat": "comingsoon", "currency": "rupees"}

becomes

{"1":1, "2":2, "3":3, "4":4, "5":5, "6":6, "7":7, "25":107}


{"seller": "Prime seller", "discount": 10, "model": "XYZ", "EMIoption": true, "EMIvalue": 12, "festival_offer": 28, "market_stat": "comingsoon", "currency": "canadian dollars"}

becomes

{"1":110, "2":2, "3":3, "4":4, "5":5, "6":6, "7":7, "25":109}


For a data set of about 20M rows, this reduced storage by about 1.5GB.
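The lookup scheme above could be sketched as two dictionary tables (the table and column names here are illustrative assumptions, not taken from the original schema):

```sql
-- Hypothetical lookup tables for dictionary-encoding JSON keys and values.
CREATE TABLE json_keys (
    key_id   bigserial PRIMARY KEY,
    key_name citext NOT NULL UNIQUE
);

CREATE TABLE json_values (
    value_id   bigserial PRIMARY KEY,
    value_text citext NOT NULL UNIQUE
);

-- Encoding a document then means replacing each key and value with its id,
-- e.g. {"seller": "ABC Assured"}  ->  {"1": 1}
INSERT INTO json_keys (key_name) VALUES ('seller'), ('discount'), ('model');
INSERT INTO json_values (value_text) VALUES ('ABC Assured'), ('10'), ('XYZ');
```

The `citext` type requires `CREATE EXTENSION citext;` and matches the `(citext, bigint)` layout described above.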

As key-value cardinality increases, the serial numbers grow. So I tried storing the decimal ids as hexadecimals.

{"1":1, "2":2, "3":3, "4":4, "5":5, "6":6, "7":7, "25":107}

becomes

{"1":1, "2":2, "3":3, "4":4, "5":5, "6":6, "7":7, "19":"6B"}


{"1":110, "2":2, "3":106, "4":4, "5":5, "6":6, "7":7, "25":109}

becomes

{"1":"6E", "2":2, "3":"6A", "4":4, "5":5, "6":6, "7":7, "19":"6D"}


So, does storing these decimal integers as hexadecimal integers:

  1. Save storage space further? (because visually it seems compressed)
  2. Does the JSON retain the data types of keys/values, or are they stored as strings?
  3. Make the data more compressible?
  4. Improve read performance?
  5. Or can it be improved in any other way? (indexing, or anything else?)
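Regarding question 2, this is directly checkable in Postgres: JSON preserves the token type of each value, so a numeric id like 107 stays a number, while a hex id like "6B" has to be quoted and therefore stored as a string. A quick check using the standard `json_typeof` function:

```sql
SELECT json_typeof('{"25": 107}'::json -> '25');   -- number
SELECT json_typeof('{"19": "6B"}'::json -> '19');  -- string
```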

In a normal psql application, queries take several minutes to complete. Since this is time-series data, we use the TimescaleDB extension; its partitioning mechanism boosts query execution, but we need results in sub-seconds.
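For reference, the TimescaleDB setup for a table with a bigint millisecond `time` column would look roughly like this (the chunk interval is an assumption; the original post does not show the setup):

```sql
-- Convert price_change into a hypertable partitioned on the bigint "time" column.
-- With an integer time column, chunk_time_interval uses the same unit (here: ms).
SELECT create_hypertable('price_change', 'time',
                         chunk_time_interval => 86400000);  -- 1 day in milliseconds
```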

Query samples: to check how many times the price was changed to 500, for all products in a given category, grouped by day over a month.

select count(*), to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as unit, price 
from price_change 
where category_id = 1000000010 and time between 1514745000000 and 1517423400000 
  and price = 500 
group by price, unit;

To check how many times the price was changed to any of (100,200,300,400,500,600,700,800,900,1000), for all products in a given category, grouped by month over the last 10 months.

select count(*), to_char(date_trunc('month', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as unit, price 
from price_change 
where category_id = 1000000010 and time between  1514745000000 and 1517423400000  
   and price in (100,200,300,400,500,600,700,800,900,1000) group by price, unit;

To select the product details whose price has been changed in the given time range, in a given category:

select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp 
from price_change 
  join products using (product_id) 
where price_change.category_id = 1000000010 
  and price_change.time between 1514745000000 and 1517423400000;

To select the industry and product id details whose price has been changed in the given time range, in a given category:

select industry_id, product_id, price 
from price_change 
  join industries using (industry_id) 
where price_change.category_id = 1000000010 
  and price_change.time between 1514745000000 and 1517423400000;

To select product price change details with a 10% discount, in a given time range, in a specific category:

select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp 
from price_change 
  join products using (product_id) 
where price_change.category_id = 1000000010 
  and price_change.time between 1514745000000 and 1517423400000
  and (product_info->>'discount')::int = 10;

To select product price change details sold by a specific seller, in a given time range, in a specific category:

select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp 
from price_change 
  join products using (product_id) 
where price_change.category_id = 1000000010 
  and price_change.time between 1514745000000 and 1517423400000
  and product_info->>'seller'='ABC Assured';

In most cases, the query will not contain category_id in the select columns.

It would help if you also provided some examples of what you typically query on. There are different ways to optimize indexes and how data is written on disk, and they depend very much on what type of query you are running (more specifically, what is in your WHERE clause). If you are using WHERE clauses that look into the JSON, you should consider either breaking those fields out into columns, or building indexes on the JSON itself.
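For example, for the seller and discount filters shown above, expression indexes would cover those WHERE clauses. This is a sketch, not the asker's schema; the index names are made up, and the GIN variant assumes `product_info` is first converted from `json` to `jsonb`, since GIN operator classes exist only for `jsonb`:

```sql
-- Expression indexes on the specific JSON fields used in WHERE clauses.
CREATE INDEX idx_price_change_seller
    ON price_change ((product_info ->> 'seller'));
CREATE INDEX idx_price_change_discount
    ON price_change (((product_info ->> 'discount')::int));

-- Alternatively, after converting product_info to jsonb, one GIN index covers
-- containment queries such as: product_info @> '{"seller": "ABC Assured"}'
-- CREATE INDEX idx_price_change_info
--     ON price_change USING GIN (product_info jsonb_path_ops);
```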

It sounds like one of your concerns is storage. Because TimescaleDB and PostgreSQL are relational, they do take up more storage than, say, a columnar store that might feature better compression characteristics. You could also consider using something like ZFS to compress things at the filesystem level.
