
Batching / Splitting a PostgreSQL database

I am working on a project that processes data in batches and fills a PostgreSQL database (9.6, but I could upgrade). Processing happens in separate steps, and each step adds data to a table that it owns (two processes rarely write to the same table; when they do, they write to different columns).

Because of the nature of the data, it becomes more and more fine-grained with each step. As a simplified example, I have one table defining the data sources. There are very few of them (in the tens to low hundreds), but each data source generates batches of data samples (batches and samples are separate tables, to store metadata). Each batch typically generates about 50k samples. Each of these data points then gets processed step by step, and each data sample generates more data points in the next table.

This worked fine until we got to 1.5 million rows in the sample table (which is not a lot of data from our point of view). Now filtering for a batch is becoming slow (about 10 ms for each sample we retrieve). It is becoming a major bottleneck, because getting the data for a batch takes 5-10 minutes (the fetching itself is on the order of ms).

We have b-tree indices on all foreign keys involved in these queries.

Since our computations target individual batches, I normally do not need to query across batches during the computation (which is where the query time hurts the most at the moment). However, for data analysis, ad-hoc queries across batches need to remain possible.

So a very simple solution would be to generate an individual database for each batch and somehow query across these databases when I need to. If I had only one batch in each database, filtering for a single batch would obviously be instant and my problem would be solved (for now). However, I would then end up with thousands of databases, and the data analysis would be painful.

Within PostgreSQL, is there a way of pretending that I have separate databases for some queries? Ideally I would like to set that up for each batch when I "register" a new batch.

Outside of the world of PostgreSQL, is there another database I should try for my use case?
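One PostgreSQL feature that comes close to "a separate database per batch" is declarative partitioning. The following is only a sketch under assumptions: it needs PostgreSQL 11 or later (9.6 has no declarative partitioning), the table name sample_partitioned and the partition name are hypothetical, and the batch UUID is just the example value used elsewhere in this post. Note that the primary key of a partitioned table has to include the partition key, so a foreign key from sample_representation would have to reference (id_batch, id), which requires PostgreSQL 12 or later.

```sql
-- Hypothetical partitioned variant of the sample table (PostgreSQL 11+).
CREATE TABLE sample_partitioned (
    id         uuid    NOT NULL DEFAULT uuid_generate_v1mc(),
    sample_pos integer,
    id_batch   uuid    NOT NULL,
    -- The primary key must contain the partition key column.
    PRIMARY KEY (id_batch, id)
) PARTITION BY LIST (id_batch);

-- Created once when a new batch is "registered" (partition name is made up).
CREATE TABLE sample_batch_75c04b9c PARTITION OF sample_partitioned
    FOR VALUES IN ('75c04b9c-e4b9-11e7-a93f-132baa27ac91');

-- Per-batch query: the planner only needs to look at the matching partition.
SELECT id, sample_pos
FROM sample_partitioned
WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91';

-- Ad-hoc analysis across batches still works against the parent table.
SELECT id_batch, count(*) AS n_samples
FROM sample_partitioned
GROUP BY id_batch;
```

With thousands of batches this means thousands of partitions, so planning time is worth measuring before committing to such a layout.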

Edit: DDL / Schema

In our current implementation, sample_representation is the table that all processing results depend on. A batch is truly defined by a tuple of (batch.id, representation.id). The query I tried and described above as slow (10 ms for each sample, adding up to around 5 min for 50k samples) is:

SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'

We currently have somewhere around 1.5 million samples, 2 representations, and 460 batches (of which 49 have been processed; the others do not have samples associated with them), which means each processed batch has 30k samples on average. Some have around 50k.

The schema is below. There is some metadata associated with all tables, but I am not querying for it in this case. The actual sample data are stored separately on disk and not in the database, in case that makes a difference.


create table batch
(
    id uuid default uuid_generate_v1mc() not null
        constraint batch_pk
            primary key,
    path text not null
        constraint unique_batch_path
            unique,
    id_data_source uuid
)
;
create table sample
(
    id uuid default uuid_generate_v1mc() not null
        constraint sample_pk
            primary key,
    sample_pos integer,
    id_batch uuid
        constraint batch_fk
            references batch
                on update cascade on delete set null
)
;
create index sample_sample_pos_index
    on sample (sample_pos)
;
create index sample_id_batch_sample_pos_index
    on sample (id_batch, sample_pos)

;
create table representation
(
    id uuid default uuid_generate_v1mc() not null
        constraint representation_pk
            primary key,
    id_data_source uuid
)
;
create table data_source
(
    id uuid default uuid_generate_v1mc() not null
        constraint data_source_pk
            primary key
)
;
alter table batch
    add constraint data_source_fk
        foreign key (id_data_source) references data_source
            on update cascade on delete set null
;
alter table representation
    add constraint data_source_fk
        foreign key (id_data_source) references data_source
            on update cascade on delete set null
;
create table sample_representation
(
    id uuid default uuid_generate_v1mc() not null
        constraint sample_representation_pk
            primary key,
    id_sample uuid
        constraint sample_fk
            references sample
                on update cascade on delete set null,
    id_representation uuid
        constraint representation_fk
            references representation
                on update cascade on delete set null
)
;
create unique index sample_representation_id_sample_id_representation_uindex
    on sample_representation (id_sample, id_representation)
;
create index sample_representation_id_sample_index
    on sample_representation (id_sample)
;
create index sample_representation_id_representation_index
    on sample_representation (id_representation)
;

After fiddling around, I found a solution. But I am still not sure why the original query really takes that much time:

SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'

Everything is indexed, but the tables are relatively big, with 1.5 million rows each in sample_representation and in sample. My guess is that the tables get joined first and only then filtered with WHERE. But even if the join produces a large intermediate result, it should not take that long?!
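Whether the join really happens before the filtering can be checked rather than guessed by asking PostgreSQL for the executed plan. A minimal diagnostic sketch, using the example UUIDs from this post as placeholders:

```sql
-- EXPLAIN with ANALYZE actually runs the query and reports the chosen plan,
-- the estimated vs. actual row counts, and the time spent in each node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a'
  AND sample.id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91';

-- If the estimated row counts are far from the actual ones, the planner
-- statistics may be stale; refreshing them is cheap and worth trying.
ANALYZE sample;
ANALYZE sample_representation;
```

A nested loop that probes an index once per sample would be consistent with a cost of roughly 10 ms per row, but only the plan output can confirm that.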

In any case, I tried to use CTEs instead of joining two "massive" tables. The idea was to filter early and join afterwards:

WITH sel_samplerepresentation AS (
  -- all rows of one representation
  SELECT *
  FROM sample_representation
  WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a'
), sel_samples AS (
  -- all samples of one batch (the column is id_batch in the schema above)
  SELECT *
  FROM sample
  WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
)
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
-- join on the sample id, as in the original query
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;

This query also takes forever. Here the reason is clear: sel_samples and sel_samplerepresentation have 50k records each, and the join happens on a non-indexed column of the CTEs.
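For context: up to PostgreSQL 11 (including 9.6), every CTE is computed on its own and acts as an optimization fence, so the indexes of the base tables cannot be used for the outer join. From PostgreSQL 12 on, a CTE can be marked NOT MATERIALIZED so that the planner may fold it into the outer query. A sketch, assuming an upgrade to version 12 or later and the column name id_batch from the posted schema:

```sql
-- Requires PostgreSQL 12+; the NOT MATERIALIZED keyword is a syntax error on 9.6.
WITH sel_samplerepresentation AS NOT MATERIALIZED (
  SELECT id, id_sample
  FROM sample_representation
  WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a'
), sel_samples AS NOT MATERIALIZED (
  SELECT id, sample_pos
  FROM sample
  WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
)
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;
```

Once the CTEs are inlined, the planner treats this much like the plain two-table join, so it may well pick the same plan as the original query; this variant removes the fence but does not by itself guarantee a faster result.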

Since there are no indices for CTEs, I reformulated them as materialized views, to which I can add indices:

CREATE MATERIALIZED VIEW sel_samplerepresentation AS (
  SELECT *
  FROM sample_representation
  WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a'
  );

CREATE MATERIALIZED VIEW sel_samples AS (
  SELECT *
  FROM sample
  -- the column is id_batch in the schema above
  WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
);

CREATE INDEX sel_samplerepresentation_sample_id_index ON sel_samplerepresentation (id_sample);
CREATE INDEX sel_samples_id_index ON sel_samples (id);

SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;

DROP MATERIALIZED VIEW sel_samplerepresentation;
DROP MATERIALIZED VIEW sel_samples;

This is more of a hack than a solution, but executing these queries takes 1 s (down from 8 min)!
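The same filter-first, index, then join idea can also be expressed with temporary tables, which are private to the session and avoid having to drop the helper objects by hand. This is a sketch of a variant, not what was measured above; the tmp_ table names are made up for illustration, and id_batch is the column name from the posted schema:

```sql
BEGIN;

-- ON COMMIT DROP removes the temporary tables when the transaction ends.
CREATE TEMPORARY TABLE tmp_sel_samplerepresentation ON COMMIT DROP AS
  SELECT id, id_sample
  FROM sample_representation
  WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a';

CREATE TEMPORARY TABLE tmp_sel_samples ON COMMIT DROP AS
  SELECT id, sample_pos
  FROM sample
  WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91';

-- Index the join columns, as was done for the materialized views.
CREATE INDEX ON tmp_sel_samplerepresentation (id_sample);
CREATE INDEX ON tmp_sel_samples (id);

-- Temporary tables are not analyzed automatically, so collect statistics.
ANALYZE tmp_sel_samplerepresentation;
ANALYZE tmp_sel_samples;

SELECT tmp_sel_samples.sample_pos, tmp_sel_samplerepresentation.id
FROM tmp_sel_samplerepresentation
JOIN tmp_sel_samples
  ON tmp_sel_samples.id = tmp_sel_samplerepresentation.id_sample;

COMMIT;
```

Because temporary tables live in a per-session schema, several batches can be processed concurrently in different sessions without name clashes, which the fixed materialized view names would not allow.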
