

Best approach for querying strings in a very large database?

In a Postgres database I'm running a query that searches for a string (a sha256 hash) in a table with approx. 5*10^8 rows. This query can take up to 9 seconds, which sounds OK for a single data point; however, I need to run this join query 10^9 times (once for every item in another table). The column that contains the sha256 hash is indexed, and I don't have any additional information (an id or timestamp) that I could use to search for just part of the string plus that id.

My current setup is to call this slow query from a Python daemon (using psycopg2), send it the id from the 10^9-row table, and print out the execution time every 100 executions. I tried committing every few queries, which didn't make a measurable difference; autocommit is off by default.
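A minimal sketch of that reporting loop, with the actual psycopg2 `cursor.execute()` call replaced by a placeholder (the real query and connection details aren't shown in the question):

```python
import time

def run_batched(items, execute_one, report_every=100):
    """Run execute_one(item) for each item, printing the average
    latency every `report_every` calls. In the real daemon,
    execute_one would wrap the psycopg2 cursor.execute() call."""
    timings = []
    start = time.perf_counter()
    for n, item in enumerate(items, 1):
        execute_one(item)
        if n % report_every == 0:
            elapsed = time.perf_counter() - start
            timings.append(elapsed / report_every)
            print(f"{n}: avg {timings[-1]:.6f}s per query")
            start = time.perf_counter()
    return timings

# Toy usage: a no-op stands in for the real query.
stats = run_batched(range(250), lambda _id: None)
```

With 250 items and reports every 100 calls, this prints two progress lines; the per-query average is what you would watch drift as the run progresses.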

Am I missing something that could make this query run faster, or would it be a better choice to dump my database into something like elasticsearch and then do the string searches there?

EDIT: EXPLAIN output of the slow query:

EXPLAIN UPDATE txout
SET fk_tx_id = txid.tx_id
FROM
(
 SELECT tx.tx_id, txout.tx_hash
 FROM tx tx
 INNER JOIN txout
 ON tx.tx_hash = txout.tx_hash
 WHERE txout.fk_block_id = 398361
) AS txid
WHERE txout.tx_hash = txid.tx_hash
AND txout.fk_block_id = 398361;
Update on txout  (cost=149874.29..323547.14 rows=5 width=345)
  ->  Nested Loop  (cost=149874.29..323547.14 rows=5 width=345)
        ->  Merge Join  (cost=149873.60..150727.71 rows=19864 width=400)
              Merge Cond: (txout.tx_hash = txout_1.tx_hash)
              ->  Sort  (cost=77894.30..78025.39 rows=52438 width=329)
                    Sort Key: txout.tx_hash
                    ->  Index Scan using idx_txout_fk_block_id on txout  (cost=0.58..65716.10 rows=52438 width=329)
                          Index Cond: (fk_block_id = 398361)
              ->  Materialize  (cost=71979.30..72241.49 rows=52438 width=71)
                    ->  Sort  (cost=71979.30..72110.39 rows=52438 width=71)
                          Sort Key: txout_1.tx_hash
                          ->  Index Scan using idx_txout_fk_block_id on txout txout_1  (cost=0.58..65716.10 rows=52438 width=71)
                                Index Cond: (fk_block_id = 398361)
        ->  Index Scan using idx_tx_hash on tx  (cost=0.70..8.69 rows=1 width=75)
              Index Cond: (tx_hash = txout_1.tx_hash)

It seems you're trying to set up a foreign key from one table to another over a string field. Am I correct?

Postgresql solution: If that's the case, building an explicit foreign key (and the associated index) in postgresql would be the first solution to try, although with hundreds of millions of rows on one side and billions on the other, you'll certainly need a fairly strong setup underlying your postgresql database to build the index. Once that's done, however, querying should be reasonably fast.
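A hedged sketch of what that could look like in Postgres DDL, using the column names from the query in the question (the constraint and index names here are made up):

```sql
-- Build the supporting index first; CONCURRENTLY avoids taking a
-- long table lock while it builds.
CREATE INDEX CONCURRENTLY idx_txout_fk_tx_id ON txout (fk_tx_id);

-- Declare the foreign key; NOT VALID skips checking existing rows
-- up front, and VALIDATE then checks them with a weaker lock.
ALTER TABLE txout
    ADD CONSTRAINT fk_txout_tx
    FOREIGN KEY (fk_tx_id) REFERENCES tx (tx_id) NOT VALID;
ALTER TABLE txout VALIDATE CONSTRAINT fk_txout_tx;
```

Note this keys on the integer `tx_id` rather than the 64-character hash string, which is also why the join gets cheaper once `fk_tx_id` is populated.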

elasticsearch solution: To answer your more global question, using something like elasticsearch changes the problem completely, because it uses inverted indexes to query strings very efficiently, and it is based on a distributed system where data is sharded across several nodes (i.e. several machines). Therefore, provided you have many instances in your elasticsearch cluster, you can speed up text searches significantly by splitting the search among the different shards (which parallelizes it) and using the pre-computed inverted index. Nevertheless, setting up an elasticsearch cluster is a commitment, and the ingestion/indexing of billions of records won't be fast either.
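The inverted-index idea can be sketched in a few lines of plain Python: map each term (here, a hash string) to the set of row ids containing it, so a lookup is a single hash-table probe instead of a scan. The toy data below is made up for illustration:

```python
from collections import defaultdict

# Toy table: row id -> hash value stored in that row.
rows = {1: "ab12", 2: "cd34", 3: "ab12"}

# Build the inverted index: hash value -> set of row ids.
inverted = defaultdict(set)
for row_id, h in rows.items():
    inverted[h].add(row_id)

# A lookup is now O(1) in the number of rows.
print(sorted(inverted["ab12"]))  # → [1, 3]
```

Elasticsearch adds tokenization, compression, and sharding on top of this, but for exact-match lookups on a hash the principle is the same as a plain hash index.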

Divide and conquer: Another direction you can go is to perform the join locally on your computer, possibly splitting the full tables based on the first character of your hashes so you can parallelize the join with one job per first character. Also, sorting and pre-indexing both tables, in postgresql and in memory, can speed up such joins significantly.
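A rough sketch of that local partitioned join, on toy data (the table contents and payloads are invented; in practice each bucket could be handed to a separate worker process):

```python
from collections import defaultdict

tx = {"a1": 101, "b2": 102, "a3": 103}                    # hash -> tx_id
txout = [("a1", "out1"), ("b2", "out2"), ("a1", "out3")]  # (hash, payload)

def bucket_by_first_char(pairs):
    """Partition (hash, value) pairs by the hash's first character."""
    buckets = defaultdict(list)
    for h, v in pairs:
        buckets[h[0]].append((h, v))
    return buckets

tx_buckets = bucket_by_first_char(tx.items())
out_buckets = bucket_by_first_char(txout)

# Join each bucket independently against a per-bucket hash index.
joined = []
for prefix, outs in out_buckets.items():
    index = dict(tx_buckets.get(prefix, []))
    for h, payload in outs:
        if h in index:
            joined.append((payload, index[h]))

print(sorted(joined))  # → [('out1', 101), ('out2', 102), ('out3', 101)]
```

Because sha256 output is uniformly distributed, first-character (or first-byte) bucketing splits both tables into nearly equal partitions, which is what makes the parallelization even.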

It is hard to provide more guidance without more details of what you're trying to do.

You have a 3-way join, and it is not clear what that is supposed to accomplish. Why not just:

EXPLAIN (ANALYZE, BUFFERS) UPDATE txout
SET fk_tx_id = tx.tx_id
FROM tx
WHERE txout.fk_block_id = 398361
  AND txout.tx_hash = tx.tx_hash;

Also, there is not much point in executing it 5*10^8 times if there are not that many different values of fk_block_id. You would just be updating the same rows over and over, setting them to the same values.

I think your query can be simplified to:

UPDATE txout
    SET fk_tx_id = tx.tx_id
    FROM tx
    WHERE tx.tx_hash = txout.tx_hash AND
          txout.fk_block_id = 398361;

For this query, you want indexes on txout(fk_block_id, tx_hash) and tx(tx_hash).
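As a sketch, those indexes could be created as follows; the plan in the question suggests tx(tx_hash) already exists as idx_tx_hash, so the composite index is the only addition (its name here is made up):

```sql
-- Composite index: lets Postgres find the block's rows and read
-- tx_hash from the index in one pass.
CREATE INDEX idx_txout_block_hash ON txout (fk_block_id, tx_hash);

-- Already present per the plan above (idx_tx_hash on tx).
-- CREATE INDEX idx_tx_hash ON tx (tx_hash);
```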
