
How does postgres implement a sequential scan?

I understand that when the majority of a table is estimated to be required in the result set for a given query, a sequential scan may be preferred over using an index.

What I'm curious about is how postgres actually reads the pages into memory?

Does it organise them into some kind of ad-hoc in memory index whilst it reads them?

What if the table's too large to fit into memory?

Are there any high level papers on the topic?

(I've done some searching, but the results are full of blog posts explaining the basics of indexing, not the implementation details of a sequential scan. I expect it's not as straightforward as reading the pages into an array when evaluating a join condition over most of a table.)

What I'm curious about is how postgres actually reads the pages into memory?

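The engine reads every block of the heap in whatever order the blocks are physically stored, not in any key order, and skips row versions that are not visible to the query (for example, rows marked as deleted). Hot blocks (already present in the cache) are much faster to process.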

Does it organise them into some kind of ad-hoc in memory index whilst it reads them?

No, a sequential scan avoids indexes and reads the heap directly using buffering and the cache.

What if the table's too large to fit into memory?

A sequential scan is pipelined. This means blocks are read as needed: the engine does not need to have the whole heap in memory before it starts processing it. It reads a few blocks, processes them, and discards them; then it repeats this until it has read all the blocks of the heap.
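To make the pipelining idea concrete, here is a minimal Python sketch. It is not how postgres is actually implemented: the heap is modelled as a plain list, a deleted row is modelled as `None`, and the names `read_blocks`, `seq_scan` and `BLOCK_SIZE` are made up for the illustration. The point is only that one block is held at a time and rows are handed to the caller as soon as they are found.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Row = Tuple[int, str]
BLOCK_SIZE = 2  # rows per block; a hypothetical value chosen for the example


def read_blocks(heap: List[Optional[Row]],
                block_size: int = BLOCK_SIZE) -> Iterator[List[Optional[Row]]]:
    """Yield the heap one block at a time, in physical (list) order."""
    for start in range(0, len(heap), block_size):
        yield heap[start:start + block_size]


def seq_scan(heap: List[Optional[Row]],
             predicate: Callable[[Row], bool],
             block_size: int = BLOCK_SIZE) -> Iterator[Row]:
    """Stream matching rows; only the current block is held at any time."""
    for block in read_blocks(heap, block_size):
        for row in block:
            if row is None:        # assumption: None stands for a deleted row
                continue
            if predicate(row):
                yield row          # handed to the caller before the next block is read
        # the block goes out of scope here, i.e. it is "discarded"


heap = [(1, "a"), None, (2, "b"), (3, "c"), (4, "d")]
even_rows = list(seq_scan(heap, lambda r: r[0] % 2 == 0))
```

Because `seq_scan` is a generator, a consumer that stops early (say, after the first match) never causes the remaining blocks to be read, and the table size has no bearing on how much memory the scan itself needs.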

Are there any high level papers on the topic?

There should be, but in any case any good book on query optimization will describe this process in detail.

EDIT For Your Second Question:

What I guess I mean is if you're joining on some random column X, does it have to iterate through each possible row multiple times to find the correct row for each value in the other table, or does it do something more advanced than that?

Well, when you join two (or more) tables, the engine's query planner produces a plan that includes a "Nested Loop", a "Hash Join", or a "Merge Join" operator. There are more operators but these are the common ones.

  • The Nested Loop Join takes each row from the first (outer) table and retrieves the rows of the linked (inner) table that match it. That lookup can be an index seek or scan on the inner table (ideal) or a full table scan (not ideal).

  • The Hash Join builds a hash table on the secondary table first (incurring a high startup cost) and then joins fast by probing it with each row of the other table.

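  • The Merge Join sorts both tables by the join key (assuming an equi-join, and unless the inputs already arrive sorted, for example from an index), again incurring a heavy startup cost, and then joins fast by walking both sorted inputs in step (like a zipper).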
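The three strategies above can be sketched in a few lines of Python. This is only an illustration over in-memory lists, not postgres code; the function names, the `key_*` parameters and the sample `orders` / `customers` data are all invented for the example. The nested loop here uses the "not ideal" full scan of the inner table, since no index is modelled.

```python
from collections import defaultdict


def nested_loop_join(outer, inner, key_outer, key_inner):
    """For each outer row, rescan the whole inner input (no index modelled)."""
    for o in outer:
        for i in inner:
            if key_outer(o) == key_inner(i):
                yield (o, i)


def hash_join(outer, inner, key_outer, key_inner):
    """Build a hash table on the inner input first, then probe it per outer row."""
    table = defaultdict(list)
    for i in inner:                       # build phase: the startup cost
        table[key_inner(i)].append(i)
    for o in outer:                       # probe phase: one lookup per outer row
        for i in table.get(key_outer(o), []):
            yield (o, i)


def merge_join(left, right, key_left, key_right):
    """Sort both inputs on the join key, then advance through them in step."""
    left = sorted(left, key=key_left)     # sorting is the startup cost
    right = sorted(right, key=key_right)
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key_left(left[i]), key_right(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # find the run of right rows sharing this key
            j_end = j
            while j_end < len(right) and key_right(right[j_end]) == kl:
                j_end += 1
            # pair every left row with this key against that run
            while i < len(left) and key_left(left[i]) == kl:
                for r in right[j:j_end]:
                    yield (left[i], r)
                i += 1
            j = j_end


orders = [(1, "x"), (2, "y"), (2, "z"), (4, "w")]
customers = [(2, "bob"), (1, "al"), (3, "cy"), (2, "bea")]
first = lambda row: row[0]                # join on the first column of each tuple

nl = sorted(nested_loop_join(orders, customers, first, first))
hj = sorted(hash_join(orders, customers, first, first))
mj = sorted(merge_join(orders, customers, first, first))
```

All three produce the same set of matched pairs; they differ in how much work is done up front and how many times each input is visited, which is what the planner weighs when it picks one.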
