
Fastest way to select a lot of rows based on their ID in PostgreSQL?

I am using Postgres with libpqxx, and I have a table that we will simplify down to:

CREATE TABLE data_table (
  id   bytea PRIMARY KEY,
  size bigint
);

If I have a set of IDs in C++, e.g. std::unordered_set<ObjectId> Ids, what is the best way to get the id and size values out of data_table?

I have so far used a prepared statement: constexpr const char* preparedStatement = "SELECT size FROM data_table WHERE id = $1";

Then, in a transaction, I have called that prepared statement for every entry in the set and retrieved each result:

pqxx::work transaction(SomeExistingPqxxConnection);
std::unordered_map<ObjectId, uint32_t> result;
for (const auto& id : Ids)
{
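  // one client-server round trip (and one result set) per ID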
  auto transactionResult = transaction.exec_prepared(preparedStatement, ToPqxxBinaryString(id));
  result.emplace(id, transactionResult[0][0].as<uint32_t>());
}
return result;

Because the set can contain tens of thousands of objects and the table can contain millions, this can take quite some time to process, and I don't think it is a particularly efficient use of Postgres.

I am pretty much brand new to SQL, so I don't really know if what I am doing is the right way to go about this, or if there is a much more efficient way.

Edit: For what it's worth, the ObjectId class is basically a type wrapper over std::array<uint8_t, 32>, i.e. a 256-bit cryptographic hash.

The task as I understand it:

Get id (PK) and size (bigint) for "tens of thousands of objects" from a table with millions of rows and presumably several more columns ("simplified down").

The fastest way of retrieval is index-only scans. The cheapest way to get that in your particular case would be a "covering index" for your query, "including" the size column in the PK index like this (requires Postgres 11 or later):

CREATE TABLE data_table (
  id   bytea
, size bigint
, PRIMARY KEY (id) INCLUDE (size)  -- !
);
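
If data_table already exists, there is no need to recreate it: an additional covering unique index yields the same index-only scans. A sketch (the index name is made up):

// CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so use
// libpqxx's autocommit wrapper (or plain psql) for this one-off DDL.
pqxx::nontransaction ddl(SomeExistingPqxxConnection);
ddl.exec("CREATE UNIQUE INDEX CONCURRENTLY data_table_id_size_idx "
         "ON data_table (id) INCLUDE (size)");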

A covering index allows Postgres to answer the query from the index alone, without visiting the table's heap pages (as long as the visibility map is reasonably up to date).

Then retrieve all rows in a single query (or a few queries) for many IDs at once, like:

SELECT id, size
FROM   data_table
JOIN  (
    VALUES ('\x1234'::bytea), ('\xabcd')  -- many more
    ) t(id) USING (id);

The explicit cast on the first row makes the joined column bytea; without it, the string literals would default to text, and there is no text = bytea operator.
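
For what it's worth, here is a minimal client-side sketch of building that query with libpqxx. Hex-encoding the IDs avoids any escaping concerns, since bytea accepts '\x…' literals; ToHex, FetchSizes and the container choice are my own illustration, not libpqxx API:

#include <pqxx/pqxx>

#include <array>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hex-encode the 32 raw bytes of an ObjectId for use in a bytea literal.
std::string ToHex(const std::array<std::uint8_t, 32>& bytes)
{
  static const char digits[] = "0123456789abcdef";
  std::string out;
  out.reserve(64);
  for (std::uint8_t b : bytes)
  {
    out += digits[b >> 4];
    out += digits[b & 0x0f];
  }
  return out;
}

// Fetch all sizes in one round trip; results are keyed by lowercase hex.
std::unordered_map<std::string, std::uint64_t>
FetchSizes(pqxx::work& tx, const std::vector<std::array<std::uint8_t, 32>>& ids)
{
  if (ids.empty())
    return {};

  std::string query = "SELECT id, size FROM data_table JOIN (VALUES ";
  bool first = true;
  for (const auto& id : ids)
  {
    if (!first)
      query += ", ";
    query += "('\\x" + ToHex(id) + "'";
    if (first)
      query += "::bytea";  // the first row fixes the column type
    query += ")";
    first = false;
  }
  query += ") t(id) USING (id)";

  std::unordered_map<std::string, std::uint64_t> result;
  for (const auto& row : tx.exec(query))
    // bytea comes back as "\x…" text by default (bytea_output = 'hex')
    result.emplace(row[0].as<std::string>().substr(2),
                   row[1].as<std::uint64_t>());
  return result;
}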

Or use one of the other methods of passing a set of values to a query, such as id = ANY() with an array parameter, as sketched below.
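
A sketch of that variant, reusing ToHex from the sketch above (again my names, not libpqxx API): the whole set travels as one bytea[] parameter, so the SQL text never changes and lends itself to a prepared statement.

// Build a Postgres array literal: {"\\x<hex>","\\x<hex>",...}.
// Inside a quoted array element, \\ collapses to \, so each element
// parses as a '\x…' bytea value.
std::string arrayLit = "{";
for (const auto& id : ids)
{
  if (arrayLit.size() > 1)
    arrayLit += ',';
  arrayLit += "\"\\\\x" + ToHex(id) + "\"";
}
arrayLit += "}";

auto rows = tx.exec_params(
    "SELECT id, size FROM data_table WHERE id = ANY($1::bytea[])",
    arrayLit);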

Or create a temporary table and join to it.
But do not "insert all those IDs one by one into it". Use the much faster COPY (or the meta-command \copy in psql) to fill the temp table, as sketched below.
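
A sketch of that approach, assuming libpqxx 7's pqxx::stream_to, which issues COPY under the hood. Streaming the IDs as hex text and decoding server-side keeps binary escaping out of the picture; tmp_ids and ToHex are again my names:

// One COPY fills the temp table instead of tens of thousands of INSERTs.
tx.exec("CREATE TEMP TABLE tmp_ids (id_hex text) ON COMMIT DROP");

auto stream = pqxx::stream_to::table(tx, {"tmp_ids"});
for (const auto& id : ids)
  stream << std::make_tuple(ToHex(id));  // one row per ID
stream.complete();

auto rows = tx.exec(
    "SELECT d.id, d.size "
    "FROM   data_table d "
    "JOIN   tmp_ids t ON d.id = decode(t.id_hex, 'hex')");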

And you do not need an index on the temporary table, as it will be read in a sequential scan anyway. You only need the covering PK index I outlined.

You may want to ANALYZE the temporary table after filling it, to give Postgres some column statistics to work with. But as long as you get the index-only scans I am aiming for, you can skip that, too. The query plan won't get any better than that.
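
If you want to verify, run EXPLAIN from the client and look for "Index Only Scan" in the output; a throwaway sketch:

// The plan should show something like
// "Index Only Scan using data_table_pkey on data_table".
for (const auto& row : tx.exec(
       "EXPLAIN SELECT id, size FROM data_table WHERE id = '\\x00'::bytea"))
  std::cout << row[0].c_str() << '\n';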

The id is the primary key and therefore indexed, so my first concern would be query setup time; a stored procedure is precompiled, for instance. A second tack is to put your set of IDs into a temp table, possibly also keyed on id, so the two tables can be joined in a single SELECT. The indexes for this should be ordered (B-tree, not hash) so the join can be done as a merge join.
