
Fastest way to select a lot of rows based on their ID in PostgreSQL?

I am using Postgres with libpqxx, and I have a table that we will simplify down to:

CREATE TABLE data_table (
  id   bytea PRIMARY KEY,
  size bigint
);

If I have a set of IDs in C++, e.g. std::unordered_set<ObjectId> Ids, what is the best way to get the id and size values out of data_table?

I have so far used a prepared statement: constexpr const char* preparedStatement = "SELECT size FROM data_table WHERE id = $1";

Then, in a transaction, I have called that prepared statement for every entry in the set and retrieved each result:

pqxx::work transaction(SomeExistingPqxxConnection);
std::unordered_map<ObjectId, uint32_t> result;
for (const auto& id : Ids)
{
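  // one client-server round trip (and one result set) per ID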
  auto transactionResult = transaction.exec_prepared(preparedStatement, ToPqxxBinaryString(id));
  result.emplace(id, transactionResult[0][0].as<uint32_t>());
}
return result;

Because the set can contain tens of thousands of objects and the table can contain millions, this can take quite some time to process, and I don't think it is a particularly efficient use of Postgres.

I am pretty much brand new to SQL, so I don't really know if what I am doing is the right way to go about this, or if there is a much more efficient way.

Edit: For what it's worth, the ObjectId class is basically a type wrapper over std::array<uint8_t, 32>, i.e. a 256-bit cryptographic hash.

The task as I understand it:

Get id (PK) and size (bigint) for "tens of thousands of objects" from a table with millions of rows and presumably several more columns ("simplified down").

The fastest way of retrieval is index-only scans. The cheapest way to get that in your particular case would be a "covering index" for your query, "including" the size column in the PK index like this (requires Postgres 11 or later):

CREATE TABLE data_table (
  id   bytea
, size bigint
, PRIMARY KEY (id) INCLUDE (size)  -- !
);
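
If data_table already exists, there is no need to recreate it: an additional covering unique index yields the same index-only scans. A sketch (the index name is made up):

// CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so use
// libpqxx's autocommit wrapper (or plain psql) for this one-off DDL.
pqxx::nontransaction ddl(SomeExistingPqxxConnection);
ddl.exec("CREATE UNIQUE INDEX CONCURRENTLY data_table_id_size_idx "
         "ON data_table (id) INCLUDE (size)");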

A covering index allows Postgres to answer the query from the index alone, without visiting the table's heap pages (as long as the visibility map is reasonably up to date).

Then retrieve all rows in a single query (or a few queries) for many IDs at once, like:

SELECT id, size
FROM   data_table
JOIN  (
    VALUES ('\x1234'::bytea), ('\xabcd')  -- many more
    ) t(id) USING (id);

The explicit cast on the first row makes the joined column bytea; without it, the string literals would default to text, and there is no text = bytea operator.
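
For what it's worth, here is a minimal client-side sketch of building that query with libpqxx. Hex-encoding the IDs avoids any escaping concerns, since bytea accepts '\x…' literals; ToHex, FetchSizes and the container choice are my own illustration, not libpqxx API:

#include <pqxx/pqxx>

#include <array>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hex-encode the 32 raw bytes of an ObjectId for use in a bytea literal.
std::string ToHex(const std::array<std::uint8_t, 32>& bytes)
{
  static const char digits[] = "0123456789abcdef";
  std::string out;
  out.reserve(64);
  for (std::uint8_t b : bytes)
  {
    out += digits[b >> 4];
    out += digits[b & 0x0f];
  }
  return out;
}

// Fetch all sizes in one round trip; results are keyed by lowercase hex.
std::unordered_map<std::string, std::uint64_t>
FetchSizes(pqxx::work& tx, const std::vector<std::array<std::uint8_t, 32>>& ids)
{
  if (ids.empty())
    return {};

  std::string query = "SELECT id, size FROM data_table JOIN (VALUES ";
  bool first = true;
  for (const auto& id : ids)
  {
    if (!first)
      query += ", ";
    query += "('\\x" + ToHex(id) + "'";
    if (first)
      query += "::bytea";  // the first row fixes the column type
    query += ")";
    first = false;
  }
  query += ") t(id) USING (id)";

  std::unordered_map<std::string, std::uint64_t> result;
  for (const auto& row : tx.exec(query))
    // bytea comes back as "\x…" text by default (bytea_output = 'hex')
    result.emplace(row[0].as<std::string>().substr(2),
                   row[1].as<std::uint64_t>());
  return result;
}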

Or use one of the other methods of passing a set of values to a query, such as id = ANY() with an array parameter, as sketched below.
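
A sketch of that variant, reusing ToHex from the sketch above (again my names, not libpqxx API): the whole set travels as one bytea[] parameter, so the SQL text never changes and lends itself to a prepared statement.

// Build a Postgres array literal: {"\\x<hex>","\\x<hex>",...}.
// Inside a quoted array element, \\ collapses to \, so each element
// parses as a '\x…' bytea value.
std::string arrayLit = "{";
for (const auto& id : ids)
{
  if (arrayLit.size() > 1)
    arrayLit += ',';
  arrayLit += "\"\\\\x" + ToHex(id) + "\"";
}
arrayLit += "}";

auto rows = tx.exec_params(
    "SELECT id, size FROM data_table WHERE id = ANY($1::bytea[])",
    arrayLit);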

Or create a temporary table and join to it.
But do not "insert all those IDs one by one into it". Use the much faster COPY (or the meta-command \copy in psql) to fill the temp table, as sketched below.
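
A sketch of that approach, assuming libpqxx 7's pqxx::stream_to, which issues COPY under the hood. Streaming the IDs as hex text and decoding server-side keeps binary escaping out of the picture; tmp_ids and ToHex are again my names:

// One COPY fills the temp table instead of tens of thousands of INSERTs.
tx.exec("CREATE TEMP TABLE tmp_ids (id_hex text) ON COMMIT DROP");

auto stream = pqxx::stream_to::table(tx, {"tmp_ids"});
for (const auto& id : ids)
  stream << std::make_tuple(ToHex(id));  // one row per ID
stream.complete();

auto rows = tx.exec(
    "SELECT d.id, d.size "
    "FROM   data_table d "
    "JOIN   tmp_ids t ON d.id = decode(t.id_hex, 'hex')");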

And you do not need an index on the temporary table, as it will be read in a sequential scan anyway. You only need the covering PK index I outlined.

You may want to ANALYZE the temporary table after filling it, to give Postgres some column statistics to work with. But as long as you get the index-only scans I am aiming for, you can skip that, too. The query plan won't get any better than that.
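
If you want to verify, run EXPLAIN from the client and look for "Index Only Scan" in the output; a throwaway sketch:

// The plan should show something like
// "Index Only Scan using data_table_pkey on data_table".
for (const auto& row : tx.exec(
       "EXPLAIN SELECT id, size FROM data_table WHERE id = '\\x00'::bytea"))
  std::cout << row[0].c_str() << '\n';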

The id is the primary key and therefore indexed, so my first concern would be query setup time; a stored procedure is precompiled, for instance. A second tack is to put your set of IDs into a temp table, possibly also keyed on id, so the two tables can be joined in a single SELECT. The indexes for this should be ordered (B-tree, not hash) so the join can be done as a merge join.
