简体   繁体   中英

Should we use sequences or identities for our primary keys?

We are creating a new database with 20+ tables, and our database supports:

  • sequences.
  • identity columns (generated always as identity/serial).

So, the question is: should we use sequences or identities? Which one is better? The team seems to be divided on this one, so I wanted to hear pros and cons , to help decide.

Adding database details:

  • We are creating the new database on IBM DB2, but we need to make sure it will be compatible with future plans of migration to PostgreSQL.

Your question is about using sequences versus identity ("generated always as identity" columns, presumably). In Postgres, these would be declared as serial . These would always be some sort of number in a single column.

From the database performance perspective, there is not much difference between the two. One important difference is that some databases will cache identity columns, which can speed inserts but cause gaps. The rules for caching sequences might be different. In a high transaction environment, inadequate caching can be a performance bottleneck. Sharing a single sequence across multiple tables makes this problem worse.

There is a bigger difference from a data management perspective. A sequence requires managing two objects (the table and the sequence). An identity or serial column is built into the table.

For a single table, I have only considered using sequences in databases that do not support built-in auto-increment/serial/identity columns (ahem, "Oracle"). Otherwise, I would use the mechanism designed to work with tables.

I do want to point out that using an auto-incremented surrogate key has other benefits. This should also be the key used for clustering the data, if such a concept exists in the database. New inserts are then always at the "end" (although if you are deleting data, then pages might only be partially used). The primary key should also be the only key used for foreign key references, even if other columns -- in isolation or together -- are unique and candidate primary keys.

The best answer is to point you back to your situation.

First, many people prefer sequences, as they are easy to generate and provide a single data type to navigate your joins. Additionally many shops require single column primary keys to assist further in code complexity.

Let's talk about the downsides:

Sequences: When using b-tree indexes, sequences are generally inserted in ascending order, which can result in an "unbalanced tree" and cause less than perfect performance (on b-tree indexes) over time. Sometimes, people instead generate hashes or GUIDs to result in a more balanced tree.

Sequences can result in "hard to read" code when using "lookup tables", especially when values are hard coded in your database. Example: "where status_seq=1" is harder to read than "where status_id='ACTIVE'".

Downsides of using IDs: Mixed data types can cause confusion. Sometimes they're numeric, sometimes they're varchar or char. Many ORMs can confuse those and leave off leading zeros causing errors in your results. IE 01234 != 1234, but your ORM may return 1234 instead of 01234.

Many people store ID's in human readable form, like "VALID", or state abbreviations. This can cause headaches in the long run, so even if you do use IDs on a table, you may want to steer clear from ever showing those IDs directly to your customer.

ID fields are much more likely to "need to change" in the future, than a sequence. Example: Let's say you have a country code table and a revolution takes place and a country code changes. Do you really want to go through the main table and all the foreign keys that reference it, putting in the new country code-- or living with the old country code, cause that's your choice. If you use a sequence in that case, you simply update other non-key columns in the base table and you're good to go.

Benefits:

Benefits of Sequences: Sequences are by nature automatically generated. IDs aren't always. When adding records, do you really want a programmer or user naming an ID that cannot be easily changed? When you use sequences, there's rarely a need to renumber things, and the underlying human-readable data can be easily changed if a mistake is made.

As mentioned above, they're always a numeric datatype, and if used properly can assist in "navigating" you app (IE, usually only having to "pass around" one number to navigate your table structure)

When using an communicating between the DB and your programming language, you can count on being able to convert integers to integers without any weird data conversion issues.

IDs: Primary benefit is code that's easier to read which we already explained above.

In summary, I think it's on a case, by case basis, depending on table and column usage. If you're going to use IDs, avoid the temptation to show the value to the user. If the table's not going to change and simply holds flags, or "enum" type data, then IDs can certainly help with code readability. Otherwise, sequences are often the better choice for maintainability of your data.

Some people choose GUIDs or IDs to help with index performance, but personally, if there's any loss in code readability or the code gets more complex, I'd spend some money on better hardware before I'd write more complex code-- as the benefit is miniscule.

Source: Oracle certified DBA (training on this exact subject), and 20+ years of experience working with developers and enterprise databases.

I'm a fan of sequences. I like it if all the IDs are the same type, and all the IDs come from the same sequence. It's not necessary, just something that lets you tease out the order in which things occur...which is often not so much a technical requirement, but a debugging aid. I tend to favor bigint to be my key type, so I'm pretty much guaranteed to never run out of IDs. If you're using int keys (or smaller), you'd want to use one sequence per table.

Having said that, there are issues to watch out for with sequences. For example, it's possible to "burn" sequences without actually putting them in data. Again, this may or may not be a problem. Generally, I haven't had to care.

Sequences are typically implemented by making a default constraint on the ID column of a table. This means a couple of things to watch out for. It's possible that a value for the column is actually provided on the insert...which doesn't 'bump' your sequence, and may collide with future inserts that do not provide a value. This to me, is the most significant concern. If all your IDs are provided by defaulting, this is a non-issue.

Procedures (and remote clients) can reserve or "burn" sequences. This is extremely convenient...lets your procedure code know in advance what the IDs are without having to commit them to data. You can always do something like:

insert someTempTable( Id, Name )
select
  next value for dbo.MySequence,
  Name
from
  dbo.SomeTable

...which burns sequence values, but the nice thing is, when I go to insert my rows from my work table someTempTable into the real table, I can rest assured that the IDs aren't going to conflict. This is simpler than with identity column-based IDs. I can build a whole series of related data in temp, and then move it all into persistent storage set-wise. This is usually a lot more efficient.

I've not used sequences, but I can discuss identity fields.

First they work quite nicely in every case where I have used them for the last 18 years of using SQL Server. This is most likely true on other databases as well as this is a critical feature for the databases that use them. We have never had any problems surrounding the use of identities. You might want to define the identity as big int when you set it up if you are expecting to have a very large database.

If you don't set up an identity at the time of table creation, it is a pain to set it up later in SQl Server, check your databases for details there. However, if you are using autogenerated keys exclusively as the PKs, you would do this at the time of table creation.

A critical thing when using identities (or sequences or GUIDs for that matter) is that in addition to the auto generated value, you need to create a unique index for the natural key(s) in your table if you have them. This will prevent data integrity problems.

Other problems can be if you have an issue with numbers being skipped on rollback. Since these are meant to be placeholders, they should not have meaning, so it may not be a problem, but I have seen cases where people needed this functionality for business reasons not technical reasons. Test both with rollbacks to see if you have gaps if you need them to not have gaps. If both have gaps, then you will need to roll your own system watching out for race conditions.

Since you say you are creating a database in DB2 to migrate to Postgres, I would set up a test with a couple of tables with identities in db2 and a couple of tables with sequences. Insert a large amount of fake data into them. Then I would test how difficult it is to port them to the Postgres database and start adding records. This may be a key piece of data in which method is better in your particular case.

You might also consider doing testing concerning performance by inserting a very large number of records to two test tables that are alike except in the way that they assign the Id. It may be that performance is acceptable both ways, it may be that one is faster than the other. the following link is for SQL Server, but the test methodology is probably something you could find useful in making your determinations. http://dba-presents.com/index.php/sql-server/25-identity-vs-sequence-performance-test

It is critical to do your own determinations of things like performance if that is a critical issue because results can be affected by your own particular set-up.

If you want a meaningful ID based on some text values and an incrementing number(such as CA1,CA2, CA3, TX1, TX2, TX3), then an identity will not work but a sequence could I think (see this article: PostgreSQL sequence based on another column ). So sequences can give you more flexibility but if you don't need it, then why bother?

Probably I would also consider that it would be most confusing for maintenance (and in your case conversion) to sometimes use one and sometimes the other. Consistency in how you do things may be the key. If you have one case where sequences give you a flexibility that you must have that identities do not. I would use sequences throughout just to avoid unneeded complexities of knowing which table used what when you do the conversion.

Db2 IDENTITY columns are backed by sequences (which support caching and out of order generation for higher performance) -- the difference is purely syntactic. With an identity column:

create table t1 (
  id integer not null generated always as identity cache 100,
  foobar varchar(111)
)

you do not provide the value for that column, it is generated and inserted automatically:

insert into blah (foobar) values ('something')

If the column is not defined as IDENTITY you must explicitly create a sequence and generate values when inserting rows

create table t2 (
  id integer not null,
  foobar varchar(111)
)

create sequence myseq cache 100

insert into t2 values (next value for myseq, 'something else')

Similarly, when you need to redefine identity properties, you do that via the ALTER TABLE statement; for sequences you use ALTER SEQUENCE .

Only one column in a table can be defined as IDENTITY ; if you need more than one such column you will have to use sequences for them.

Special treatment is necessary when mass-loading data using the LOAD or IMPORT utilities into tables with identity columns -- you will need to either override or ignore identity values that may be present.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM