
Should we use sequences or identities for our primary keys?

We are creating a new database with 20+ tables, and our database supports:

  • sequences.
  • identity columns (generated always as identity/serial).

So, the question is: should we use sequences or identities? Which one is better? The team seems to be divided on this one, so I wanted to hear the pros and cons to help decide.

Adding database details:

  • We are creating the new database on IBM DB2, but we need to make sure it will be compatible with a planned future migration to PostgreSQL.

Your question is about using sequences versus identity ("generated always as identity" columns, presumably). In Postgres, these would traditionally be declared as serial (PostgreSQL 10 and later also accept the standard "generated ... as identity" syntax). These would always be some sort of number in a single column.

From the database performance perspective, there is not much difference between the two. One important difference is that some databases will cache identity values, which can speed up inserts but cause gaps. The rules for caching sequences might be different. In a high-transaction environment, inadequate caching can be a performance bottleneck. Sharing a single sequence across multiple tables makes this problem worse.

There is a bigger difference from a data management perspective. A sequence requires managing two objects (the table and the sequence). An identity or serial column is built into the table.
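To make the "two objects versus one" point concrete, here is a minimal sketch in PostgreSQL syntax (version 10 or later assumed). The table and sequence names are made up for illustration.

```sql
-- Identity column: the generator is part of the table definition.
CREATE TABLE orders_identity (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    note text
);

-- Explicit sequence: a second object that has to be created,
-- granted, migrated and dropped alongside the table.
CREATE SEQUENCE orders_id_seq;

CREATE TABLE orders_sequence (
    id   bigint PRIMARY KEY DEFAULT nextval('orders_id_seq'),
    note text
);

-- Inserts look the same in both cases: the id is left out.
INSERT INTO orders_identity (note) VALUES ('generated by identity');
INSERT INTO orders_sequence (note) VALUES ('generated by sequence');
```

Dropping orders_sequence leaves orders_id_seq behind unless it is dropped separately (or tied to the column with ALTER SEQUENCE ... OWNED BY), which is the extra bookkeeping referred to above.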

For a single table, I have only considered using sequences in databases that do not support built-in auto-increment/serial/identity columns (ahem, "Oracle"). Otherwise, I would use the mechanism designed to work with tables.

I do want to point out that using an auto-incremented surrogate key has other benefits. This should also be the key used for clustering the data, if such a concept exists in the database. New inserts are then always at the "end" (although if you are deleting data, pages might end up only partially used). The primary key should also be the only key used for foreign key references, even if other columns, in isolation or together, are unique and candidate primary keys.

The best answer is to point you back to your situation.

First, many people prefer sequences, as they are easy to generate and provide a single data type to navigate your joins. Additionally, many shops require single-column primary keys to help reduce code complexity.

Let's talk about the downsides:

Sequences: With B-tree indexes, sequence values are generally inserted in ascending order, so every insert lands on the rightmost leaf of the index. The tree itself stays balanced, but that "hot" right-hand edge can cause contention and less-than-perfect performance (on B-tree indexes) over time. Sometimes people generate hashes or GUIDs instead to spread inserts across the index.

Sequences can result in "hard to read" code when using "lookup tables", especially when values are hard-coded in your database. Example: "where status_seq=1" is harder to read than "where status_id='ACTIVE'".

Downsides of using IDs: Mixed data types can cause confusion. Sometimes they're numeric, sometimes they're varchar or char. Many ORMs can confuse those and drop leading zeros, causing errors in your results. For example, 01234 != 1234, but your ORM may return 1234 instead of 01234.

Many people store IDs in human-readable form, like "VALID", or state abbreviations. This can cause headaches in the long run, so even if you do use IDs on a table, you may want to steer clear of ever showing those IDs directly to your customer.

ID fields are much more likely to "need to change" in the future than a sequence. Example: Let's say you have a country code table, a revolution takes place, and a country code changes. Do you really want to go through the main table and all the foreign keys that reference it, putting in the new country code? Your only other choice is living with the old country code. If you use a sequence in that case, you simply update other non-key columns in the base table and you're good to go.
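The country code scenario can be sketched as follows. This is an illustration in PostgreSQL syntax with invented table names, column names and codes, not a schema from the question.

```sql
-- Surrogate key carries the relationships; the code is ordinary data.
CREATE TABLE country (
    country_seq  integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    country_code char(2)      NOT NULL UNIQUE,
    country_name varchar(100) NOT NULL
);

CREATE TABLE customer (
    customer_seq  integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    country_seq   integer      NOT NULL REFERENCES country (country_seq),
    customer_name varchar(100) NOT NULL
);

-- When the code changes, one row in one table is updated.
-- No foreign key in customer has to be touched.
UPDATE country
SET    country_code = 'XY'      -- hypothetical new code
WHERE  country_code = 'XX';     -- hypothetical old code
```

Had country_code been the primary key, the same change would have required updating every referencing row (or relying on ON UPDATE CASCADE where the database supports it).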

Benefits:

Benefits of Sequences: Sequences are by nature automatically generated. IDs aren't always. When adding records, do you really want a programmer or user naming an ID that cannot be easily changed? When you use sequences, there's rarely a need to renumber things, and the underlying human-readable data can be easily changed if a mistake is made.

As mentioned above, they're always a numeric data type, and if used properly can assist in "navigating" your app (i.e., you usually only have to "pass around" one number to navigate your table structure).

When communicating between the DB and your programming language, you can count on being able to convert integers to integers without any weird data conversion issues.

IDs: The primary benefit is code that's easier to read, which we already explained above.

In summary, I think it should be decided on a case-by-case basis, depending on table and column usage. If you're going to use IDs, avoid the temptation to show the value to the user. If the table's not going to change and simply holds flags or "enum" type data, then IDs can certainly help with code readability. Otherwise, sequences are often the better choice for maintainability of your data.

Some people choose GUIDs or IDs to help with index performance, but personally, if there's any loss in code readability or the code gets more complex, I'd spend some money on better hardware before I'd write more complex code, as the benefit is minuscule.

Source: Oracle certified DBA (training on this exact subject), and 20+ years of experience working with developers and enterprise databases.

I'm a fan of sequences. I like it if all the IDs are the same type, and all the IDs come from the same sequence. It's not necessary, just something that lets you tease out the order in which things occur, which is often not so much a technical requirement as a debugging aid. I tend to favor bigint as my key type, so I'm pretty much guaranteed to never run out of IDs. If you're using int keys (or smaller), you'd want to use one sequence per table, since a shared sequence uses up the smaller range much faster.
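A shared sequence of this kind might look like the sketch below (PostgreSQL syntax; the sequence, tables and columns are hypothetical).

```sql
-- One bigint sequence feeds the keys of every table.
CREATE SEQUENCE global_id_seq;

CREATE TABLE account (
    id   bigint PRIMARY KEY DEFAULT nextval('global_id_seq'),
    name text   NOT NULL
);

CREATE TABLE payment (
    id         bigint PRIMARY KEY DEFAULT nextval('global_id_seq'),
    account_id bigint NOT NULL REFERENCES account (id),
    amount     numeric(12, 2) NOT NULL
);

INSERT INTO account (name) VALUES ('first account');

INSERT INTO payment (account_id, amount)
SELECT id, 10.00
FROM   account
WHERE  name = 'first account';
```

Because both defaults draw from global_id_seq, the payment row receives a larger id than the account row created before it, so ids from different tables can be compared to see roughly which row was created first. That only holds for rows that take the default; rows with explicitly supplied ids fall outside it.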

Having said that, there are issues to watch out for with sequences. For example, it's possible to "burn" sequence values without actually putting them in data. Again, this may or may not be a problem. Generally, I haven't had to care.

Sequences are typically implemented by making a default constraint on the ID column of a table. This means there are a couple of things to watch out for. It's possible that a value for the column is actually provided on the insert, which doesn't "bump" your sequence and may collide with future inserts that do not provide a value. This, to me, is the most significant concern. If all your IDs are provided by defaulting, this is a non-issue.
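The collision described above can be reproduced with a small sketch. It is written in PostgreSQL syntax with made-up names; other databases behave similarly but the error text and details differ.

```sql
CREATE SEQUENCE item_id_seq;

CREATE TABLE item (
    id   bigint PRIMARY KEY DEFAULT nextval('item_id_seq'),
    name text   NOT NULL
);

-- An explicit id bypasses the default, so the sequence is not advanced.
INSERT INTO item (id, name) VALUES (1, 'explicit id');

-- The default now draws the first sequence value, which is also 1,
-- so this statement fails with a duplicate key error.
INSERT INTO item (name) VALUES ('defaulted id');

-- The failed attempt still consumed the value, so a retry draws 2
-- and succeeds.
INSERT INTO item (name) VALUES ('defaulted id, second attempt');
```

An identity column declared GENERATED ALWAYS guards against the first statement: PostgreSQL rejects an explicit value unless the insert says OVERRIDING SYSTEM VALUE.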

Procedures (and remote clients) can reserve or "burn" sequence values. This is extremely convenient: it lets your procedure code know in advance what the IDs are without having to commit them to data. You can always do something like:

-- Reserve IDs up front by drawing them from the sequence while staging rows
insert into someTempTable (Id, Name)
select
  next value for dbo.MySequence,
  Name
from
  dbo.SomeTable;

...which burns sequence values, but the nice thing is, when I go to insert my rows from my work table someTempTable into the real table, I can rest assured that the IDs aren't going to conflict. This is simpler than with identity column-based IDs. I can build a whole series of related data in temp, and then move it all into persistent storage set-wise. This is usually a lot more efficient.
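For completeness, here is a sketch of the supporting objects and the final move into persistent storage, in SQL Server syntax to match the snippet above. dbo.RealTable and its columns are hypothetical names; only dbo.MySequence and someTempTable come from the answer.

```sql
-- The sequence the staging insert draws from (created once, beforehand).
CREATE SEQUENCE dbo.MySequence AS bigint START WITH 1 INCREMENT BY 1;

-- Hypothetical destination table. Its Id is a plain column, so it
-- accepts the values that were reserved during staging.
CREATE TABLE dbo.RealTable (
    Id   bigint        NOT NULL PRIMARY KEY,
    Name nvarchar(100) NOT NULL
);

-- Move everything over in one set-based statement. The Ids were drawn
-- from the sequence earlier, so no other session can have been handed
-- the same values.
INSERT INTO dbo.RealTable (Id, Name)
SELECT Id, Name
FROM   someTempTable;
```

Child rows staged in other work tables can reference the same Ids before anything is written to the real tables, which is what makes building related data in temp practical.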

I've not used sequences, but I can discuss identity fields.

First, they have worked quite nicely in every case where I have used them over the last 18 years of using SQL Server. This is most likely true on other databases as well, as this is a critical feature for the databases that use them. We have never had any problems surrounding the use of identities. You might want to define the identity as bigint when you set it up if you are expecting to have a very large database.

If you don't set up an identity at the time of table creation, it is a pain to set it up later in SQL Server; check your database's documentation for the details there. However, if you are using autogenerated keys exclusively as the PKs, you would do this at the time of table creation.

A critical thing when using identities (or sequences or GUIDs for that matter) is that in addition to the auto-generated value, you need to create a unique index for the natural key(s) in your table if you have them. This will prevent data integrity problems.
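A minimal sketch of that pairing, in PostgreSQL syntax with invented names, assuming a national ID number is the natural key:

```sql
CREATE TABLE employee (
    employee_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    national_id varchar(20)  NOT NULL,   -- assumed natural key
    full_name   varchar(200) NOT NULL
);

-- Without this index, the same person could be inserted twice and
-- simply receive two different generated employee_id values.
CREATE UNIQUE INDEX employee_national_id_uq
    ON employee (national_id);
```

The generated key guarantees that rows are distinguishable; only the unique index guarantees that the real-world entities are.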

Other problems can arise if numbers being skipped on rollback is an issue for you. Since these are meant to be placeholders, they should not have meaning, so it may not be a problem, but I have seen cases where people needed gap-free numbering for business rather than technical reasons. If you need values without gaps, test both mechanisms with rollbacks to see whether gaps appear. If both have gaps, then you will need to roll your own system, watching out for race conditions.
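A rollback test along those lines could be as small as the following sketch (PostgreSQL syntax, hypothetical table name). The same idea applies to a sequence-backed column; only the table definition changes.

```sql
CREATE TABLE gap_test (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    note text
);

-- This insert draws a value and is then rolled back.
BEGIN;
INSERT INTO gap_test (note) VALUES ('rolled back');
ROLLBACK;

-- This insert is kept.
INSERT INTO gap_test (note) VALUES ('kept');

-- If the generator is not transactional, the surviving row's id
-- skips the value drawn by the rolled-back insert.
SELECT id, note FROM gap_test;
```

In PostgreSQL, sequence values are not returned on rollback, so a gap is the expected result here. Running the equivalent test on DB2 with both an identity column and a sequence would show whether either mechanism behaves differently in your set-up.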

Since you say you are creating a database in DB2 to migrate to Postgres, I would set up a test with a couple of tables with identities in DB2 and a couple of tables with sequences. Insert a large amount of fake data into them. Then I would test how difficult it is to port them to the Postgres database and start adding records. This may be a key piece of data on which method is better in your particular case.

You might also consider testing performance by inserting a very large number of records into two test tables that are alike except in the way that they assign the Id. It may be that performance is acceptable both ways, or it may be that one is faster than the other. The following link is for SQL Server, but the test methodology is probably something you could find useful in making your determinations: http://dba-presents.com/index.php/sql-server/25-identity-vs-sequence-performance-test

It is critical to make your own determinations about things like performance, if that is a critical issue, because results can be affected by your own particular set-up.

If you want a meaningful ID based on some text values and an incrementing number (such as CA1, CA2, CA3, TX1, TX2, TX3), then an identity will not work, but I think a sequence could (see this article: PostgreSQL sequence based on another column). So sequences can give you more flexibility, but if you don't need it, then why bother?

I would probably also consider that it would be most confusing for maintenance (and in your case conversion) to sometimes use one and sometimes the other. Consistency in how you do things may be the key. If you have even one case where sequences give you flexibility that you must have and identities do not, I would use sequences throughout, just to avoid the unneeded complexity of tracking which table uses which mechanism when you do the conversion.

Db2 IDENTITY columns are backed by sequences (which support caching and out-of-order generation for higher performance); the difference is purely syntactic. With an identity column:

create table t1 (
  id integer not null generated always as identity cache 100,
  foobar varchar(111)
)

you do not provide the value for that column; it is generated and inserted automatically:

insert into t1 (foobar) values ('something')

If the column is not defined as IDENTITY, you must explicitly create a sequence and generate values when inserting rows:

create table t2 (
  id integer not null,
  foobar varchar(111)
);

create sequence myseq cache 100;

insert into t2 (id, foobar) values (next value for myseq, 'something else');

Similarly, when you need to redefine identity properties, you do that via the ALTER TABLE statement; for sequences you use ALTER SEQUENCE.
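As an illustration of that difference, restarting the numbering at an arbitrary value might look like this for the t1 and myseq objects above. The value 1000 is just an example, and the exact set of alterable properties should be checked against the Db2 documentation.

```sql
-- Identity column: the change goes through the owning table.
ALTER TABLE t1 ALTER COLUMN id RESTART WITH 1000;

-- Standalone sequence: the change is made on the sequence itself.
ALTER SEQUENCE myseq RESTART WITH 1000;
```

As far as I know, PostgreSQL 10 and later accept both statements in the same form, which is relevant for the planned migration, but that is worth verifying on the target version.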

Only one column in a table can be defined as IDENTITY; if you need more than one such column, you will have to use sequences for them.

Special treatment is necessary when mass-loading data using the LOAD or IMPORT utilities into tables with identity columns: you will need to either override or ignore identity values that may be present.


 