简体   繁体   中英

Integer vs char for DB record property vs Wordpress schema

I asked a similar question before ( integer-vs-char-for-db-record-property ) but stumbled upon something that goes against all recommendations that I received in my previous post. In Wordpress 3, the most popular and mature open source blog script, post status is stored as VARCHAR(20) in db - 'publish', 'auto-draft', 'inherit', 'pending', etc. and not as INT with lookup table or mapped string constants, or CHAR , or anything like that. This also applies to the field post_type ('post', 'attachment', 'revision', etc.) and some other fields. So to find all published posts I need to run something like SELECT * FROM posts WHERE post_status = 'published' AND post_type = 'post' . Also, there is a multiple column index on post_status, post_type and some other columns which certainly speeds up this kind of search. Can someone explain why they made it this way and not the other, and what are the benefits and drawbacks of this approach?

Just because some application is well known doesn't mean they had a good database design. This tends to violate the normalization rules. Maybe they get better performance and maybe they didn't look at the other possibilities when they chose this one becasue they didn;t know better. Maybe they were aplication programmers designing a database without understanding database theory very well or maybe it was a deliberate denormailzation with performance stats to back it up. Or maybe they didn't think the chances of having to update 100 million records when we decided we wanted to change the value from 'published' to something else. Maybe they only tested performance on selects but not on updates. Maybe the values genuniely are unlikey to change, so it's not such a big deal to denormalize. We can't know from here.

Normalization is not about replacing strings with numbers or "sharing" strings just because they have the same letters.

I don't know their design, but the following scenario is perfectly normalized even though it uses strings as identifiers.

create table post_statuses(
   status varchar(20) not null
  ,primary key(status)
);

insert into post_statuses values('publish');
insert into post_statuses values('inherit');
insert into post_statuses values('pending');

create table posts(
   post_id ...
   status varchar(20) not null
  ,primary key(post_id)
  ,foreign key(status) references post_statuses(status)
);

The main benefit of using natural keys over surrogate keys is that it reduces the number of joins needed and also the likelyhood that entire classes of queries can be answered from index only. The main drawbacks are increased storage and the possibility of having a hell if we need to change the values.

I'd guess that the WP devs simply avoided what they felt was premature optimization, and opted for better readability instead.

"SELECT * FROM posts WHERE post_status = 'published' AND post_type = 'post'"

is a little little bit easier to read than

"SELECT * FROM posts WHERE post_status = ".WP_POST_STATUS_PUBLISHED."
    AND post_type = ".WP_POST_TYPE_POST.""

And when a new WP developer runs a select * from ... query, the database table lists 'published' rather than 3 or 5, which is easier to understand and debug.

From the point of view of disk storage space, either approach is fairly okay I think — some more post_status bytes shouldn't matter much in comparison to the blog post text and all other columns. An integer is 8 bytes (well unless it's a tinyint) and 'published' is perhaps 10 bytes so doesn't matter much?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM