Normalization in database with countries as columns

This has been bugging me for a while. Consider a table with attributes like this: {ID, Value, Australia, India, France, Germany}, where ID is the primary key, Value is some text (say, a car model), and each country attribute such as Australia or India holds the number of cars manufactured in that country for that value.

Intuitively I know that the correct way to put this is {ID, Value, Cars-Manufactured, Country}, but can someone tell me why this is correct in terms of database normalization? Which normal form does the first table fail to meet? Or is the first table correct too?

The rule it violates is "no repeating groups". This is one of the rules for first normal form.

A column for each country is a repeating group. The data under each column is the same data, just applicable in a different context. When there is only one value there -- like number of cars made in that country -- this may not be obvious, maybe it's even debatable. But suppose we need two pieces of information for each country, like number manufactured and number sold. Now the table has a set of paired columns: Australia_manufactured, Australia_sold, India_manufactured, India_sold, France_manufactured, France_sold, etc. You have a set of two columns repeated multiple times.

Someone could ask: what is the difference between multiple distinct fields and a repeating group? How is "India_manufactured, Australia_manufactured, France_manufactured" different from "number_manufactured, sale_price, description"? The difference is that in the first case the semantic meaning of each value is the same; all that differs is the context it applies to. In the second case, the semantic meaning is different. That is, beyond something trivial like "find the biggest value", it is hard to imagine a query or program that we would run today on number_manufactured and then run tomorrow, doing exactly the same processing, on sale_price. But we could easily imagine running it today for India and tomorrow for Germany.

Of course there are times when it can be ambiguous. That's why database designers get paid the big bucks. :-)

Okay, that's the rule. Does the rule have practical value?

Let's consider scenario A, one table:

model (model_id, description, india_manufactured, australia_manufactured, france_manufactured)

Scenario B, two tables:

model (model_id, description)
production (model_id, country_code, manufactured)

There are a number of reasons why scenario A sucks. Here's the biggest:

Queries are much simpler with Scenario B. We do not have to hard-code countries into our program or query. Write a query to accept a country code as a parameter and return the number of each model manufactured in that country. In scenario B, simple:

select description, manufactured
from model join production on model.model_id=production.model_id
where production.country_code=@country_code

Easy. Now do it with scenario A. Something like:

select description,
  case when @country_code='IN' then india_manufactured
  when @country_code='AU' then australia_manufactured
  when @country_code='FR' then france_manufactured
  else null
  end as manufactured
from model
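
To try the scenario B lookup end to end, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows and the helper name manufactured_in are made up for illustration, and SQLite's "?" placeholder stands in for the @country_code parameter used above.

```python
import sqlite3

# In-memory database with the scenario B layout; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model (
        model_id    INTEGER PRIMARY KEY,
        description TEXT NOT NULL
    );
    CREATE TABLE production (
        model_id     INTEGER NOT NULL REFERENCES model(model_id),
        country_code TEXT    NOT NULL,
        manufactured INTEGER NOT NULL,
        PRIMARY KEY (model_id, country_code)
    );
    INSERT INTO model VALUES (1, 'Hatchback'), (2, 'Sedan');
    INSERT INTO production VALUES (1, 'IN', 500), (1, 'AU', 120), (2, 'IN', 300);
""")


def manufactured_in(conn, country_code):
    """Return (description, manufactured) pairs for one country (hypothetical helper)."""
    cursor = conn.execute(
        """
        SELECT m.description, p.manufactured
        FROM model AS m
        JOIN production AS p ON m.model_id = p.model_id
        WHERE p.country_code = ?
        ORDER BY m.description
        """,
        (country_code,),  # the country is data passed in, not code baked into the query
    )
    return cursor.fetchall()


print(manufactured_in(conn, "IN"))  # [('Hatchback', 500), ('Sedan', 300)]
print(manufactured_in(conn, "AU"))  # [('Hatchback', 120)]
```

The same function works for any country code, including ones added to the data later; nothing in the query text names a specific country.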

Or suppose we want the total produced in all countries. Scenario B:

select description, sum(manufactured) as total_manufactured
from model
join production on model.model_id=production.model_id
group by model.model_id, description

Scenario A:

select description, india_manufactured+australia_manufactured+france_manufactured
from model

(Might be more complex if we have to allow for nulls.)
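
The point about nulls can be made concrete with a small sketch, again using Python's sqlite3 with made-up data. The table name model_wide is only an illustrative stand-in for the scenario A table, so that both layouts can live in one database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Scenario A: one column per country; the Australia figure is unknown (NULL).
    CREATE TABLE model_wide (
        model_id               INTEGER PRIMARY KEY,
        description            TEXT NOT NULL,
        india_manufactured     INTEGER,
        australia_manufactured INTEGER,
        france_manufactured    INTEGER
    );
    INSERT INTO model_wide VALUES (1, 'Hatchback', 500, NULL, 80);

    -- Scenario B: the unknown figure is simply a row that does not exist.
    CREATE TABLE model (model_id INTEGER PRIMARY KEY, description TEXT NOT NULL);
    CREATE TABLE production (
        model_id     INTEGER NOT NULL REFERENCES model(model_id),
        country_code TEXT    NOT NULL,
        manufactured INTEGER NOT NULL,
        PRIMARY KEY (model_id, country_code)
    );
    INSERT INTO model VALUES (1, 'Hatchback');
    INSERT INTO production VALUES (1, 'IN', 500), (1, 'FR', 80);
""")

# Scenario A, naive: adding a NULL makes the whole sum NULL (None in Python).
naive_total = conn.execute(
    "SELECT india_manufactured + australia_manufactured + france_manufactured"
    " FROM model_wide"
).fetchone()[0]

# Scenario A, null-safe: every country column needs its own COALESCE.
safe_total = conn.execute(
    "SELECT COALESCE(india_manufactured, 0)"
    " + COALESCE(australia_manufactured, 0)"
    " + COALESCE(france_manufactured, 0)"
    " FROM model_wide"
).fetchone()[0]

# Scenario B: SUM adds up whatever rows exist; no per-country handling.
long_totals = conn.execute(
    """
    SELECT m.description, SUM(p.manufactured)
    FROM model AS m
    JOIN production AS p ON m.model_id = p.model_id
    GROUP BY m.model_id, m.description
    """
).fetchall()

print(naive_total, safe_total, long_totals)
```

Here naive_total comes back as None, while the null-safe scenario A query and the scenario B query both give 580.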

We'd likely have many, many such queries throughout the system. In real life, many would be much more complex than this, with multiple such messy case clauses or juggling multiple columns. Now suppose we add another country. In scenario B, this is zero effort. We can add and delete countries all we like and the queries don't change. But in scenario A, we would have to find every query and change it. If we miss one, we won't get any compile errors or anything like that. We'll just mysteriously get incorrect results.
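
As a rough illustration of the "zero effort" claim, the sketch below (Python's sqlite3, made-up data, hypothetical names TOTALS_QUERY and totals) runs the same scenario B query before and after a new country appears in the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model (model_id INTEGER PRIMARY KEY, description TEXT NOT NULL);
    CREATE TABLE production (
        model_id     INTEGER NOT NULL REFERENCES model(model_id),
        country_code TEXT    NOT NULL,
        manufactured INTEGER NOT NULL,
        PRIMARY KEY (model_id, country_code)
    );
    INSERT INTO model VALUES (1, 'Hatchback');
    INSERT INTO production VALUES (1, 'IN', 500), (1, 'AU', 120);
""")

# Written once; it never mentions a specific country.
TOTALS_QUERY = """
    SELECT m.description, SUM(p.manufactured)
    FROM model AS m
    JOIN production AS p ON m.model_id = p.model_id
    GROUP BY m.model_id, m.description
"""


def totals(conn):
    """Map each model description to its total across all countries."""
    return dict(conn.execute(TOTALS_QUERY).fetchall())


before = totals(conn)

# "Adding a country" is just inserting rows; no schema or query change.
conn.execute("INSERT INTO production VALUES (1, 'DE', 200)")

after = totals(conn)
print(before, after)  # {'Hatchback': 620} {'Hatchback': 820}
```

With the scenario A layout, the equivalent change would be an ALTER TABLE plus an edit to every query that adds up or switches on the country columns.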

Oh, and by the way, it's likely that there will be times when we only want to process some of the countries. Like, say some of the countries have a VAT and some don't, or whatever. In scenario B, we add a column for this fact and test on it. That's just "join country on country.country_code=production.country_code and country.vat=1". In scenario A the programmer would almost surely end up hard-coding the list of specific countries in each query. Then someone comes along later and sees that query X processes India and France and query Y processes France and Germany and query Z processes Germany and Singapore and he might well have no idea why. Even if he knows, the list is hard-coded in every query so every update requires updating every query, changing code rather than changing data.
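
A sketch of the VAT idea, assuming a country lookup table with a vat flag as described above. The flag values below are invented purely for the example and say nothing about real tax rules; vat_total is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (
        country_code TEXT PRIMARY KEY,
        vat          INTEGER NOT NULL  -- 1 = has VAT, 0 = does not (made-up values below)
    );
    CREATE TABLE production (
        model_id     INTEGER NOT NULL,
        country_code TEXT    NOT NULL REFERENCES country(country_code),
        manufactured INTEGER NOT NULL,
        PRIMARY KEY (model_id, country_code)
    );
    INSERT INTO country VALUES ('IN', 1), ('AU', 0), ('FR', 1);
    INSERT INTO production VALUES (1, 'IN', 500), (1, 'AU', 120), (1, 'FR', 80);
""")


def vat_total(conn):
    """Total manufactured across countries flagged vat = 1."""
    row = conn.execute(
        """
        SELECT SUM(p.manufactured)
        FROM production AS p
        JOIN country AS c
          ON c.country_code = p.country_code AND c.vat = 1
        """
    ).fetchone()
    return row[0]


first = vat_total(conn)

# The rule for a country changes: update the data, not the query.
conn.execute("UPDATE country SET vat = 1 WHERE country_code = 'AU'")

second = vat_total(conn)
print(first, second)  # 580 700
```

The reason a country is included or excluded now lives in one named column instead of being implied by a hard-coded list in each query.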

Oh, and by the way, suppose we come across a query that only processes three of the four countries. How do we know whether this is a mistake (someone forgot one of the countries when writing the query, or missed this query when a new country was added) or whether there is some reason why this country was excluded?

The second approach is better for you, as you will have better clarity in terms of the data and can also avoid INSERT, DELETE, and UPDATE anomalies. Yes, with the second approach you will have more rows.

Basically when you design a DB the normal approach is to go for 3NF.

Table COUNTRYANDCARS [MODEL (PK), AUSTRALIA, INDIA, FRANCE, GERMANY]

The above approach is acceptable only when the set of countries is fixed.

Table CARPRODUCTION [MODEL (PK), COUNTRY (PK), COUNT]

This design works for any number of countries.
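
A minimal sketch of the CARPRODUCTION layout in Python's sqlite3, with made-up rows. The column is named car_count here rather than COUNT only to avoid clashing with the SQL function name; that renaming is an assumption of this example, not part of the design above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE car_production (
        model     TEXT    NOT NULL,
        country   TEXT    NOT NULL,
        car_count INTEGER NOT NULL,
        PRIMARY KEY (model, country)  -- one row per model per country
    )
    """
)
conn.execute("INSERT INTO car_production VALUES ('Hatchback', 'India', 500)")
conn.execute("INSERT INTO car_production VALUES ('Hatchback', 'Germany', 200)")

# A second figure for the same (model, country) pair violates the composite key.
try:
    conn.execute("INSERT INTO car_production VALUES ('Hatchback', 'India', 999)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM car_production").fetchone()[0]
print(duplicate_rejected, row_count)  # True 2
```

The composite key plays the role that the fixed set of columns played in the one-table design: each model still has at most one figure per country, but the list of countries is open-ended.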
