Multi-table, multi-row SQL select

Question

How would I list all of the info about a freelancer given the schema below? Including niche, language, market, etc. The issue I am having is that every freelancer can have multiple entries for each table. So, how would I do this? Is it even possible using SQL or would I need to use my primary language (golang) for this?

CREATE TABLE freelancer (
  freelancer_id         SERIAL PRIMARY KEY,
  ip                    inet NOT NULL,
  username              VARCHAR(20) NOT NULL,
  password              VARCHAR(100) NOT NULL,
  email                 citext NOT NULL UNIQUE,
  email_verified        int NOT NULL,
  fname                 VARCHAR(20) NOT NULL,
  lname                 VARCHAR(20) NOT NULL,
  phone_number          VARCHAR(30) NOT NULL,
  address               VARCHAR(50) NOT NULL,
  city                  VARCHAR(30) NOT NULL,
  state                 VARCHAR(30) NOT NULL,
  zip                   int NOT NULL,
  country               VARCHAR(30) NOT NULL
);

CREATE TABLE market (
market_id       SERIAL PRIMARY KEY,
market_name     VARCHAR(30) NOT NULL
);

CREATE TABLE niche (
niche_id        SERIAL PRIMARY KEY,
niche_name      VARCHAR(30) NOT NULL
);

CREATE TABLE medium (
medium_id       SERIAL PRIMARY KEY,
medium_name     VARCHAR(30) NOT NULL
);

CREATE TABLE format (
format_id       SERIAL PRIMARY KEY,
format_name     VARCHAR(30) NOT NULL
);

CREATE TABLE lang (
lang_id         SERIAL PRIMARY KEY,
lang_name       VARCHAR(30) NOT NULL
);

CREATE TABLE freelancer_by_niche (
id      SERIAL PRIMARY KEY,
niche_id        int NOT NULL REFERENCES niche (niche_id),
freelancer_id   int NOT NULL REFERENCES freelancer (freelancer_id)
);


CREATE TABLE freelancer_by_medium (
id      SERIAL PRIMARY KEY,
medium_id       int NOT NULL REFERENCES medium (medium_id),
freelancer_id   int NOT NULL REFERENCES freelancer (freelancer_id)

);

CREATE TABLE freelancer_by_market (
id      SERIAL PRIMARY KEY,
market_id       int NOT NULL REFERENCES market (market_id),
freelancer_id   int NOT NULL REFERENCES freelancer (freelancer_id)
);

CREATE TABLE freelancer_by_format (
id      SERIAL PRIMARY KEY,
format_id       int NOT NULL REFERENCES format (format_id),
freelancer_id   int NOT NULL REFERENCES freelancer (freelancer_id)

);

CREATE TABLE freelancer_by_lang (
id      SERIAL PRIMARY KEY,
lang_id         int NOT NULL REFERENCES lang (lang_id),
freelancer_id   int NOT NULL REFERENCES freelancer (freelancer_id)

);

Answer 1

SELECT *  
FROM freelancer  
INNER JOIN freelancer_by_niche USING (freelancer_id)  
INNER JOIN niche USING (niche_id)  
INNER JOIN freelancer_by_medium USING (freelancer_id)  
INNER JOIN medium USING (medium_id)  
INNER JOIN freelancer_by_market USING (freelancer_id)  
INNER JOIN market USING (market_id)  
INNER JOIN freelancer_by_format USING (freelancer_id)  
INNER JOIN format USING (format_id)  
INNER JOIN freelancer_by_lang USING (freelancer_id)  
INNER JOIN lang USING (lang_id);

If you want to drop the unnecessary attributes from the join tables (such as freelancer_by_format), you can list the columns explicitly:

SELECT a.ip, a.username, a.password, a.email, a.email_verified,  
a.fname, a.lname, a.phone_number, a.address, a.city,  
a.state, a.zip, a.country,  
b.niche_name, c.medium_name, d.market_name, e.format_name, f.lang_name  
FROM freelancer a  
INNER JOIN freelancer_by_niche USING (freelancer_id)  
INNER JOIN niche b USING (niche_id)  
INNER JOIN freelancer_by_medium USING (freelancer_id)  
INNER JOIN medium c USING (medium_id)  
INNER JOIN freelancer_by_market USING (freelancer_id)  
INNER JOIN market d USING (market_id)  
INNER JOIN freelancer_by_format USING (freelancer_id)  
INNER JOIN format e USING (format_id)  
INNER JOIN freelancer_by_lang USING (freelancer_id)  
INNER JOIN lang f USING (lang_id);

And if you want to change the column names, for example "market_name" to just "market", use column aliases:

SELECT a.ip, ... ,  
       d.market_name AS "market", e.format_name AS "format", ...  
FROM ...
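Because every freelancer can have several niches, media, markets, formats and languages, the joins above return one row per combination of them. If one row per freelancer is easier to consume from Go, a possible sketch is to aggregate each list in its own correlated subquery, so the lists do not multiply each other. This assumes PostgreSQL (array_agg with DISTINCT and ORDER BY); the aliases f, fn, n and the output column names are only illustrative, and string_agg could be swapped in if plain text is preferred over arrays.

```sql
-- One row per freelancer; each related list is collected separately.
-- A list is NULL when the freelancer has no entries in that table.
SELECT f.freelancer_id, f.username, f.email,
       (SELECT array_agg(DISTINCT n.niche_name ORDER BY n.niche_name)
          FROM freelancer_by_niche fn
          INNER JOIN niche n USING (niche_id)
         WHERE fn.freelancer_id = f.freelancer_id) AS niches,
       (SELECT array_agg(DISTINCT m.medium_name ORDER BY m.medium_name)
          FROM freelancer_by_medium fm
          INNER JOIN medium m USING (medium_id)
         WHERE fm.freelancer_id = f.freelancer_id) AS media,
       (SELECT array_agg(DISTINCT k.market_name ORDER BY k.market_name)
          FROM freelancer_by_market fk
          INNER JOIN market k USING (market_id)
         WHERE fk.freelancer_id = f.freelancer_id) AS markets,
       (SELECT array_agg(DISTINCT o.format_name ORDER BY o.format_name)
          FROM freelancer_by_format fo
          INNER JOIN format o USING (format_id)
         WHERE fo.freelancer_id = f.freelancer_id) AS formats,
       (SELECT array_agg(DISTINCT l.lang_name ORDER BY l.lang_name)
          FROM freelancer_by_lang fl
          INNER JOIN lang l USING (lang_id)
         WHERE fl.freelancer_id = f.freelancer_id) AS langs
FROM freelancer f
ORDER BY f.freelancer_id;
```

Unlike the INNER JOIN version, this also keeps freelancers that have no entry in one of the join tables.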

Remarks

In your join tables (for example freelancer_by_niche) there is no UNIQUE constraint on freelancer_id, which means that the same freelancer can appear in multiple niches (that's OK and probably intended).

But you also don't have a UNIQUE constraint on the pair of attributes (freelancer_id, niche_id), which means that a freelancer could be in the SAME niche multiple times ("Joe is in electronics. Three times"). You can prevent that by making (freelancer_id, niche_id) UNIQUE in freelancer_by_niche. That way you also would not need the surrogate (artificial) PRIMARY KEY freelancer_by_niche (id).

So what could go wrong then?

For example imagine the same information about a freelancer in the same niche three times (the same data parts of the row three times):

freelancer_by_niche  
id | freelancer_id | niche_id  
 1 |       1       |    1    -- <-- same data (1, 1), different serial id
 2 |       1       |    1    -- <-- same data (1, 1), different serial id
 3 |       1       |    1    -- <-- same data (1, 1), different serial id

Then the above query would return each possible row three (!) times with the same (!) content, because each of the three freelancer_by_niche rows is combined with all the rows from the other JOINs.
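To check whether a join table already contains such repeated pairs, a simple grouping query is enough. This is a sketch for freelancer_by_niche; the alias copies is just an illustrative name, and the same pattern applies to the other join tables.

```sql
-- List every (freelancer_id, niche_id) pair that is stored more than once.
SELECT freelancer_id, niche_id, count(*) AS copies
FROM freelancer_by_niche
GROUP BY freelancer_id, niche_id
HAVING count(*) > 1
ORDER BY copies DESC;
```

An empty result means the table currently holds no data duplicates.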

You can eliminate the duplicates by writing SELECT DISTINCT a.freelancer_id, ... FROM ... in the query above. But what if you get many duplicate rows, for example 10 data duplicates in each of the 5 join tables (freelancer_by_niche, freelancer_by_medium etc.)? You would get 10 * 10 * 10 * 10 * 10 = 10^5 = 100,000 copies of every result row, all carrying exactly the same information. If you then ask your DBMS to eliminate duplicates with SELECT DISTINCT ..., it has to sort (or hash) 100,000 copies per distinct row, because duplicates can only be detected by sorting or hashing. If you have 1,000 distinct rows for freelancers on markets, niches, languages etc., you are asking your DBMS to process 1,000 * 100,000 = 100,000,000 rows just to reduce them back to the 1,000 unique ones. That is almost 100 million unnecessary rows.

Please make UNIQUE (freelancer_id, niche_id) for freelancer_by_niche and the other JOIN tables.

(By data duplicates I mean rows where the data (niche_id, freelancer_id) is the same and only the id, an auto-incremented serial, differs.)
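If a join table already holds data, the constraint can be added in place once existing duplicates are removed. The following is a sketch for freelancer_by_niche under the original schema (with the id column); the constraint name freelancer_by_niche_uniq is a made-up example, and the same two statements would be repeated for the other join tables.

```sql
-- Keep the row with the lowest id for each (freelancer_id, niche_id) pair
-- and delete the other copies.
DELETE FROM freelancer_by_niche a
USING freelancer_by_niche b
WHERE a.freelancer_id = b.freelancer_id
  AND a.niche_id = b.niche_id
  AND a.id > b.id;

-- From now on the same pair can only be stored once.
ALTER TABLE freelancer_by_niche
  ADD CONSTRAINT freelancer_by_niche_uniq UNIQUE (freelancer_id, niche_id);
```

Running both statements inside one transaction avoids new duplicates slipping in between the cleanup and the constraint.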

You can easily reproduce the problem by doing the following:

-- This duplicates all data in your join tables once. Run it several times.
-- The columns are named explicitly so that id gets a fresh serial value.
INSERT INTO freelancer_by_niche (niche_id, freelancer_id)
  SELECT niche_id, freelancer_id FROM freelancer_by_niche;
INSERT INTO freelancer_by_medium (medium_id, freelancer_id)
  SELECT medium_id, freelancer_id FROM freelancer_by_medium;
INSERT INTO freelancer_by_market (market_id, freelancer_id)
  SELECT market_id, freelancer_id FROM freelancer_by_market;
INSERT INTO freelancer_by_format (format_id, freelancer_id)
  SELECT format_id, freelancer_id FROM freelancer_by_format;
INSERT INTO freelancer_by_lang (lang_id, freelancer_id)
  SELECT lang_id, freelancer_id FROM freelancer_by_lang;

Display the duplicates using

SELECT * FROM freelancer_by_lang;

Now try the SELECT * FROM freelancer INNER JOIN ... query again. If it still runs fast, repeat all the INSERT INTO freelancer_by_niche ... statements again and again until the query takes forever to compute its result. Either way you will get duplicate rows, which you can only remove with DISTINCT.
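To watch the growth as a number instead of scrolling through the result, one option is to count the joined rows and then let PostgreSQL report the cost of the DISTINCT version. This is only a sketch: the column list in the second statement is a shortened, illustrative selection, and EXPLAIN ANALYZE really executes the query, so it can take a while on heavily duplicated data.

```sql
-- How many rows does the full join produce, duplicates included?
SELECT count(*) AS joined_rows
FROM freelancer
INNER JOIN freelancer_by_niche USING (freelancer_id)
INNER JOIN niche USING (niche_id)
INNER JOIN freelancer_by_medium USING (freelancer_id)
INNER JOIN medium USING (medium_id)
INNER JOIN freelancer_by_market USING (freelancer_id)
INNER JOIN market USING (market_id)
INNER JOIN freelancer_by_format USING (freelancer_id)
INNER JOIN format USING (format_id)
INNER JOIN freelancer_by_lang USING (freelancer_id)
INNER JOIN lang USING (lang_id);

-- Show the plan and the measured run time of removing the duplicates.
EXPLAIN ANALYZE
SELECT DISTINCT a.username, b.niche_name, c.medium_name,
       d.market_name, e.format_name, f.lang_name
FROM freelancer a
INNER JOIN freelancer_by_niche USING (freelancer_id)
INNER JOIN niche b USING (niche_id)
INNER JOIN freelancer_by_medium USING (freelancer_id)
INNER JOIN medium c USING (medium_id)
INNER JOIN freelancer_by_market USING (freelancer_id)
INNER JOIN market d USING (market_id)
INNER JOIN freelancer_by_format USING (freelancer_id)
INNER JOIN format e USING (format_id)
INNER JOIN freelancer_by_lang USING (freelancer_id)
INNER JOIN lang f USING (lang_id);
```

Each round of the duplicating INSERTs doubles every join table, so joined_rows should grow by a factor of up to 2^5 = 32 per round.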

Create UNIQUE data JOIN tables

You can prevent duplicates in your join tables. Remove the id SERIAL PRIMARY KEY and replace it with a multi-attribute PRIMARY KEY (a, b):

CREATE TABLE freelancer_by_niche (
   niche_id        int NOT NULL REFERENCES niche (niche_id),
   freelancer_id   int NOT NULL REFERENCES freelancer (freelancer_id), 
   PRIMARY KEY (freelancer_id, niche_id)
);

(Apply this to all your join tables.) The PRIMARY KEY (freelancer_id, niche_id) will create a UNIQUE index, so you can no longer insert duplicate data. Try the INSERTs above: they will be rejected, because the information is already there once. Adding it another time would not add any information AND would make your query much slower.
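As a quick illustration of how the composite key behaves, here is a sketch that assumes a freelancer with freelancer_id 1 and a niche with niche_id 1 already exist (both values are made up). The ON CONFLICT clause requires PostgreSQL 9.5 or later.

```sql
-- First insert of the pair succeeds.
INSERT INTO freelancer_by_niche (freelancer_id, niche_id)
VALUES (1, 1);

-- Repeating the plain INSERT above would now fail with a
-- unique-violation error on the primary key.

-- If the application may send the same pair again, the repeat can be
-- skipped silently instead of raising an error:
INSERT INTO freelancer_by_niche (freelancer_id, niche_id)
VALUES (1, 1)
ON CONFLICT (freelancer_id, niche_id) DO NOTHING;
```

Whether to surface the error or ignore the repeat is a design choice for the application code.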

NON-unique index on the other part of the JOIN tables

With PRIMARY KEY (freelancer_id, niche_id), Postgres creates a unique index on these two attributes (columns). Accessing or JOINing by freelancer_id is fast, because it comes first in the index. Accessing or JOINing by freelancer_by_niche.niche_id alone can be slow, because that index is of little use for it and Postgres may fall back to a full table scan on freelancer_by_niche.

Therefore you should also create an INDEX on the second part, niche_id, of the table freelancer_by_niche:

CREATE INDEX ON freelancer_by_niche (niche_id);

Then joins into this table on niche_id will also be faster, because they are accelerated by an index. The index makes queries faster (usually).
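Whether the planner actually uses the new index can be checked with EXPLAIN. A minimal sketch, where the value 1 is just an example niche_id:

```sql
-- Show the chosen plan without executing the query.
EXPLAIN
SELECT freelancer_id
FROM freelancer_by_niche
WHERE niche_id = 1;
```

On a very small table the planner may still pick a sequential scan because that is cheaper there, so the index usually only shows up in the plan once the table has grown.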

Summary

You have a well-normalized database schema, which is very good! But small improvements can be made (see above).
