
How to get fast results from a SQL query

I have a PostgreSQL database table that contains over 5 million entries. I also have a CSV file that contains 100,000 entries.

I need to run a query that pulls the data from the database related to the entries in the CSV file.

However, as far as I understand and in my own experience, this kind of query takes ages to complete (more than 6 hours, at a guess).

So, given the newest findings and tools, is there a better, faster solution to perform this same task?

The fast lane: create a temporary table matching the structure of the CSV file (possibly using an existing table as a template for convenience) and use COPY:

Bulk load

CREATE TEMP TABLE tmp(email text);

COPY tmp FROM 'path/to/file.csv';
ANALYZE tmp;                       -- do that for bigger tables!
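
Note that plain COPY expects PostgreSQL's default text format. If the file is actually comma-separated, possibly with a header line (an assumption; you did not describe the file layout), add CSV options:

COPY tmp FROM 'path/to/file.csv' WITH (FORMAT csv, HEADER);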

I am assuming the emails in the CSV are unique; you did not specify. If they are not, make them unique:

CREATE TEMP TABLE tmp0 AS
SELECT DISTINCT email
FROM   tmp
ORDER  BY email;  -- ORDER BY is cheap in combination with DISTINCT ..
                  -- .. and may or may not improve performance additionally.

DROP TABLE tmp;
ALTER TABLE tmp0 RENAME TO tmp;

Index

For your particular case, a unique index on email is in order. It is much more efficient to create the index after loading and sanitizing the data. This way you also prevent COPY from bailing out with a unique violation if there are dupes:

CREATE UNIQUE INDEX tmp_email_idx ON tmp (email);
On second thought, if all you do is update the big table, you don't need an index on the temporary table at all. It will be read sequentially.

You commented: "Yes DB table is indexed using primary key."

The only relevant index in this case:

CREATE INDEX tbl_email_idx ON tbl (email);

Make that CREATE UNIQUE INDEX ... if possible.

Update

To update your table as detailed in your later comment:

UPDATE tbl t
SET    ...
FROM   tmp 
WHERE  t.email = tmp.email;
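
For instance, a hedged instantiation with a made-up status column, just to show the shape of the statement (the actual SET list depends on your schema):

UPDATE tbl t
SET    status = 'matched'   -- hypothetical column, for illustration
FROM   tmp
WHERE  t.email = tmp.email;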

All of this can easily be wrapped into a plpgsql or sql function.
Note that COPY requires dynamic SQL with EXECUTE in a plpgsql function if you want to parameterize the file name.
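
A minimal sketch of such a wrapper, assuming the tmp table from above; the function name load_emails and the parameter _file are made up for illustration:

CREATE OR REPLACE FUNCTION load_emails(_file text)
  RETURNS void
  LANGUAGE plpgsql AS
$func$
BEGIN
   CREATE TEMP TABLE IF NOT EXISTS tmp(email text);
   -- COPY takes no parameters, so the file name is quoted
   -- safely with format(%L) and executed as dynamic SQL
   EXECUTE format('COPY tmp FROM %L', _file);
   ANALYZE tmp;
END
$func$;

SELECT load_emails('path/to/file.csv');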

Temporary tables are dropped at the end of the session automatically by default.
Related answer:
How to bulk insert only new rows in PostgreSQL

Just a small addition to Erwin's answer: if you just want to check whether an email is in the CSV file, the code could be something like this:

create temp table tmp_emails (email text primary key);

copy tmp_emails from 'path/emails.csv';
analyze tmp_emails;

update <your table> as d set
    ...
where exists (select * from tmp_emails as e where e.email = d.email);

I think it may be possible to create a table-returning function which reads your CSV, and call it like:

update <your table> as d set
    ...
where exists (select * from csv_func('path/emails.csv') as e where e.email = d.email);

But I have no PostgreSQL installed here to try it; I'll do it later.
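
A minimal sketch of what such a csv_func could look like, staging the file in a temp table with dynamic COPY; the staging table name _csv_tmp is made up for illustration:

create or replace function csv_func(_file text)
returns table (email text)
language plpgsql as
$func$
begin
    -- stage the file; copy needs dynamic SQL because the
    -- file name cannot be passed as a parameter
    create temp table if not exists _csv_tmp (email text);
    truncate _csv_tmp;
    execute format('copy _csv_tmp from %L', _file);
    return query select t.email from _csv_tmp as t;
end
$func$;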

If I understand you correctly, you have a CSV file with some field that contains a KEY, which is used to search through your PostgreSQL table.

I don't know what programming language you can use for this task, but in general there are two ways to solve the speed problem:

First method, programming:

  1. Load the CSV file into memory. Even if your CSV has 500 bytes per line, it would take only 100,000 * 500 = 50 megabytes of RAM.
  2. Build a search index over the KEY field of the CSV. For example, in PHP you can build an array whose keys are your KEY field values; in C++ you can use a hash table such as std::unordered_map from the standard library; other languages offer their own variants.
  3. The PostgreSQL table should be indexed on the field that matches your KEY field.
  4. Use the CSV keys loaded in memory to construct batched queries like "SELECT * FROM table WHERE key IN (1,2,3,4,5,6,7,8,9)", where "1,2,3,4..." is a chunk (for example, one hundred) of your KEYs from the CSV; see the sketch after this list.
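
A hedged sketch of the batched lookup on the PostgreSQL side; the table name tbl and the column name key are assumptions, and key = ANY(array) is an equivalent form that avoids assembling IN lists by hand:

PREPARE fetch_batch(int[]) AS
SELECT * FROM tbl WHERE key = ANY($1);

-- one round trip per batch of keys from the CSV
EXECUTE fetch_batch(ARRAY[1,2,3,4,5,6,7,8,9]);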

Second method, plain SQL:

  1. Create a table and load the CSV into it
  2. Create an index on the field used for the search
  3. Create an index on the 5-million-row table
  4. Use JOIN to get the linked tables' data; see the sketch after this list
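
A minimal sketch of that join, under assumed names: tbl is the big table, csv_data is the table loaded from the CSV, and both are indexed on key:

SELECT t.*
FROM   tbl t
JOIN   csv_data c ON c.key = t.key;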

Which way to choose depends on your real task. For example, in my experience, I had to build an interface to load price lists into a database; before actually loading, the imported XLS file had to be shown with information about "current" and "new" prices, and because of the large size of the XLS file pagination was needed, so the variant with KEY IN (1,2,3,4,5,6) suited best.
