I have a PostgreSQL table which contains over 5 million rows. I also have a CSV file which contains 100,000 entries.
I need to run a query that fetches the rows from the DB that relate to the CSV file's data.
In my experience, this kind of query takes ages to complete (my guess: more than 6 hours).
Given current tools and techniques, is there a better, faster way to perform this task?
The fast lane: create a temporary table matching the structure of the CSV file (possibly using an existing table as a template for convenience) and use COPY:
CREATE TEMP TABLE tmp(email text);
COPY tmp FROM 'path/to/file.csv';
ANALYZE tmp; -- do that for bigger tables!
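If the file has a header row or uses actual CSV formatting (quoting, delimiters), COPY accepts options for that. A sketch, assuming a comma-separated file with a header line:

```sql
COPY tmp FROM '/path/to/file.csv' (FORMAT csv, HEADER true);
```

Note that the file path must be readable by the server process; when loading from a client machine, use psql's \copy meta-command instead, which reads the file client-side.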
I am assuming the emails in the CSV are unique; you did not specify. If they are not, make them unique:
CREATE TEMP TABLE tmp0 AS
SELECT DISTINCT email
FROM tmp
ORDER BY email; -- ORDER BY is cheap in combination with DISTINCT ..
-- .. and may or may not improve performance additionally.
DROP TABLE tmp;
ALTER TABLE tmp0 RENAME TO tmp;
For your particular case a unique index on email is in order. It is much more efficient to create the index after loading and sanitizing the data. This way you also prevent COPY from bailing out with a unique violation if there should be dupes:
CREATE UNIQUE INDEX tmp_email_idx ON tmp (email);
On second thought, if all you do is update the big table, you don't need an index on the temporary table at all. It will be read sequentially.
You commented that the DB table is indexed using the primary key.
The only relevant index in this case:
CREATE INDEX tbl_email_idx ON tbl (email);
Make that CREATE UNIQUE INDEX ... if possible.
To update your table as detailed in your later comment:
UPDATE tbl t
SET ...
FROM tmp
WHERE t.email = tmp.email;
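For illustration, assuming a hypothetical verified column to be set on the matched rows (the column name is not from the question):

```sql
UPDATE tbl t
SET    verified = true   -- placeholder assignment, adapt to your table
FROM   tmp
WHERE  t.email = tmp.email;
```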
All of this can easily be wrapped into a plpgsql or sql function. Note that COPY requires dynamic SQL with EXECUTE in a plpgsql function if you want to parameterize the file name.
Temporary tables are dropped at the end of the session automatically by default.
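A sketch of such a function; the function name, the target table tbl and the verified column are placeholders, not taken from the question:

```sql
CREATE OR REPLACE FUNCTION f_update_from_csv(_path text)
  RETURNS void
  LANGUAGE plpgsql AS
$func$
BEGIN
   CREATE TEMP TABLE tmp(email text) ON COMMIT DROP;

   -- COPY cannot take the file name as a parameter,
   -- so it has to go through dynamic SQL:
   EXECUTE format('COPY tmp FROM %L', _path);

   ANALYZE tmp;

   UPDATE tbl t
   SET    verified = true   -- placeholder assignment, adapt to your table
   FROM   tmp
   WHERE  t.email = tmp.email;
END
$func$;
```

format() with %L safely quotes the path as a string literal, which avoids SQL injection through the file name parameter.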
Related answer:
How to bulk insert only new rows in PostgreSQL
Just a small addition to Erwin's answer - if you just want to check whether an email is in the CSV file, the code could be something like this:
create temp table tmp_emails (email text primary key);
copy tmp_emails from 'path/emails.csv';
analyze tmp_emails;
update <your table> as d set
...
where exists (select * from tmp_emails as e where e.email = d.email);
It may also be possible to create a table-returning function which reads your CSV, and call it like:
update <your table> as d set
...
where exists (select * from csv_func('path/emails.csv') as e where e.email = d.email);
But I have no PostgreSQL installed here to try; I'll verify it later.
If I understand you correctly, you have a CSV file with some field containing a KEY, which is used to search your PostgreSQL table.
I don't know what programming language you can use for this task, but, in general, you have two ways to solve the speed problem:
First method, programming:
Second method, plain SQL:
Which way you choose depends on your real task. For example, in my experience, I had to build an interface to load a price list into a database; before actually loading it, the imported XLS file had to be displayed with information about "current" and "new" prices, and, because of the large size of the XLS file, pagination was needed, so the variant with KEY IN (1,2,3,4,5,6) suited best.
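In SQL terms, the first approach means reading the keys from the CSV in your application and querying in manageable batches; a sketch, with the table name tbl, the column email and the literal values assumed for illustration:

```sql
-- keys read from the CSV by the application and sent in batches
SELECT *
FROM   tbl
WHERE  email IN ('a@example.com', 'b@example.com', 'c@example.com');
```

For 100,000 keys, splitting the list into batches of a few thousand keeps each statement at a reasonable size; still, for lists this large, the temp-table approach above generally performs better.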