
How to add filename and date when importing data from a CSV to a Postgres DB using Python

I need to update each row with the CSV filename and the time the data was inserted into the database. I can use the code below to insert data from a CSV into the database:

with open('my.csv', 'r') as f:
    next(f)
    cur.copy_from(f, 'csv_import', sep=',')

but for my requirement, along with the CSV data there are 2 more columns that need to be populated:

  1. filename of the csv file
  2. timestamp when data was loaded

How can we achieve this?

The timestamp value can be set with a trigger. The CSV filename would need to be handled from the operating system; on Linux, that step can be added to the batch file or cron schedule that runs the load. Trigger:

CREATE OR REPLACE FUNCTION csvvalues()
RETURNS TRIGGER AS
$body$
BEGIN
    NEW.timestamp = NOW();
    RETURN NEW;
END
$body$
LANGUAGE plpgsql;

CREATE TRIGGER csvvalues BEFORE INSERT ON csv_import
FOR EACH ROW EXECUTE FUNCTION csvvalues();

and the command line to be executed from the file's location:

psql -U postgres -h 192.168.1.100 -p 5432 -d database -c "\copy csv_import FROM 'my.csv' WITH (FORMAT csv, HEADER)"
mv file1.csv file2.csv

If you are running this from outside the server, you can install the psql client for your current PostgreSQL version and run the command line from whatever operating system you are using.
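The psql invocation above can also be scripted from Python. A minimal sketch that builds the argument list (the host, port, and database name are the placeholder values from the command above):

```python
import subprocess  # only needed for the commented-out run() call

def build_copy_command(csv_path, table, host="192.168.1.100", port=5432,
                       user="postgres", dbname="database"):
    """Build the psql argument list for a client-side \\copy of a CSV file."""
    copy_sql = f"\\copy {table} FROM '{csv_path}' WITH (FORMAT csv, HEADER)"
    return ["psql", "-U", user, "-h", host, "-p", str(port),
            "-d", dbname, "-c", copy_sql]

cmd = build_copy_command("my.csv", "csv_import")
# subprocess.run(cmd, check=True)  # needs a reachable PostgreSQL server
```

Passing the arguments as a list avoids shell quoting problems around the `\copy` string.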

Setup:

cat test.csv
1,john123@gmail.com,"John Stokes"
2,emily123@gmail.com,"Emily Ray"

create table csv_add(id integer, mail varchar, name varchar, file_name varchar, ts_added timestamptz);


Code to add file name and timestamp:

import csv
from datetime import datetime
from io import StringIO

import psycopg2

con = psycopg2.connect(dbname="test", host='localhost', user='postgres', port=5432)

cur = con.cursor()

with open('test.csv') as f:
    reader = csv.reader(f)
    file_name = f.name
    out = StringIO()
    for row in reader:
        out.write(','.join(row + [file_name, datetime.now().isoformat()])+'\n')
    out.seek(0)
    cur.copy_from(out, 'csv_add', sep=',')
    con.commit()


select * from csv_add ;
 id |        mail        |    name     | file_name |            ts_added            
----+--------------------+-------------+-----------+--------------------------------
  1 | john123@gmail.com  | John Stokes | test.csv  | 12/18/2022 09:58:20.475244 PST
  2 | emily123@gmail.com | Emily Ray   | test.csv  | 12/18/2022 09:58:20.475256 PST
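One caveat: joining fields with `','.join(...)` drops the CSV quoting, so a name containing a comma (e.g. `"Stokes, John"`) would break the load, because `copy_from`'s `sep` does not understand quotes. A variant that re-quotes with `csv.writer` and would be loaded with psycopg2's `copy_expert` and `COPY ... WITH (FORMAT csv)` instead:

```python
import csv
from datetime import datetime
from io import StringIO

def augment_csv(infile, file_name):
    """Append the file name and a load timestamp to every row, letting
    csv.writer re-quote fields so embedded commas stay intact."""
    out = StringIO()
    writer = csv.writer(out)
    ts = datetime.now().isoformat()
    for row in csv.reader(infile):
        writer.writerow(row + [file_name, ts])
    out.seek(0)
    return out

buf = augment_csv(StringIO('1,john123@gmail.com,"Stokes, John"\n'), 'test.csv')
# cur.copy_expert("COPY csv_add FROM STDIN WITH (FORMAT csv)", buf)
```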



Sorry, gonna stick to pseudo-code (hopefully got the right raw sql syntax, don't usually write sql raw).

What I have often seen done is to use 2 tables, one a staging, import table, one the real table.

So...

with open('my.csv', 'r') as f:    
    next(f)
    cur.copy_from(f, 'csv_import', sep=',')

from datetime import datetime

cur.execute("""
    insert into actual (f1, f2, ts, filename)
    select f1, f2, %(ts)s, %(filename)s
    from csv_import""",
    dict(ts=datetime.now(), filename="my.csv"),
)

cur.execute("delete from csv_import")

Don't try to get fancy with * on the right side to skip the field list - that depends on column order in db, which may change with Alter Tables later. Been there, done that - picking up after someone did just that.

This supports higher volumes easily and does not clutter your db with triggers. You can also reuse the staging table to get feeds from elsewhere, say an email gateway.
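The steps above can be collected into one function. A sketch keeping the hypothetical column names `f1`/`f2` and table name `actual` from the pseudo-code:

```python
from datetime import datetime

def load_via_staging(cur, f, filename, staging="csv_import", target="actual"):
    """Copy a CSV stream into the staging table, move the rows into the
    real table with filename and load timestamp attached, then empty the
    staging table for the next feed."""
    next(f)  # skip the CSV header row
    cur.copy_from(f, staging, sep=",")
    cur.execute(
        f"INSERT INTO {target} (f1, f2, ts, filename) "
        f"SELECT f1, f2, %(ts)s, %(filename)s FROM {staging}",
        dict(ts=datetime.now(), filename=filename),
    )
    cur.execute(f"DELETE FROM {staging}")

# with open('my.csv') as f:
#     load_via_staging(cur, f, 'my.csv')
#     con.commit()
```

Keeping the staging table name a parameter makes it easy to reuse the same flow for other feeds.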
