
How to add filename and date when importing data from CSV to postgres DB using Python

I need to update each row with the CSV filename and the time the data was inserted into the database. I can use the code below to insert data from a CSV into the database:

with open('my.csv', 'r') as f:
    next(f)
    cur.copy_from(f, 'csv_import', sep=',')

But for my requirement, along with the CSV data there are 2 more columns that need to be populated:

  1. the filename of the CSV file
  2. a timestamp of when the data was loaded

How can we achieve this?

The timestamp value can be handled with a trigger, while the CSV file itself would need to be managed from the operating system; if it is Linux, the load could be added to a batch script or a cron schedule. Trigger:

CREATE OR REPLACE FUNCTION csvvalues()
RETURNS TRIGGER AS
$body$
BEGIN
  NEW.timestamp = NOW();
  RETURN NEW;
END
$body$
LANGUAGE plpgsql;

CREATE TRIGGER csvvalues BEFORE INSERT ON csv_import
FOR EACH ROW EXECUTE FUNCTION csvvalues();

And the command line to be executed in the file's location:

psql -U postgres -h 192.168.1.100 -p 5432 -d database -c "\copy csv_import from 'my.csv' with (format csv, header)"
mv file1.csv file2.csv

If this runs on a machine other than the server, you can install psql for the current PostgreSQL version and run the command line from whatever operating system is being used.

Setup:

cat test.csv
1,john123@gmail.com,"John Stokes"
2,emily123@gmail.com,"Emily Ray"

create table csv_add(id integer, mail varchar, name varchar, file_name varchar, ts_added timestamptz);


Code to add the file name and timestamp:

import csv
from datetime import datetime
from io import StringIO

import psycopg2

con = psycopg2.connect(dbname="test", host='localhost', user='postgres', port=5432)

cur = con.cursor()

with open('test.csv') as f:
    reader = csv.reader(f)
    file_name = f.name
    out = StringIO()
    for row in reader:
        out.write(','.join(row + [file_name, datetime.now().isoformat()])+'\n')
    out.seek(0)
    cur.copy_from(out, 'csv_add', sep=',')
    con.commit()


select * from csv_add ;
 id |        mail        |    name     | file_name |            ts_added            
----+--------------------+-------------+-----------+--------------------------------
  1 | john123@gmail.com  | John Stokes | test.csv  | 12/18/2022 09:58:20.475244 PST
  2 | emily123@gmail.com | Emily Ray   | test.csv  | 12/18/2022 09:58:20.475256 PST
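
One caveat with the loop above: joining fields with `','` corrupts any value that itself contains a comma (e.g. a name stored as `"Stokes, John"`). A safer sketch re-serializes each row with `csv.writer`, which quotes such fields; the helper name here is mine, not from the answer:

```python
import csv
from io import StringIO

def rows_with_metadata(rows, file_name, ts):
    # Append the file name and timestamp to each row, then let
    # csv.writer quote any field containing commas or quotes.
    out = StringIO()
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(row + [file_name, ts])
    out.seek(0)
    return out

buf = rows_with_metadata(
    [['1', 'john123@gmail.com', 'Stokes, John']],
    'test.csv', '2022-12-18T09:58:20')
```

Since the result is real CSV (with quoted fields), load it with `cur.copy_expert("COPY csv_add FROM STDIN WITH (FORMAT csv)", buf)` rather than plain `copy_from`, which does not interpret quotes.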



Sorry, gonna stick to pseudo-code (hopefully I got the raw SQL syntax right; I don't usually write raw SQL).

What I have often seen done is to use 2 tables: one a staging/import table, one the real table.

So...

with open('my.csv', 'r') as f:    
    next(f)
    cur.copy_from(f, 'csv_import', sep=',')

cur.execute("""
    insert into actual (f1, f2, ts, filename)
    select f1, f2, %(ts)s, %(filename)s
    from csv_import""",
    dict(ts=datetime.now(), filename="my.csv")
)

cur.execute("delete from csv_import")

Don't try to get fancy with * on the right side to skip the field list - that depends on column order in the db, which may change with later ALTER TABLEs. Been there, done that - picking up after someone did just that.

This supports higher volumes easily and does not clutter your db with triggers. You can also reuse the staging table for feeds from elsewhere, say an email gateway.
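
Putting the staging-table idea together, here is a minimal runnable sketch. The table and column names follow the pseudo-code above and are assumptions about your schema; the recording cursor is only a stand-in so the call sequence can be exercised without a live database:

```python
from datetime import datetime, timezone
from io import StringIO

def load_csv_via_staging(cur, csv_path, csv_text):
    # Bulk-copy the raw rows into the staging table, then promote
    # them into the real table tagged with file name and load time,
    # and finally clear the staging table for the next feed.
    cur.copy_from(StringIO(csv_text), 'csv_import', sep=',')
    cur.execute(
        "insert into actual (f1, f2, ts, filename) "
        "select f1, f2, %(ts)s, %(filename)s from csv_import",
        dict(ts=datetime.now(timezone.utc), filename=csv_path))
    cur.execute("delete from csv_import")

class RecordingCursor:
    # Stand-in for a psycopg2 cursor: records which calls were made.
    def __init__(self):
        self.calls = []
    def copy_from(self, f, table, sep='\t'):
        self.calls.append(('copy_from', table))
    def execute(self, sql, params=None):
        self.calls.append(('execute', sql.split()[0]))

cur = RecordingCursor()
load_csv_via_staging(cur, 'my.csv', '1,a\n2,b\n')
```

With a real psycopg2 cursor you would follow this with `con.commit()`, exactly as in the earlier example.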
