
Import Multiple CSV Files to Postgresql

I am currently learning how to code, and I've run into a challenge that I have been trying to solve for the past few days.

I have over 2000 CSV files that I would like to import into a particular postgresql table at once, instead of using the import data function in pgAdmin 4, which only allows you to import one CSV file at a time. How should I go about doing this? I am using a Windows operating system.

The simple way is to use Cygwin or an Ubuntu shell inside Windows (e.g. WSL) to run a script like this:

all_files=("file_1.csv" "file_2.csv") # or build the list from a glob, as shown below

dir_name=<path_to_files>

export PGUSER=<username_here>
export PGPASSWORD=<password_here>
export PGHOST=localhost
export PGPORT=5432
db_name=<dbname_here>

echo "write db"
for file in "${all_files[@]}"; do
  # \copy runs client-side, so psql (not the server) reads each CSV;
  # <table_name_here> is a placeholder for your target table
  psql -d "$db_name" -c "\copy <table_name_here> FROM '$dir_name/$file' WITH (FORMAT csv)" >/dev/null
done
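
If the directory holds only the CSVs you want to load, you can let the shell build the file list with a glob instead of naming 2000 files by hand. A sketch using the same placeholders as above; note the glob already expands to full paths, so $dir_name is not prefixed again:

all_files=( "$dir_name"/*.csv )
for file in "${all_files[@]}"; do
  psql -d "$db_name" -c "\copy <table_name_here> FROM '$file' WITH (FORMAT csv)" >/dev/null
done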

If you want to do this purely in Python, I've given one approach below. You may not need to chunk the list (you might be able to hold all of the files in memory at once rather than working in batches). It's also possible that the files are all wildly different sizes, in which case you'd need something more sophisticated than batching to keep the in-memory file object from outgrowing your RAM. Alternatively, you might choose to do this in 2000 separate transactions, but I suspect some kind of batching will be faster (untested).

import csv
import io
import os
import psycopg2

CSV_DIR = 'the_csv_folder/' # Relative path here, might need to be an absolute path

def chunks(l, n):
    """ 
    https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n = max(1, n)
    return [l[i:i+n] for i in range(0, len(l), n)]


# Get a list of all the CSV files in the directory
all_files = os.listdir(CSV_DIR)

# Chunk the list of files. Let's go with 100 files per chunk, can be changed
chunked_file_list = chunks(all_files, 100)

# Iterate the chunks and aggregate the files in each chunk into a single
# in-memory file
for chunk in chunked_file_list:

    # This is the file to aggregate into
    string_buffer = io.StringIO()
    csv_writer = csv.writer(string_buffer)

    for file in chunk:
        with open(CSV_DIR + file, newline='') as infile:
            reader = csv.reader(infile)
            # csv.reader has no readlines(); collect the parsed rows into a list
            data = list(reader)

        # Transfer the read data to the aggregated file
        csv_writer.writerows(data)

    # Now we have aggregated the chunk, copy the file to Postgres
    with psycopg2.connect(dbname='the_database_name', 
                          user='the_user_name',
                          password='the_password', 
                          host='the_host') as conn:
        c = conn.cursor()

        # Headers need to be the table field names, in the order they
        # appear in the csv ('...' is a placeholder for your remaining columns)
        headers = ['first_name', 'last_name', ...]

        # copy_from reads from the current position, so rewind the buffer,
        # then upload the data as though it was a file
        string_buffer.seek(0)
        c.copy_from(string_buffer, 'the_table_name', sep=',', columns=headers)
        conn.commit()

