
Python: reading over 1M small csv files and writing to a db

I have over a million snapshot files that I need to merge into a single file/db for analysis.

My attempt is in the code below. First, I read a small csv from a list of URLs, take a few columns, parse the date field from text to date, and write it to a sqlite database.

While this code works well enough over a small subset of files, it is too slow to iterate over a million CSVs.

I'm not sure how to improve performance, or even whether Python is the right tool for the job. Any help improving this code, or suggestions, will be much appreciated.

import pandas as pd
from sqlalchemy import create_engine
import datetime
import requests
import csv
import io

csv_database2 = create_engine('sqlite:///csv_database_test.db')

col_num = [0,8,9,12,27,31]

with open('url.csv', 'r') as line_list:
    reader = csv.DictReader(line_list)

    for line in reader:
        # download one snapshot csv per URL
        data = requests.get(line['URL'])
        df = pd.read_csv(io.StringIO(data.text), usecols=col_num, infer_datetime_format=True)
        df.columns.values[0] = 'DateTime'
        # parse text like "Mon Jan 01 00:00:00 2018" into a datetime
        df['ParseDateTime'] = [datetime.datetime.strptime(t, "%a %b %d %H:%M:%S %Y") for t in df.DateTime]
        # append the parsed rows to the SQLite table
        df.to_sql('LineList', csv_database2, if_exists='append')

IMHO Python is well suited for this task, and with simple modifications you can achieve your desired performance.

AFAICS there could be two bottlenecks that affect performance:

downloading the URLs

You download a single file at a time; if each download takes 0.2 seconds, downloading 1M files will take more than 2 days! I suggest you parallelize the downloads. Example code using concurrent.futures:

from concurrent.futures import ThreadPoolExecutor
import datetime
import io

import pandas as pd
import requests


def insert_url(line):
    """Download a single csv URL and insert it into SQLite."""
    data = requests.get(line['URL'])
    df = pd.read_csv(io.StringIO(data.text), usecols=col_num,
                     infer_datetime_format=True)
    df.columns.values[0] = 'DateTime'
    df['ParseDateTime'] = [
        datetime.datetime.strptime(t, "%a %b %d %H:%M:%S %Y") for t in
        df.DateTime]
    df.to_sql('LineList', csv_database2, if_exists='append')


# `lines` is the list of rows read from url.csv, as in the question's code
with ThreadPoolExecutor(max_workers=128) as pool:
    pool.map(insert_url, lines)

inserting into SQL

Take a look at how to optimize the SQL insertions in this SO answer.
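
As a minimal sketch of one such optimization (my own assumption, not necessarily what the linked answer describes), pandas' to_sql can batch rows into multi-row INSERT statements via its chunksize and method arguments; the engine and table name mirror the question's code, and the placeholder DataFrame stands in for one parsed snapshot:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///csv_database_test.db')
# placeholder frame standing in for one parsed snapshot file
df = pd.DataFrame({'DateTime': ['Mon Jan 01 00:00:00 2018'], 'Value': [1]})

# method='multi' packs many rows into a single INSERT statement;
# chunksize keeps each statement under SQLite's bound-variable limit
# (rows_per_chunk * n_columns must stay below ~999 on older SQLite builds)
df.to_sql('LineList', engine, if_exists='append', index=False,
          chunksize=100, method='multi')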

Further guidance

  • I would start with the parallel requests, as that seems to be the larger bottleneck
  • run a profiler to get a better idea of where your code spends most of its time (a minimal sketch follows below)
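
A minimal profiling sketch using the standard-library cProfile, assuming the insert_url function and lines list from the code above; it profiles a single representative URL and prints the 20 most expensive calls by cumulative time, which should show whether the time goes to the HTTP request, the CSV parsing, or the SQL insert:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
insert_url(lines[0])   # profile one representative download + insert
profiler.disable()

pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)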
