Fastest way to write database table to file in Python

I'm trying to extract huge amounts of data from a DB and write it to a csv file. I'm trying to find the fastest way to do this. I found that running writerows on the result of a fetchall was 40% slower than the code below.

import csv

with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    # Header row: column names from the cursor description
    writer.writerow([i[0] for i in cursor.description])

    count = 0
    builder = []
    row = cursor.fetchone()
    # One tab between fields, a newline after the last field
    # (assumes at least one row; fetchone returns None on an empty result)
    DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
    while row:
        count += 1
        # Add row with delimiters to builder
        builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
        if count == 1000:
            # Flush the buffered rows to disk every 1000 rows
            count = 0
            f.write(''.join(builder))
            builder[:] = []
        row = cursor.fetchone()
    f.write(''.join(builder))

Edit: The database I'm using is unique to the small company that I'm working for, so unfortunately I can't provide much information on that front. I'm using jpype to connect to the database, since the only means of connecting is via a JDBC driver. I'm running CPython 2.7.5; I'd love to use PyPy, but it doesn't work with Pandas.
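For reference, one common way to drive a JDBC driver from CPython is jaydebeapi, which sits on top of jpype. The sketch below assumes that library; the driver class, JDBC URL, credentials, and jar path are all placeholders, not the real ones:

import jaydebeapi

# Placeholder driver class, JDBC URL, credentials, and jar path
conn = jaydebeapi.connect('com.example.jdbc.Driver',
                          'jdbc:example://dbhost:1234/mydb',
                          ['user', 'password'],
                          '/path/to/driver.jar')
cursor = conn.cursor()
cursor.execute("SELECT * FROM table")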

Since I'm extracting such a large number of rows, I'm hesitant to use fetchall for fear that I'll run out of memory. row has comparable performance and is much easier on the eyes, so I think I'll use that. Thanks a bunch!

With the little you've given us to go on, it's hard to be more specific, but…

I've wrapped your code up as a function, and written three alternative versions:
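The wrapper for your code, manual, keeps the same logic (a sketch, assuming the same 1000-row batching as in the question):

def manual():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        count = 0
        builder = []
        row = cursor.fetchone()
        delimiters = ['\t'] * (len(row) - 1) + ['\n']
        while row:
            count += 1
            builder += [str(item) for pair in zip(row, delimiters) for item in pair]
            if count == 1000:
                count = 0
                f.write(''.join(builder))
                builder[:] = []
            row = cursor.fetchone()
        f.write(''.join(builder))

And the three alternatives: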

def row():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        for row in cursor:
            writer.writerow(row)

def rows():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor)

def rowsall():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor.fetchall())

Notice that the last one is the one you say you tried.

Now, I wrote this test driver:

import csv
import random
import sqlite3
import string
import timeit

def randomname():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(30))

# 10000 rows of random names in an in-memory SQLite database
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR)')
db.executemany('INSERT INTO mytable (name) VALUES (?)',
               [[randomname()] for _ in range(10000)])

filename = 'db.csv'

# Time one run of each version
for f in manual, row, rows, rowsall:
    t = timeit.timeit(f, number=1)
    print('{:<10} {}'.format(f.__name__, t))

And here are the results:

manual     0.055549702141433954
row        0.03852885402739048
rows       0.03992213006131351
rowsall    0.02850699401460588

So, your code takes nearly twice as long as calling fetchall and writerows in my test!

When I repeat a similar test with other databases, however, rowsall is anywhere from 20% faster to 15% slower than manual (never 40% slower, but as much as 15%)… but row or rows is always significantly faster than manual.

I think the explanation is that your custom code is significantly slower than csv.writerows, but that in some databases, using fetchall instead of fetchone (or just iterating the cursor) slows things down significantly. The reason this isn't true with an in-memory sqlite3 database is that fetchone is doing all of the same work as fetchall and then feeding you the list one row at a time; with a remote database, fetchone may do anything from fetching all the rows, to fetching a buffer at a time, to fetching one row at a time, making it potentially much slower or faster than fetchall, depending on your data.
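If running out of memory with fetchall is the worry, the standard middle ground is fetchmany, which pulls a fixed-size batch per call (fetchmany is part of DB-API 2.0, but how much a driver actually buffers per call is driver-specific). A sketch in the same framework:

def rowsmany(batchsize=1000):
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        while True:
            # An empty batch means the result set is exhausted
            batch = cursor.fetchmany(batchsize)
            if not batch:
                break
            writer.writerows(batch)

Depending on the driver, the batch size may matter more than anything on the Python side.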

But for a really useful explanation, you'd have to tell us exactly which database and library you're using (and which Python version: CPython 3.3.2's csv module seems to be a lot faster than CPython 2.7.5's, and PyPy 2.1/2.7.2 seems to be faster than CPython 2.7.5 as well, but then either one might also run your code faster too…) and so on.
