Fastest way to write database table to file in Python

I'm trying to extract huge amounts of data from a DB and write it to a csv file. I'm trying to find the fastest way to do this. I found that running writerows on the result of a fetchall was 40% slower than the code below.

import csv

with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    # Header row: column names from the cursor description
    writer.writerow([i[0] for i in cursor.description])

    count = 0
    builder = []
    row = cursor.fetchone()
    # One tab between fields, a newline after the last field
    # (assumes at least one row; fetchone returns None on an empty result)
    DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
    while row:
        count += 1
        # Add row with delimiters to builder
        builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
        if count == 1000:
            # Flush the buffered rows to disk every 1000 rows
            count = 0
            f.write(''.join(builder))
            builder[:] = []
        row = cursor.fetchone()
    f.write(''.join(builder))

Edit: The database I'm using is unique to the small company that I'm working for, so unfortunately I can't provide much information on that front. I'm using jpype to connect to the database, since the only means of connecting is via a JDBC driver. I'm running CPython 2.7.5; I'd love to use PyPy, but it doesn't work with Pandas.
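For reference, one common way to drive a JDBC driver from CPython is jaydebeapi, which sits on top of jpype. The sketch below assumes that library; the driver class, JDBC URL, credentials, and jar path are all placeholders, not the real ones:

import jaydebeapi

# Placeholder driver class, JDBC URL, credentials, and jar path
conn = jaydebeapi.connect('com.example.jdbc.Driver',
                          'jdbc:example://dbhost:1234/mydb',
                          ['user', 'password'],
                          '/path/to/driver.jar')
cursor = conn.cursor()
cursor.execute("SELECT * FROM table")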

Since I'm extracting such a large number of rows, I'm hesitant to use fetchall for fear that I'll run out of memory. row has comparable performance and is much easier on the eyes, so I think I'll use that. Thanks a bunch!

With the little you've given us to go on, it's hard to be more specific, but…

I've wrapped your code up as a function, and written three alternative versions:
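The wrapper for your code, manual, keeps the same logic (a sketch, assuming the same 1000-row batching as in the question):

def manual():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        count = 0
        builder = []
        row = cursor.fetchone()
        delimiters = ['\t'] * (len(row) - 1) + ['\n']
        while row:
            count += 1
            builder += [str(item) for pair in zip(row, delimiters) for item in pair]
            if count == 1000:
                count = 0
                f.write(''.join(builder))
                builder[:] = []
            row = cursor.fetchone()
        f.write(''.join(builder))

And the three alternatives: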

def row():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        for row in cursor:
            writer.writerow(row)

def rows():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor)

def rowsall():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor.fetchall())

Notice that the last one is the one you say you tried.

Now, I wrote this test driver:

import csv
import random
import sqlite3
import string
import timeit

def randomname():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(30))

# 10000 rows of random names in an in-memory SQLite database
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR)')
db.executemany('INSERT INTO mytable (name) VALUES (?)',
               [[randomname()] for _ in range(10000)])

filename = 'db.csv'

# Time one run of each version
for f in manual, row, rows, rowsall:
    t = timeit.timeit(f, number=1)
    print('{:<10} {}'.format(f.__name__, t))

And here are the results:

manual     0.055549702141433954
row        0.03852885402739048
rows       0.03992213006131351
rowsall    0.02850699401460588

So, your code takes nearly twice as long as calling fetchall and writerows in my test!

When I repeat a similar test with other databases, however, rowsall is anywhere from 20% faster to 15% slower than manual (never 40% slower, but as much as 15%)… but row or rows is always significantly faster than manual.

I think the explanation is that your custom code is significantly slower than csv.writerows, but that in some databases, using fetchall instead of fetchone (or just iterating the cursor) slows things down significantly. The reason this isn't true with an in-memory sqlite3 database is that fetchone is doing all of the same work as fetchall and then feeding you the list one row at a time; with a remote database, fetchone may do anything from fetching all the rows, to fetching a buffer at a time, to fetching one row at a time, making it potentially much slower or faster than fetchall, depending on your data.
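If running out of memory with fetchall is the worry, the standard middle ground is fetchmany, which pulls a fixed-size batch per call (fetchmany is part of DB-API 2.0, but how much a driver actually buffers per call is driver-specific). A sketch in the same framework:

def rowsmany(batchsize=1000):
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        while True:
            # An empty batch means the result set is exhausted
            batch = cursor.fetchmany(batchsize)
            if not batch:
                break
            writer.writerows(batch)

Depending on the driver, the batch size may matter more than anything on the Python side.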

But for a really useful explanation, you'd have to tell us exactly which database and library you're using (and which Python version: CPython 3.3.2's csv module seems to be a lot faster than CPython 2.7.5's, and PyPy 2.1/2.7.2 seems to be faster than CPython 2.7.5 as well, but then either one might also run your code faster too…) and so on.
