Fastest way to write database table to file in python
I'm trying to extract huge amounts of data from a DB and write it to a CSV file. I'm trying to find out what the fastest way would be to do this. I found that running writerows on the result of a fetchall was 40% slower than the code below.
with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    writer.writerow([i[0] for i in cursor.description])

    count = 0
    builder = []
    row = cursor.fetchone()
    DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
    while row:
        count += 1
        # Add row with delimiters to builder
        builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
        if count == 1000:
            count = 0
            f.write(''.join(builder))
            builder[:] = []
        row = cursor.fetchone()
    f.write(''.join(builder))
Edit: The database I'm using is unique to the small company that I'm working for, so unfortunately I can't provide much information on that front. I'm using jpype to connect with the database since the only means of connecting is via a jdbc driver. I'm running cPython 2.7.5; would love to use PyPy but it doesn't work with Pandas.

Since I'm extracting such a large number of rows, I'm hesitant to use fetchall for fear that I'll run out of memory. row has comparable performance and is much easier on the eyes, so I think I'll use that. Thanks a bunch!
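If memory is the main worry with fetchall, the DB-API also specifies cursor.fetchmany, which offers a middle ground: pull a bounded batch of rows at a time and hand each batch to writerows. A minimal sketch of that idea, using sqlite3 for illustration (the write_chunked name and the chunk size are my own, not from the question):

```python
import csv
import sqlite3

def write_chunked(cursor, filename, chunk_size=1000):
    # Memory stays bounded by chunk_size rows instead of the full result set,
    # while writerows still does the per-row formatting in C.
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow([col[0] for col in cursor.description])
        while True:
            chunk = cursor.fetchmany(chunk_size)
            if not chunk:
                break
            writer.writerows(chunk)

# Small demonstration table
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)')
db.executemany('INSERT INTO t (name) VALUES (?)', [['alpha'], ['beta']])
write_chunked(db.execute('SELECT * FROM t'), 'out.tsv')
```

Whether this actually beats iterating the cursor depends on how the particular driver buffers rows, so it's worth timing against the alternatives below.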
With the little you've given us to go on, it's hard to be more specific, but…

I've wrapped your code up as a function, and written three alternative versions:
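The wrapped original, called manual in the timings below, isn't shown here; from context it is presumably just the question's code dropped into a function that uses the same db and filename globals as the alternatives. A reconstruction (the sqlite3 setup at the top is only scaffolding so the sketch runs standalone):

```python
import csv
import sqlite3

# Scaffolding so the function below is runnable on its own (assumption,
# mirroring the test driver in the answer).
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)')
db.executemany('INSERT INTO mytable (name) VALUES (?)', [['a'], ['b']])
filename = 'db.csv'

def manual():
    # The question's code, wrapped unchanged apart from using
    # db.execute and 'w' mode like the alternatives.
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        count = 0
        builder = []
        row = cursor.fetchone()
        DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
        while row:
            count += 1
            builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
            if count == 1000:
                count = 0
                f.write(''.join(builder))
                builder[:] = []
            row = cursor.fetchone()
        f.write(''.join(builder))

manual()
```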
def row():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        for row in cursor:
            writer.writerow(row)

def rows():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor)

def rowsall():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor.fetchall())
Notice that the last one is the one you say you tried.

Now, I wrote this test driver:
import csv, random, sqlite3, string, timeit

def randomname():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(30))

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR)')
db.executemany('INSERT INTO mytable (name) VALUES (?)',
               [[randomname()] for _ in range(10000)])
filename = 'db.csv'

for f in manual, row, rows, rowsall:
    t = timeit.timeit(f, number=1)
    print('{:<10} {}'.format(f.__name__, t))
And here are the results:
manual 0.055549702141433954
row 0.03852885402739048
rows 0.03992213006131351
rowsall 0.02850699401460588
So, your code takes nearly twice as long as calling fetchall and writerows in my test!
When I repeat a similar test with other databases, however, rowsall is anywhere from 20% faster to 15% slower than manual (never 40% slower, but as much as 15%)… but row or rows is always significantly faster than manual.
I think the explanation is that your custom code is significantly slower than csv.writerows, but that in some databases, using fetchall instead of fetchone (or just iterating the cursor) slows things down significantly. The reason this isn't true with an in-memory sqlite3 database is that fetchone is doing all of the same work as fetchall and then feeding you the list one at a time; with a remote database, fetchone may do anything from fetching all the lines, to fetching a buffer at a time, to fetching a row at a time, making it potentially much slower or faster than fetchall, depending on your data.
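When a driver really does pay a round trip per fetchone, you can still keep the writerows fast path by feeding it a generator that pulls driver-level batches via fetchmany. A sketch under that assumption (the batched_rows helper is my own name, demonstrated with sqlite3):

```python
import csv
import sqlite3

def batched_rows(cursor, size=1000):
    # Yield rows lazily, fetching them from the driver size rows at a
    # time: writerows sees a plain iterator, memory stays bounded, and
    # any per-fetch round-trip cost is amortized over each batch.
    while True:
        batch = cursor.fetchmany(size)
        if not batch:
            return
        for row in batch:
            yield row

# Demonstration with a tiny table
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)')
db.executemany('INSERT INTO mytable (name) VALUES (?)', [['x'], ['y'], ['z']])
cursor = db.execute('SELECT * FROM mytable')
with open('db.csv', 'w') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(batched_rows(cursor, size=2))
```

With sqlite3 this should perform about the same as iterating the cursor directly; the batching only matters for drivers whose fetchone is genuinely expensive.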
But for a really useful explanation, you'd have to tell us exactly which database and library you're using (and which Python version: CPython 3.3.2's csv module seems to be a lot faster than CPython 2.7.5's, and PyPy 2.1/2.7.2 seems to be faster than CPython 2.7.5 as well, but then either one also might run your code faster too…) and so on.