How to reduce the size of a txt file created by Python?
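I have a table on a Netezza server with about 2M rows x 70 columns of numeric and categorical data, and I want to dump it to a .txt file using Python. I have done this before with SAS, and for my test case I get a txt file of about 450MB. I then tried a couple of things in Python.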
# One line at a time
import datetime
import pyodbc

startTime = datetime.datetime.now().replace(microsecond=0)
cnxn = pyodbc.connect('DSN=NZ_LAB')
cursor = cnxn.cursor()
c = cursor.execute("""SELECT * FROM MYTABLE""")
with open('dump_test_pyodbc.csv','wb') as csv:
    # header row built from the cursor metadata
    csv.write(','.join([g[0] for g in c.description])+'\n')
    while 1:
        a=c.fetchone()
        if not a:
            break
        csv.write(','.join([str(g) for g in a])+'\n')
cnxn.close()
endTime = datetime.datetime.now().replace(microsecond=0)
print "Time elapsed PYODBC:", endTime - startTime
>>Time elapsed PYODBC: 0:18:20
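A possible variation on the row-at-a-time loop, sketched for Python 3: fetch rows in batches with `fetchmany` and let the `csv` module do the joining. To stay runnable without a Netezza DSN, the sketch uses an in-memory `sqlite3` database as a stand-in for `pyodbc.connect('DSN=NZ_LAB')`. The table contents, the `dump_cursor` helper, and the batch size are all made up for illustration, and whether batching is actually faster with a given driver is something to measure rather than assume.

```python
import csv
import io
import sqlite3


def dump_cursor(cursor, out, batch_size=10000):
    """Write a DB-API cursor's result set to `out` as CSV; return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    # Header row from the cursor metadata, as in the original loop.
    writer.writerow([col[0] for col in cursor.description])
    total = 0
    while True:
        rows = cursor.fetchmany(batch_size)  # one call per batch, not per row
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total


# Stand-in for the Netezza connection (assumption: any DB-API 2.0 cursor works).
cnxn = sqlite3.connect(":memory:")
cnxn.execute("CREATE TABLE mytable (id TEXT, a REAL, b INTEGER)")
cnxn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [("x1", 2.49, 0), ("x2", 40.65, 0), ("x3", 63.31, 1)],
)
cursor = cnxn.execute("SELECT * FROM mytable ORDER BY id")

buffer = io.StringIO()  # a real dump would use open(path, "w", newline="")
row_count = dump_cursor(cursor, buffer, batch_size=2)
cnxn.close()
print(buffer.getvalue())
```

This changes only how rows are pulled and joined; the text written for each value is still whatever the driver's Python type renders as, so it does not by itself shrink the file.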
# Use Pandas chunksize
import pandas.io.sql as psql  # assumed: the post does not show where psql comes from

startTime = datetime.datetime.now().replace(microsecond=0)
cnxn = pyodbc.connect('DSN=NZ_LAB')
sql = ("""SELECT * FROM MYTABLE""")
df = psql.read_sql(sql, cnxn, chunksize=1000)
for k, chunk in enumerate(df):
    if k == 0:
        # first chunk creates the file and writes the header
        chunk.to_csv('dump_chunk.csv',index=False,mode='w')
    else:
        # later chunks are appended without repeating the header
        chunk.to_csv('dump_chunk.csv',index=False,mode='a',header=False)
endTime = datetime.datetime.now().replace(microsecond=0)
print "Time elapsed PANDAS:", endTime - startTime
cnxn.close()
>>Time elapsed PANDAS: 0:29:29
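Now on sizes: the Pandas method created a 690MB file and the fetchone() method created a 630MB file. Speed and size both seem to favor the fetchone() method but, size-wise, this is still far larger than the original SAS output. Any ideas on how to improve the Python methods to reduce the output size?
Edit: added examples --------------------
OK, it seems SAS does a better job of handling integers, which makes sense. I think this accounts for most of the size difference.
SAS: xxxxxx,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.49,40.65,63.31,1249.92 ...
Pandas: xxxxxx,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.49,40.65,63.31,1249.92 ...
fetchone(): xxxxxx,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.49,40.65,63.31,1249.92 ...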
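The three renderings above can be reproduced in a few lines, which makes the size difference easy to estimate. This sketch assumes (the post does not confirm it) that pyodbc hands NUMERIC columns back as `decimal.Decimal` and that Pandas holds them as floats. The run of 20 zero-valued columns is a made-up stand-in for the sample rows.

```python
from decimal import Decimal

zeros = 20  # hypothetical count of zero-valued columns in one row

# How each tool would render a zero, under the assumptions stated above.
as_int = str(0)                    # SAS-style output
as_float = str(0.0)                # float column, as written by Pandas
as_decimal = str(Decimal("0.00"))  # NUMERIC value with scale 2 from fetchone()


def row_bytes(cell, n=zeros):
    """Bytes used by n identical cells joined with commas."""
    return len(",".join([cell] * n))


sizes = {
    "int": row_bytes(as_int),
    "float": row_bytes(as_float),
    "decimal": row_bytes(as_decimal),
}
print(sizes)
```

For columns that are mostly zeros, the trailing ".0" or ".00" roughly doubles the bytes spent per row, which is consistent with the gap seen between the SAS and Python dumps.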
Edit 2: Solution ------------------------------------
I ended up removing the unnecessary decimals:
csv.write(','.join([str(g.strip()) if type(g)==str else '%g'%(g) for g in a])+'\n')
This brought the file size down to SAS levels.
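One caveat with `'%g'`: it keeps only six significant digits by default, so large or high-precision values are rounded or switched to exponent notation. Below is a minimal sketch of an alternative that strips trailing zeros without touching the significant digits. The `compact` helper is hypothetical and assumes each row holds only `str`, `int`, and finite `float` or `Decimal` values.

```python
from decimal import Decimal


def compact(value):
    """Render a cell as short text without dropping significant digits."""
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, Decimal):
        if value == value.to_integral_value():
            return str(int(value))             # Decimal('0.00') -> '0'
        return format(value.normalize(), "f")  # Decimal('2.490') -> '2.49'
    if isinstance(value, float) and value.is_integer():
        return str(int(value))                 # 0.0 -> '0'
    return str(value)                          # ints and non-integral floats


row = ["  abc ", Decimal("0.00"), 0.0, Decimal("2.490"), 1234567.89]
line = ",".join(compact(v) for v in row)
print(line)               # abc,0,0,2.49,1234567.89
print("%g" % 1234567.89)  # 1.23457e+06
```

If six significant digits are enough for the data at hand, the `'%g'` one-liner is simpler; the helper only matters when values such as large amounts or long identifiers stored as numbers must survive unchanged.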
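I was going to post this as a comment, but the text formatting will help.
My guess is that you are running into the difference between quoted and unquoted CSV files. SAS has an option to create CSV files without quotes. Here is an example: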
This Value,That Value,3,Other Value,423,985.32
I think the file you are getting is more accurate and will not cause problems with fields that contain commas. The same line, quoted:
"This Value","That Value","3","Other Value","423,985.32"
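As you can see, in the first (SAS) example, if the line is read into a spreadsheet, the last part is read as two different values, "423" and "985.32". In the second example it is clear that it is actually one value, "423,985.32". That is why the quoted format you are getting now (if I'm right) is more accurate and safer.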
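The quoting behaviour described here can be seen with Python's standard `csv` module. This is only a sketch: the row reuses the example values from this answer, the `render` helper is made up for illustration, and whether the SAS export or the Python dumps in the question actually used either mode is not confirmed.

```python
import csv
import io

row = ["This Value", "That Value", "3", "Other Value", "423,985.32"]


def render(fields, quoting):
    """Return one CSV line for `fields` under the given csv quoting mode."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(fields)
    return buf.getvalue().rstrip("\n")


unquoted = ",".join(row)                  # naive join: the comma splits the last field
minimal = render(row, csv.QUOTE_MINIMAL)  # quotes only fields that need it
quoted = render(row, csv.QUOTE_ALL)       # quotes every field

print(unquoted)
print(minimal)
print(quoted)

# Reading the lines back shows which forms preserve the five original fields.
fields_unquoted = len(next(csv.reader([unquoted])))
fields_minimal = len(next(csv.reader([minimal])))
print(fields_unquoted, fields_minimal)
```

`csv.QUOTE_MINIMAL` spends two extra bytes only on fields that contain the delimiter, so it stays close to the unquoted size while remaining unambiguous.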