
How to reduce the size of a txt file created by Python?

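I have a table on a Netezza server with roughly 2M rows x 70 columns of numeric and categorical data, and I want to dump it to a .txt file using Python. I have done this before with SAS, and in my test case I get a txt file of about 450MB. With Python, I have tried a couple of things.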

# One line at a time
import datetime
import pyodbc

startTime = datetime.datetime.now().replace(microsecond=0)

cnxn = pyodbc.connect('DSN=NZ_LAB')
cursor = cnxn.cursor()
c = cursor.execute("""SELECT * FROM MYTABLE""")

with open('dump_test_pyodbc.csv','wb') as csv:
    csv.write(','.join([g[0] for g in c.description])+'\n')
    while 1:
        a=c.fetchone()
        if not a:
            break
        csv.write(','.join([str(g) for g in a])+'\n')
cnxn.close()

endTime = datetime.datetime.now().replace(microsecond=0)
print "Time elapsed PYODBC:", endTime - startTime

>>Time elapsed PYODBC: 0:18:20



# Use Pandas chunksize
import datetime
import pyodbc
import pandas.io.sql as psql

startTime = datetime.datetime.now().replace(microsecond=0)
cnxn = pyodbc.connect('DSN=NZ_LAB')

sql = ("""SELECT * FROM MYTABLE""")

df = psql.read_sql(sql, cnxn, chunksize=1000)

for k, chunk in enumerate(df):
    if k == 0:
        chunk.to_csv('dump_chunk.csv',index=False,mode='w')
    else:
        chunk.to_csv('dump_chunk.csv',index=False,mode='a',header=False)

endTime = datetime.datetime.now().replace(microsecond=0)
print "Time elapsed PANDAS:", endTime - startTime
cnxn.close()

>>Time elapsed PANDAS: 0:29:29

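Now the size: the Pandas approach created a 690MB file, and the fetchone() approach created a 630MB file. Both speed and size favor the fetchone() approach, but size-wise the result is still much larger than the original SAS output. Any ideas on how to improve the Python approaches to reduce the output size?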

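EDIT: ADDING EXAMPLES --------------------

Okay, it seems SAS does a better job of handling integers, which makes sense. I think this is the biggest factor behind the size difference.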

SAS:xxxxxx,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.49,40.65,63.31, 1249.92 ...

Pandas:xxxxxx,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.49,40.65,63.31, 1249.92 ...

fetchone():xxxxxx,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.49,40.65, 63.31,1249.92 ...
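
A rough back-of-the-envelope check of how much those padded zeros cost. This is only a sketch: it assumes every one of the roughly 2M rows carries about 20 zero-valued columns like the sample rows above, which may not hold for the real table. The constants and the function name `extra_megabytes` are illustrative.

```python
# Assumptions taken from the samples above; adjust to the real table.
ROWS = 2_000_000        # "about 2M rows"
ZERO_COLUMNS = 20       # zero-valued fields per row, as in the sample rows


def extra_megabytes(rendered, baseline="0"):
    """Extra file size (in MB) caused by writing a zero as `rendered`
    instead of the shorter `baseline`, over all rows and zero columns."""
    extra_chars_per_field = len(rendered) - len(baseline)
    return ROWS * ZERO_COLUMNS * extra_chars_per_field / 1e6


pandas_overhead = extra_megabytes("0.0")     # Pandas writes 0.0
fetchone_overhead = extra_megabytes("0.00")  # fetchone() yields 0.00

print(pandas_overhead, fetchone_overhead)
```

Under those assumptions the padded zeros alone account for a substantial share of the gap to the 450MB SAS file; any remaining difference would have to come from other columns.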

EDIT 2: SOLUTION ------------------------------------

I ended up stripping the unnecessary decimals:

csv.write(','.join([str(g.strip()) if type(g)==str else '%g'%(g) for g in a])+'\n')

This brought the file size down to the SAS level.
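
One caveat with `'%g'`: by default it keeps only 6 significant digits, so a value such as 123456.78 would be written as 123457. If that rounding matters, a variant along the following lines drops trailing zeros without touching the significant digits. It is a minimal sketch in Python 3, not the method used above; the helper name `compact` is hypothetical, and it assumes the driver returns numeric columns as `float` or `decimal.Decimal`.

```python
from decimal import Decimal


def compact(value):
    """Render one field as short as possible without losing digits."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, Decimal):
        text = format(value, "f")   # plain notation, keeps every digit
    elif isinstance(value, float):
        text = repr(value)          # shortest string that round-trips
    else:
        return str(value)           # integers and anything else
    # Strip trailing zeros only after a decimal point, and leave
    # exponent notation (e.g. 1e+20) untouched.
    if "." in text and "e" not in text.lower():
        text = text.rstrip("0").rstrip(".")
    return text


row = ("xxxxxx ", Decimal("0.00"), 0.0, 0, 2.49, Decimal("1249.92"), 123456.78)
line = ",".join(compact(v) for v in row)
print(line)
```

The trade-off is a slightly longer formatter in exchange for not rounding large or high-precision values.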

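I was going to post this as a comment, but the text formatting will help.

My guess is that you are running into the difference between quoted and unquoted CSV files. SAS has an option to create CSV files without quotes. Here is an example: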

This Value,That Value,3,Other Value,423,985.32

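I would argue that the file you are getting is more accurate and will not cause problems for fields that contain commas. The same row, quoted: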

"This Value","That Value","3","Other Value","423,985.32"

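As you can see, in the first (SAS) example, if the row is read into a spreadsheet, the last field will be read as two different values, "423" and "985.32". In the second example, it is clear that it is actually a single value, "423,985.32". That is why the quoted format you are getting now (if I am right) is more accurate and safer.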
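
For illustration, Python's standard `csv` module can show the size cost of the two quoting styles on the example row. This is only a sketch in Python 3; the helper name `render` is hypothetical. With `csv.QUOTE_MINIMAL` only the field that contains a comma gets quoted, which keeps the row unambiguous while adding far fewer characters than quoting everything.

```python
import csv
import io

row = ["This Value", "That Value", 3, "Other Value", "423,985.32"]


def render(fields, quoting):
    """Write one row to an in-memory buffer and return the CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=quoting, lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue()


minimal = render(row, csv.QUOTE_MINIMAL)  # quotes only where needed
quoted = render(row, csv.QUOTE_ALL)       # quotes every field

print(minimal, end="")
print(quoted, end="")
print(len(quoted) - len(minimal), "extra characters per row with QUOTE_ALL")
```

Over millions of rows and dozens of columns, two quote characters per field add up, so minimal quoting is a reasonable middle ground between size and safety.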
