Speed of writing a numpy array to a text file

Question

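I need to write a very "tall" two-column array to a text file, and it is very slow. I found that if I reshape the array into a wider one, writing is much faster. For example: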

import time
import numpy as np
dataMat1 = np.random.rand(1000,1000)
dataMat2 = np.random.rand(2,500000)
dataMat3 = np.random.rand(500000,2)
start = time.perf_counter()
with open('test1.txt','w') as f:
    np.savetxt(f,dataMat1,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('test2.txt','w') as f:
    np.savetxt(f,dataMat2,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('test3.txt','w') as f:
    np.savetxt(f,dataMat3,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

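Since all three data matrices contain the same number of elements, why does the last one take so much longer to write than the other two? Is there a way to speed up writing a "tall" data array?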

Answer 1

As hpaulj pointed out, savetxt loops over the rows of X and formats each row individually:

for row in X:
    try:
        v = format % tuple(row) + newline
    except TypeError:
        raise TypeError("Mismatch between array dtype ('%s') and "
                        "format specifier ('%s')"
                        % (str(X.dtype), format))
    fh.write(v)

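I think the main time-killer here is all the string interpolation calls. If we pack all the string interpolation into a single call, things go much faster: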

with open('/tmp/test4.txt','w') as f:
    fmt = ' '.join(['%g']*dataMat3.shape[1])
    fmt = '\n'.join([fmt]*dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())
    f.write(data)

Timing this against the three savetxt calls from the question:

import time
import numpy as np

dataMat1 = np.random.rand(1000,1000)
dataMat2 = np.random.rand(2,500000)
dataMat3 = np.random.rand(500000,2)
start = time.perf_counter()
with open('/tmp/test1.txt','w') as f:
    np.savetxt(f,dataMat1,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test2.txt','w') as f:
    np.savetxt(f,dataMat2,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test3.txt','w') as f:
    np.savetxt(f,dataMat3,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test4.txt','w') as f:
    fmt = ' '.join(['%g']*dataMat3.shape[1])
    fmt = '\n'.join([fmt]*dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())
    f.write(data)
end = time.perf_counter()
print(end-start)

The benchmark above reports (in seconds, for test1 through test4 in order):

0.1604848340011813
0.17416274400056864
0.6634929459996783
0.16207673999997496
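
The single-call approach builds one very large format string and one very large tuple, which may use a lot of memory for big arrays. A minimal sketch of a middle ground, assuming it is acceptable to interpolate a block of rows at a time: the helper name `write_rows_in_chunks` and its `chunk_rows` parameter are made up for illustration and are not part of NumPy. Unlike the snippet above, it ends every row with a newline, which is what savetxt does, so the output can be compared directly.

```python
import io
import numpy as np

def write_rows_in_chunks(fh, arr, fmt='%g', delimiter=' ', chunk_rows=100_000):
    """Write a 2-D array as text with one string interpolation per block of rows.

    Hypothetical helper, not a NumPy function.
    """
    arr = np.asarray(arr)
    if arr.ndim != 2:
        raise ValueError("expected a 2-D array")
    # Format for one row, e.g. '%g %g\n' for a two-column array.
    row_fmt = delimiter.join([fmt] * arr.shape[1]) + '\n'
    for start in range(0, arr.shape[0], chunk_rows):
        block = arr[start:start + chunk_rows]
        # Repeat the row format once per row in the block, then interpolate
        # all of the block's values in a single % operation.
        values = tuple(block.ravel().tolist())
        fh.write((row_fmt * block.shape[0]) % values)

# Compare against np.savetxt on a small tall array.
rng = np.random.default_rng(0)
tall = rng.random((1000, 2))

expected = io.StringIO()
np.savetxt(expected, tall, fmt='%g', delimiter=' ')

actual = io.StringIO()
write_rows_in_chunks(actual, tall, chunk_rows=300)

same_output = actual.getvalue() == expected.getvalue()
print(same_output)
```

A suitable `chunk_rows` is a trade-off between memory use and the number of interpolation calls; it is worth timing a few values on the real data rather than trusting the default chosen here.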

Answer 2

The code for savetxt is Python and is accessible. Basically it does a formatted write for each row. In effect it does:

for row in arr:
    f.write(fmt%tuple(row))

where fmt is derived from your fmt and the shape of the array, e.g.

'%g %g %g ...'

So it performs one file write for each row of the array. Formatting each row takes some time too, but that is done in memory with Python code.
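
One way to see the per-row behaviour is to hand savetxt a file-like object that counts its own write() calls. This is only a sketch: `CountingWriter` and `count_savetxt_writes` are names invented here, and the exact number of calls is an implementation detail that may differ between NumPy versions.

```python
import io
import numpy as np

class CountingWriter(io.StringIO):
    """In-memory text file that counts calls to write()."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def write(self, text):
        self.calls += 1
        return super().write(text)

def count_savetxt_writes(arr):
    """Return how many write() calls np.savetxt makes for arr."""
    sink = CountingWriter()
    np.savetxt(sink, arr, fmt='%g', delimiter=' ')
    return sink.calls

# Same number of elements, different shapes.
wide = np.zeros((2, 5000))
tall = np.zeros((5000, 2))

# The per-row format string is the element format repeated once per column.
row_fmt = ' '.join(['%g'] * tall.shape[1])
print(repr(row_fmt))

wide_calls = count_savetxt_writes(wide)
tall_calls = count_savetxt_writes(tall)
print(wide_calls, tall_calls)
```

The tall array should show far more write() calls than the wide one, even though both hold the same number of elements.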

I expect loadtxt/genfromtxt will show the same timing pattern: reading many rows takes longer.

pandas has a faster csv load. I haven't seen any discussion of its write speed.
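
For anyone who wants to try the pandas route, a minimal sketch of writing the same kind of file with `DataFrame.to_csv` follows. It assumes pandas is installed; the wrapper name `write_with_pandas` is hypothetical. Nothing here shows that it is faster than savetxt, so it would need to be timed on the real data.

```python
import io
import numpy as np
import pandas as pd

def write_with_pandas(target, arr, sep=' ', float_format='%g'):
    """Write a 2-D array as delimited text via pandas (hypothetical wrapper)."""
    # header=False and index=False suppress the column names and row labels,
    # so only the array values are written.
    pd.DataFrame(arr).to_csv(target, sep=sep, header=False, index=False,
                             float_format=float_format)

rng = np.random.default_rng(0)
tall = rng.random((1000, 2))

buf = io.StringIO()
write_with_pandas(buf, tall)
text = buf.getvalue()

# Read the text back to check that the layout is what loadtxt expects.
loaded = np.loadtxt(io.StringIO(text))
print(loaded.shape)
```

Passing a filename instead of the StringIO buffer writes straight to disk.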

Answer 1: score 4, accepted, 2018-12-17 19:44:41

Answer 2: score 3, 2018-12-17 18:19:35
