Does a matplotlib scatter plot get slow with large amounts of data?

Question

I have a data set with attributes x and y that can be plotted on the xy-plane.

Initially, I used this code:

df.plot(kind='scatter', x='x', y='y', alpha=0.10, s=2)
plt.gca().set_aspect('equal')

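This code was quite fast with a data size of about 50,000.

Recently I started using a new data set of about 2,500,000 points, and the scatter plot has become much slower.

I would like to know whether this is expected behavior, and whether there is anything I can do to speed up the plotting.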

Answer 1

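Yes, it is. The reason is that a scatter plot with more than about a thousand points makes almost no sense, so nobody has bothered to optimize it. You would be better off using some other representation of your data: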

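A heat map, if your points are distributed all over the place. Make the heat map cells pretty small.
Plot some sort of curve that approximates the distribution, perhaps relating y to x. Make sure to provide some confidence values or otherwise describe the distribution; for me, for example, building a box-and-whisker plot of y for every x (or for a range of x) and placing them on the same grid usually works pretty well.
Reduce your data set. @sascha suggested random sampling in the comments, and that is definitely a good idea. Depending on your data, there may be a better way to pick representative points.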
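
As a rough sketch of the first and third suggestions (not code from this answer): the points can either be thinned by random sampling or binned into a small-celled grid and drawn as a single image. The helper names `random_subsample`, `density_grid` and `plot_density`, the `bins=500` default and the fixed seed are illustrative assumptions, not part of matplotlib or pandas.

```python
import numpy as np
import matplotlib.pyplot as plt


def random_subsample(x, y, n_keep, seed=0):
    """Return at most n_keep randomly chosen (x, y) pairs, without replacement."""
    x = np.asarray(x)
    y = np.asarray(y)
    if len(x) <= n_keep:
        return x, y
    rng = np.random.default_rng(seed)  # fixed seed: an assumption, for repeatability
    idx = rng.choice(len(x), size=n_keep, replace=False)
    return x[idx], y[idx]


def density_grid(x, y, bins=500):
    """Count the points that fall into each cell of a bins x bins grid."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    return counts, x_edges, y_edges


def plot_density(x, y, bins=500):
    """Draw the per-cell counts as one image instead of one marker per point."""
    counts, x_edges, y_edges = density_grid(x, y, bins=bins)
    fig, ax = plt.subplots()
    # histogram2d puts x along axis 0, so transpose to get x on the horizontal axis.
    image = ax.imshow(
        counts.T,
        origin="lower",
        extent=[x_edges[0], x_edges[-1], y_edges[0], y_edges[-1]],
        aspect="equal",
        cmap="hot",
    )
    fig.colorbar(image, ax=ax, label="points per cell")
    return fig, ax
```

With the DataFrame from the question, this could be called as `plot_density(df['x'], df['y'])`. Alternatively, `plt.scatter(*random_subsample(df['x'], df['y'], 50_000), s=2, alpha=0.1)` keeps a scatter plot at the size that was still fast.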

Answer 2

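I had the same problem with more than 300k 2D coordinates from a dimensionality reduction algorithm. The solution was to approximate the plot by converting the coordinates into a 2D numpy array and visualizing it as an image. The result is quite good, and also much faster: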

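import numpy as np

def plot_to_buf(data, height=2800, width=2800, inc=0.3):
    # Bounding box of the points; data is an (n, 2) array of x, y pairs.
    xlims = (data[:,0].min(), data[:,0].max())
    ylims = (data[:,1].min(), data[:,1].max())
    dxl = xlims[1] - xlims[0]
    dyl = ylims[1] - ylims[0]

    print('xlims: (%f, %f)' % xlims)
    print('ylims: (%f, %f)' % ylims)

    # One extra row and column so that the maximum maps to a valid index.
    buffer = np.zeros((height+1, width+1))
    for i, p in enumerate(data):
        # Report progress occasionally; printing on every point is slow.
        if i % 10000 == 0:
            print('\rloading: %03d' % (float(i)/data.shape[0]*100), end=' ')
        # Scale to pixel indices; y is flipped because image row 0 is the top.
        x0 = int(round(((p[0] - xlims[0]) / dxl) * width))
        y0 = int(round((1 - (p[1] - ylims[0]) / dyl) * height))
        # Each hit brightens the pixel by inc, saturating at 1.0.
        buffer[y0, x0] = min(buffer[y0, x0] + inc, 1.0)
    return xlims, ylims, buffer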

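import matplotlib.pyplot as plt

h, w = 2800, 2800  # image size in pixels; assumed here, matches the function defaults

data = load_data() # data.shape = (310216, 2) <<< your data here
xlims, ylims, I = plot_to_buf(data, height=h, width=w, inc=0.3)
# imshow expects extent as (left, right, bottom, top)
ax_extent = list(xlims)+list(ylims)
plt.imshow(I,
           vmin=0,
           vmax=1,
           cmap=plt.get_cmap('hot'),
           interpolation='lanczos',
           aspect='auto',
           extent=ax_extent
           )
plt.grid(alpha=0.2)
plt.title('Latent space')
plt.colorbar()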

The result looks like this:

[Figure: 'Latent space' heat map produced by the code above]

I hope this helps.
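
For what it's worth, the per-point Python loop in `plot_to_buf` is likely to dominate the run time on large arrays. Below is a minimal vectorized sketch of the same rasterization, assuming `inc` is positive. The name `points_to_image` and the zero-range guard are additions for illustration, not part of the answer's code.

```python
import numpy as np


def points_to_image(data, height=2800, width=2800, inc=0.3):
    """Rasterize (n, 2) points into a (height+1, width+1) intensity image."""
    data = np.asarray(data, dtype=float)
    x_min, y_min = data.min(axis=0)
    x_max, y_max = data.max(axis=0)
    # Guard against a zero range (an assumption; the loop version would divide by zero).
    x_span = (x_max - x_min) or 1.0
    y_span = (y_max - y_min) or 1.0

    # Pixel indices for every point at once; y is flipped so row 0 is the top.
    cols = np.rint((data[:, 0] - x_min) / x_span * width).astype(int)
    rows = np.rint((1.0 - (data[:, 1] - y_min) / y_span) * height).astype(int)

    image = np.zeros((height + 1, width + 1))
    # np.add.at accumulates correctly when the same pixel is hit several times.
    np.add.at(image, (rows, cols), inc)
    # Saturate at 1.0, as the clamp inside the loop does.
    np.minimum(image, 1.0, out=image)
    return (x_min, x_max), (y_min, y_max), image
```

The return value has the same layout as that of `plot_to_buf`, so it can be passed to the `plt.imshow` call above unchanged.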

Answer 1
3 2017-03-07 02:55:05

Answer 2
2 2018-10-22 02:06:14
