

Load and random shuffle 8 gigabytes of csv data in Python

Essentially, I've got 8 gigabytes of CSV data and I want to shuffle it randomly so that I can do mini-batches in my ML model. However, if I load all 8 GB straight into Python and shuffle it there, I run into memory problems.

But if I load the data chunk by chunk and shuffle each chunk, the data still follows its original pattern, because the file is sorted to begin with. This is what I've done so far:

import pandas as pd
import numpy as np

# get a chunk with CHUNK_SIZE rows
reader = pd.read_csv(path, header=0, iterator=True)
data = reader.get_chunk(CHUNK_SIZE)

# randomly shuffle the rows of the chunk
# (np.random.shuffle works in place and returns None, so assigning
#  its result would lose the data; sample(frac=1) shuffles instead)
data = data.sample(frac=1).reset_index(drop=True)

Is there a way I can do this quickly and memory-efficiently? Thank you.

UPDATE: The file has approximately 30,000,000 rows and is sorted by time.

Here's a concept...

Generate a 30,000,000-line CSV with Perl - takes 11 seconds on my Mac:

perl -E 'for($i=0;$i<30000000;$i++){say "Line $i,field2,field3,",int rand 100}' > BigBoy.csv

Sample Output

Line 0,field2,field3,49
Line 1,field2,field3,6
Line 2,field2,field3,15
...
Line 29999998,field2,field3,79
Line 29999999,field2,field3,19
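
If you'd rather stay in Python, a rough (and noticeably slower) equivalent of that Perl one-liner might look like the sketch below; the file name and row count simply mirror the example above:

import random

# write a 30,000,000-line CSV similar to the Perl one-liner above
with open("BigBoy.csv", "w") as f:
    for i in range(30_000_000):
        f.write(f"Line {i},field2,field3,{random.randrange(100)}\n")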

Take 1% of the lines and shuffle them - takes 3 seconds and 15MB of RAM:

awk 'rand()>0.99' BigBoy.csv | gshuf > RandomSet.csv

RandomSet.csv contains 299,748 lines:

Sample Output

Line 15348259,field2,field3,95
Line 1642442,field2,field3,93
Line 29199452,field2,field3,52
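
If awk and gshuf aren't available, the same concept can be sketched in plain Python: stream the big file once, keep each line with probability ~1%, and shuffle the small sample in memory. The file names and the 1% rate below just mirror the example:

import random

sample = []
# stream through the big file, keeping each line with probability ~1%
with open("BigBoy.csv") as src:
    for line in src:
        if random.random() < 0.01:
            sample.append(line)

# the ~300,000 retained lines fit comfortably in memory, so shuffle them there
random.shuffle(sample)

with open("RandomSet.csv", "w") as dst:
    dst.writelines(sample)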

gshuf was installed on the Mac using homebrew:

brew install coreutils
