

Load and random shuffle 8 gigabytes of csv data in Python

Essentially, I've got 8 gigabytes of CSV data and I want to shuffle it randomly so that I can do mini-batches in my ML model. However, if I load all 8 GB straight into Python and shuffle it there, I run into memory problems.

But if I load the data chunk by chunk and shuffle each chunk, the data still follows its original pattern, because the file is sorted to begin with. This is what I've done so far:

import pandas as pd
import numpy as np

# get a chunk with CHUNK_SIZE rows
reader = pd.read_csv(path, header=0, iterator=True)
data = reader.get_chunk(CHUNK_SIZE)

# randomly shuffle the rows of the chunk
# (np.random.shuffle works in place and returns None, so assigning
#  its result would lose the data; sample(frac=1) shuffles instead)
data = data.sample(frac=1).reset_index(drop=True)

Is there a way I can do this quickly and memory-efficiently? Thank you.

UPDATE: The file has approximately 30,000,000 rows and is sorted by time.

Here's a concept...

Generate a 30,000,000-line CSV with Perl - takes 11 seconds on my Mac:

perl -E 'for($i=0;$i<30000000;$i++){say "Line $i,field2,field3,",int rand 100}' > BigBoy.csv

Sample Output

Line 0,field2,field3,49
Line 1,field2,field3,6
Line 2,field2,field3,15
...
Line 29999998,field2,field3,79
Line 29999999,field2,field3,19
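
If you'd rather stay in Python, a rough (and noticeably slower) equivalent of that Perl one-liner might look like the sketch below; the file name and row count simply mirror the example above:

import random

# write a 30,000,000-line CSV similar to the Perl one-liner above
with open("BigBoy.csv", "w") as f:
    for i in range(30_000_000):
        f.write(f"Line {i},field2,field3,{random.randrange(100)}\n")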

Take 1% of the lines and shuffle them - takes 3 seconds and 15MB of RAM:

awk 'rand()>0.99' BigBoy.csv | gshuf > RandomSet.csv

RandomSet.csv contains 299,748 lines:

Sample Output

Line 15348259,field2,field3,95
Line 1642442,field2,field3,93
Line 29199452,field2,field3,52
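
If awk and gshuf aren't available, the same concept can be sketched in plain Python: stream the big file once, keep each line with probability ~1%, and shuffle the small sample in memory. The file names and the 1% rate below just mirror the example:

import random

sample = []
# stream through the big file, keeping each line with probability ~1%
with open("BigBoy.csv") as src:
    for line in src:
        if random.random() < 0.01:
            sample.append(line)

# the ~300,000 retained lines fit comfortably in memory, so shuffle them there
random.shuffle(sample)

with open("RandomSet.csv", "w") as dst:
    dst.writelines(sample)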

gshuf was installed on the Mac using homebrew:

brew install coreutils
