
Load and randomly shuffle 8 gigabytes of CSV data in Python

Essentially, I've got 8 gigabytes of CSV data and I want to shuffle it randomly so that I can do mini-batches in my ML model. However, if I load all 8 GB straight into Python and shuffle it there, I run into a memory problem.

But if I load the data chunk by chunk and shuffle each chunk, the data still follows the original pattern, because the file is sorted to begin with. This is what I've done so far:

import pandas as pd
import numpy as np

# get a chunk with CHUNK_SIZE rows
reader = pd.read_csv(path, header=0, iterator=True)
data = reader.get_chunk(CHUNK_SIZE)

# randomly shuffle the rows of this chunk
# (np.random.shuffle shuffles in place and returns None, so assigning its
#  result would set data to None; DataFrame.sample avoids that pitfall)
data = data.sample(frac=1).reset_index(drop=True)

Are there any ways that I can do it fast and efficiently? Thank you.

UPDATE: I have approximately 30,000,000 rows, and the data is sorted by time.
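
One common way to get a genuinely random order without ever holding the whole file in memory is a two-pass external shuffle: scatter rows into a number of temporary files chosen at random, each small enough to fit in RAM, then shuffle every temporary file in memory and concatenate the results. Below is a minimal sketch of that idea, reusing the path variable from the code above; the bucket count, chunk size, and file names are only illustrative:

import os
import numpy as np
import pandas as pd

CHUNK_SIZE = 1_000_000   # rows held in memory at a time (illustrative)
N_BUCKETS = 32           # each temporary file ends up roughly 8 GB / 32 = 250 MB

# Pass 1: scatter every row into one of N_BUCKETS temporary CSVs at random.
# Note: the header row is dropped in the output.
bucket_paths = [f"bucket_{i}.csv" for i in range(N_BUCKETS)]
buckets = [open(p, "w") for p in bucket_paths]
for chunk in pd.read_csv(path, header=0, chunksize=CHUNK_SIZE):
    assignment = np.random.randint(0, N_BUCKETS, size=len(chunk))
    for i in range(N_BUCKETS):
        chunk[assignment == i].to_csv(buckets[i], header=False, index=False)
for f in buckets:
    f.close()

# Pass 2: each bucket fits in memory, so shuffle it and append to the output.
with open("shuffled.csv", "w") as out:
    for p in bucket_paths:
        part = pd.read_csv(p, header=None)
        part.sample(frac=1).to_csv(out, header=False, index=False)
        os.remove(p)

Every row lands in a randomly chosen bucket, so after the per-bucket shuffle the sorted-by-time pattern is gone, and no step needs more than one bucket plus one chunk in memory at once.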

Here's a concept...

Generate a 30,000,000-line CSV with Perl - takes 11 seconds on my Mac:

perl -E 'for($i=0;$i<30000000;$i++){say "Line $i,field2,field3,",int rand 100}' > BigBoy.csv

Sample Output

Line 0,field2,field3,49
Line 1,field2,field3,6
Line 2,field2,field3,15
...
Line 29999998,field2,field3,79
Line 29999999,field2,field3,19

Take 1% of the lines and shuffle them - takes 3 seconds and 15MB of RAM:

awk 'rand()>0.99' BigBoy.csv | gshuf > RandomSet.csv

RandomSet.csv contains 299,748 lines:

Sample Output

Line 15348259,field2,field3,95
Line 1642442,field2,field3,93
Line 29199452,field2,field3,52

gshuf was installed on the Mac using Homebrew:

brew install coreutils
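
For anyone who would rather stay in pandas than install coreutils, the same sample-then-shuffle idea can be expressed in a few lines. This is only a rough equivalent of the awk/gshuf pipeline above; it assumes the test file has no header row, and the chunk size is just an example:

import numpy as np
import pandas as pd

# Keep each row with probability 0.01 (mirrors awk 'rand()>0.99'),
# then shuffle the ~300,000 surviving rows in memory.
sample = pd.concat(
    chunk[np.random.rand(len(chunk)) < 0.01]
    for chunk in pd.read_csv("BigBoy.csv", header=None, chunksize=1_000_000)
)
sample.sample(frac=1).to_csv("RandomSet.csv", header=False, index=False)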
