
Efficient way to shuffle data from different large files

For example, what I have are df1 and df2 from different domains:

import pandas as pd

df1 = pd.DataFrame({"question":["q1","q2"], "answer":["a1","a2"], "domain":"tech"})
df2 = pd.DataFrame({"question":["q3","q4"], "answer":["a3","a4"], "domain":"history"})

print(df1)
  question answer domain
0       q1     a1   tech
1       q2     a2   tech

print(df2)
  question answer   domain
0       q3     a3  history
1       q4     a4  history

What I want is the shuffled data:

print(shuffled1)
  question answer   domain
0       q3     a3  history
1       q1     a1     tech
print(shuffled2)
  question answer   domain
0       q2     a2     tech
1       q4     a4  history
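For this small example, the interleaving could be done entirely in memory with something like the sketch below (the exact row order depends on the random draw):

# mix both domains, shuffle the rows, then split back into two frames
both = pd.concat([df1, df2], ignore_index=True).sample(frac=1)
half = len(both) // 2
shuffled1 = both.iloc[:half].reset_index(drop=True)
shuffled2 = both.iloc[half:].reset_index(drop=True)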

In the real world, I have 60+ CSV files from different domains which have the same structure. Each file has 50k records. They cannot all be read into memory at the same time.

What I want to do is feed these files into a BERT model to train it, but the model will do badly if it learns from the "history" domain for 10k steps and then from the "tech" domain for another 10k steps. So I want to shuffle the data across the files, so that each file contains an even mix of all domains.

One answer would be to read each file one by one and spread its lines across N new files. Doing so, you obtain "shuffled files" with a similar number of lines and with the same proportion of lines from each original file. Of course, it depends a lot on what kind of shuffled files you need.

The reading of the initial files can be done in parallel, but the processes would then need to be coordinated so that they never write to the same file at the same time. I won't describe that in detail here, because I think it is more than what is needed; see for example: Python multiprocessing safely writing to a file.
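A rough sketch of that coordination, using one multiprocessing.Lock per shuffled output file (the pool size and helper names here are only illustrative, and the bigfile-{i} inputs are the ones generated further below):

import multiprocessing as mp
import numpy as np

originalFiles = [f'bigfile-{i}' for i in range(50)]
nbShuffled = len(originalFiles)

def init_worker(shared_locks):
    global locks               # make the shared locks visible inside each worker
    locks = shared_locks

def spread_one_file(path):
    with open(path) as f:
        lines = f.readlines()
    np.random.shuffle(lines)
    # split the shuffled lines into nbShuffled nearly equal blocks
    indices = np.array_split(np.arange(len(lines)), nbShuffled)
    for b, idx in enumerate(indices):
        with locks[b]:         # only one process appends to a given shuffled file at a time
            with open(f'bigfile_shuffled-{b}', 'a') as f:
                f.writelines(lines[j] for j in idx)

if __name__ == '__main__':
    lock_list = [mp.Lock() for _ in range(nbShuffled)]
    with mp.Pool(processes=4, initializer=init_worker, initargs=(lock_list,)) as pool:
        pool.map(spread_one_file, originalFiles)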

Besides the number of files you have and/or want, the limiting part below is the shuffling. Given your question, which is limited to files of 50k lines and machine learning, I think the procedure below is enough. A 50k x 10 array takes around 4 MB, so it can be loaded entirely into memory and shuffled with np.random.shuffle. If it were much bigger, you would need another method, see shuffle a large list of items without loading in memory.
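That size estimate is easy to verify:

import numpy as np

arr = np.random.randint(1000, size=(50_000, 10), dtype=np.int64)  # 50k rows x 10 columns of 64-bit integers
print(arr.nbytes / 1e6)   # -> 4.0 (MB), small enough to shuffle in memory
np.random.shuffle(arr)    # shuffles the rows in place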

Thus, the procedure could be:

  1. For the original file 1:
    1. Read the file
    2. Shuffle the file
    3. Divide the file into N blocks (assuming N is not larger than the number of rows)
    4. Write the blocks into the shuffled files
  2. Go to the next file and restart at 1.1.

First things first, I generated 50 files of 100,000 lines each (about 25 MB per file):

import pandas as pd
import numpy as np

for i in range(50):
    arr = np.random.randint(1000, size=(100000,10))
    with open(f'bigfile-{i}', 'w') as f: np.savetxt(f, arr, delimiter=',')

That's rough code, but it works:

import numpy as np

originalFiles = [f'bigfile-{i}' for i in range(50)] # paths of your original files
nbShuffled = len( originalFiles ) # number of shuffled files (you can choose)

for i, file in enumerate( originalFiles ):
    # 1. Read the original file
    with open(file, 'r') as f: lines = f.readlines()
    # 2. Shuffle its lines
    np.random.shuffle( lines )
    # 3. Compute the number of lines per block
    nbLines = len( lines )
    firstBlocks = nbLines // nbShuffled
    lastBlock = firstBlocks + nbLines % nbShuffled
    blocks = [firstBlocks] * ( nbShuffled - 1 ) + [lastBlock]
    # 4. Write one block to each shuffled file
    np.random.shuffle( blocks ) # avoid the larger last block always landing in the same shuffled file
    x = 0
    for b in range( nbShuffled ):
        with open( f'bigfile_shuffled-{b}', 'a' ) as f: f.writelines( lines[ x : x + blocks[b] ] )
        x += blocks[b]

It took ~13 s to run on my computer (64-bit Linux, 32 GB RAM, 16 CPUs).
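A quick sanity check is to count the lines per shuffled file; with 50 input files of 100,000 lines spread over 50 output files, each should hold about 100,000 lines:

for b in range( nbShuffled ):
    with open( f'bigfile_shuffled-{b}' ) as f:
        print( f'bigfile_shuffled-{b}:', sum( 1 for _ in f ) )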
