
Data Augmentation: What proportion of the training dataset needs to be augmented?

I am currently working on a speech classification problem. I have 7 classes with 1000 audio files in each, and I need to augment the data to achieve better accuracy. I am using the librosa library for data augmentation, applying the code below to every audio file.

import numpy as np
import librosa
from python_speech_features import logfbank   # logfbank is assumed to come from the python_speech_features package

fbank_train = []
labels_train = []
# x_train_one is the list of paths to the training .wav files
for wav in x_train_one:
    samples, sample_rate = librosa.load(wav, sr=16000)
    if len(samples) == 16000:                  # keep only clips that are exactly one second long
        label = wav.split('/')[6]              # class name taken from the file path
        # original (un-augmented) features
        fbank = logfbank(samples, sample_rate, nfilt=16)
        fbank_train.append(fbank)
        labels_train.append(label)
        # pitch-shifted copy (+4 quarter-tone steps)
        y_shifted = librosa.effects.pitch_shift(samples, sr=sample_rate, n_steps=4, bins_per_octave=24)
        fbank_y_shifted = logfbank(y_shifted, sample_rate, nfilt=16)
        fbank_train.append(fbank_y_shifted)
        labels_train.append(label)
        # slowed-down copy (0.75x speed lengthens the clip, so truncate to one second)
        change_speed = librosa.effects.time_stretch(samples, rate=0.75)
        if len(change_speed) >= 16000:
            change_speed = change_speed[:16000]
            fbank_change_speed = logfbank(change_speed, sample_rate, nfilt=16)
            fbank_train.append(fbank_change_speed)
            labels_train.append(label)
        # sped-up copy (1.25x speed shortens the clip, so zero-pad back to one second)
        change_speedp = librosa.effects.time_stretch(samples, rate=1.25)
        if len(change_speedp) <= 16000:
            change_speedp = np.pad(change_speedp, (0, max(0, 16000 - len(change_speedp))), "constant")
            fbank_change_speedp = logfbank(change_speedp, sample_rate, nfilt=16)
            fbank_train.append(fbank_change_speedp)
            labels_train.append(label)

That is, I am augmenting each audio file (pitch-shifting and time-stretching). I would like to know whether this is the correct way of augmenting the training dataset, and if not, what proportion of audio files needs to be augmented?

The most common way of performing augmentation is to apply it to the whole dataset, with a random chance for each sample of being augmented or not.

Also, in most cases the augmentation is done at runtime.

For example, pseudocode for your case could look like this:

for e in epochs:
    reshuffle_training_set
    for x, y in training_set:
        if np.random.random() > 0.5:
            x = randomly_shift_pitch(x)
        if np.random.random() > 0.5:
            x = randomly_shift_time(x)
        model.fit(x, y)
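
To make that concrete, below is a minimal runnable sketch of runtime augmentation in Python, reusing the same librosa and logfbank calls as in your code. The names wav_paths, labels and batch_size, as well as the ranges of the random pitch and speed changes, are illustrative assumptions; the generator it produces can be fed to whatever training loop you use (for example Keras' model.fit with a generator).

import numpy as np
import librosa
from python_speech_features import logfbank

def random_augment(samples, sample_rate):
    # Randomly pitch-shift and/or time-stretch one waveform.
    if np.random.random() > 0.5:
        # random pitch shift, here between -4 and +4 quarter-tone steps (assumed range)
        steps = np.random.uniform(-4, 4)
        samples = librosa.effects.pitch_shift(samples, sr=sample_rate,
                                              n_steps=steps, bins_per_octave=24)
    if np.random.random() > 0.5:
        # random speed change, here between 0.75x and 1.25x (assumed range)
        rate = np.random.uniform(0.75, 1.25)
        samples = librosa.effects.time_stretch(samples, rate=rate)
    # pad or truncate back to exactly one second at 16 kHz
    samples = np.pad(samples, (0, max(0, 16000 - len(samples))), "constant")[:16000]
    return samples

def training_batches(wav_paths, labels, batch_size=32):
    # Yields (features, labels) batches with fresh random augmentations every epoch.
    while True:
        order = np.random.permutation(len(wav_paths))   # reshuffle the training set
        for start in range(0, len(order), batch_size):
            batch_x, batch_y = [], []
            for i in order[start:start + batch_size]:
                samples, sr = librosa.load(wav_paths[i], sr=16000)
                if len(samples) != 16000:
                    continue
                samples = random_augment(samples, sr)
                batch_x.append(logfbank(samples, sr, nfilt=16))
                batch_y.append(labels[i])
            yield np.array(batch_x), np.array(batch_y)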

This means that each sample has a 25% chance of not being augmented at all, a 25% chance of being only time-shifted, a 25% chance of being only pitch-shifted, and a 25% chance of being both time- and pitch-shifted.

During the next epoch, that same sample is augmented again with the above strategies. If you train your model for multiple epochs, each sample will (with high probability) pass through every combination of augmentations, so the model will learn from all of them.

Also, if the amount of each shift is drawn at random (as in the sketch above), even a sample that passes through the same augmentor twice won't produce the same perturbed result.

A benefit of augmenting the samples at runtime rather than performing the full augmentation beforehand is that, to get the same effect offline, you would need to create several new datasets (i.e. a few time-shifted ones, pitch-shifted ones, and combinations of both) and train the model on the combined, much larger dataset.
