
How to deal with different audio formats for audio classification?

I am working on an audio classification problem with two classes. I collected samples through Jotform, which provides an audio widget that is supposed to record .wav audio, but it turned out that the widget stores the data in .mp3 format.

In my dataset, the two classes are stored in different formats:

class A: all the samples are in .wav format
class B: all the 100 samples are in .mp3 format (Jotform collection)

Here is one sample from each class:

Class A sample audio: it is in .wav format.

Details:

General
Complete name                            : count_class_1.wav
Format                                   : Wave
File size                                : 1.41 MiB
Duration                                 : 15 s 445 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 768 kb/s

Audio
Format                                   : PCM
Format settings                          : Little / Signed
Codec ID                                 : 1
Duration                                 : 15 s 445 ms
Bit rate mode                            : Constant
Bit rate                                 : 768 kb/s
Channel(s)                               : 1 channel
Sampling rate                            : 48.0 kHz
Bit depth                                : 16 bits
Stream size                              : 1.41 MiB (100%)

Class B sample audio: Jotform says it is in .wav format, but only the extension is .wav; the file itself is MP3.

Details:

General
Complete name                            : count.wav
Format                                   : MPEG Audio
File size                                : 183 KiB
Duration                                 : 9 s 360 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 160 kb/s
Writing library                          : LAME3.99.5
FileExtension_Invalid                    : m1a mpa mpa1 mp1 m2a mpa2 mp2 mp3

Audio
Format                                   : MPEG Audio
Format version                           : Version 1
Format profile                           : Layer 3
Format settings                          : Joint stereo / MS Stereo
Duration                                 : 9 s 360 ms
Bit rate mode                            : Constant
Bit rate                                 : 160 kb/s
Channel(s)                               : 2 channels
Sampling rate                            : 48.0 kHz
Frame rate                               : 41.667 FPS (1152 SPF)
Compression mode                         : Lossy
Stream size                              : 183 KiB (100%)
Writing library                          : LAME3.99.5
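A mismatch like this one (a .wav extension on MPEG audio) can also be detected without MediaInfo by looking at the first bytes of each file. The following is a minimal sketch, not a full format detector: `sniff_audio_format` is a hypothetical helper name, and it only recognizes RIFF/WAVE headers, ID3v2 tags, and a leading MPEG Layer III frame header.

```python
def sniff_audio_format(path):
    """Guess the real audio format from the first bytes, ignoring the extension."""
    with open(path, "rb") as handle:
        header = handle.read(12)

    # RIFF/WAVE files start with "RIFF", a 4-byte chunk size, then "WAVE".
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"

    # MP3 files often start with an ID3v2 tag...
    if header[:3] == b"ID3":
        return "mp3"

    # ...or directly with an MPEG audio frame header:
    # 11 sync bits set, and the two layer bits equal to 01 (Layer III).
    if (
        len(header) >= 2
        and header[0] == 0xFF
        and (header[1] & 0xE0) == 0xE0
        and (header[1] & 0x06) == 0x02
    ):
        return "mp3"

    return "unknown"
```

Running this over the whole dataset and comparing the result with each file's extension would show how many files are affected.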

What I am doing before feeding the audio to the neural network:

  1. Downsample to 16 kHz and normalize the signal level.
  2. Remove the silences and split the signal into audio segments.
  3. Apply a high-pass (pre-emphasis) filter, then divide each audio segment into non-overlapping Hamming-windowed frames of 25 ms.
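The level normalization, pre-emphasis, and framing steps above can be sketched with NumPy as follows. This is only an illustration under assumptions: the function names are hypothetical, the pre-emphasis coefficient 0.97 is a common default rather than a value given in the question, peak normalization is one of several possible meanings of "normalized", and resampling and silence removal are left out.

```python
import numpy as np


def peak_normalize(signal, eps=1e-12):
    """Scale the signal so that its largest absolute sample is 1."""
    signal = np.asarray(signal, dtype=np.float64)
    return signal / max(np.max(np.abs(signal)), eps)


def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1]."""
    signal = np.asarray(signal, dtype=np.float64)
    return np.append(signal[:1], signal[1:] - coeff * signal[:-1])


def frame_signal(signal, sample_rate=16000, frame_ms=25):
    """Split into non-overlapping Hamming-windowed frames, dropping the incomplete tail."""
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[: n_frames * frame_len], (n_frames, frame_len))
    return frames * np.hamming(frame_len)
```

For one second of audio at 16 kHz, `frame_signal(pre_emphasis(peak_normalize(x)))` yields an array of 40 frames with 400 samples each.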

After this, I extract various features from each frame, including MFCCs, zero-crossing rate (ZCR), and the first four formants, and finally feed these features to a simple dense neural network, or feed spectrograms to a CNN.
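As an illustration of one of these per-frame features, the zero-crossing rate can be computed with NumPy as below. This is a sketch using one common definition (the fraction of adjacent sample pairs whose signs differ); `zero_crossing_rate` is a hypothetical helper name, and the question does not say which definition or library was actually used.

```python
import numpy as np


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    frames = np.atleast_2d(np.asarray(frames, dtype=np.float64))
    signs = np.signbit(frames)  # True for negative samples
    crossings = signs[:, 1:] != signs[:, :-1]
    return crossings.mean(axis=1)
```

Because a Hamming window is strictly positive, applying it to a frame does not change the signs of the samples, so this value is the same before and after windowing.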

The problem is that the two classes are in different formats: class A samples are .wav and class B samples are .mp3, so there is a high chance that the network will be biased towards the format or audio encoding rather than the actual class.

Solutions I have thought:

  1. Downsample all files to 16 kHz (but the format issue is still there).
  2. Convert all files to one common format, for example all .mp3 files to .wav, so that every file has the same format. I could convert one into the other, but I am afraid I will lose quality in the converted files.

My question is: if I downsample the audio samples of both classes (.wav and .mp3) to 16 kHz, will my neural network still be biased by the format?

What would be a good strategy for audio classification when the audio files are in different formats?

Converting from MP3 to Linear PCM alone won't remove the encoding artifacts, which may be "learned" by your neural network. Since MP3 is the lossy format in question, the natural approach would be to apply the same codec to your WAVE 16-bit Linear PCM files, so that both classes have gone through an MP3 encode-decode cycle.
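One way to do this round trip is with the ffmpeg command-line tool, assuming it is installed with the libmp3lame encoder. The sketch below only builds the commands; the helper names are hypothetical, the 160 kb/s bit rate is copied from the class B MediaInfo output in the question, and the mono 16 kHz target is taken from the preprocessing described there.

```python
import shutil
import subprocess


def mp3_encode_cmd(src_wav, dst_mp3, bitrate="160k"):
    """ffmpeg command that encodes a WAV file to constant-bit-rate MP3 with LAME."""
    return ["ffmpeg", "-y", "-i", str(src_wav),
            "-c:a", "libmp3lame", "-b:a", bitrate, str(dst_mp3)]


def pcm_decode_cmd(src, dst_wav, sample_rate=16000):
    """ffmpeg command that decodes any input to mono 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-ac", "1", "-ar", str(sample_rate),
            "-c:a", "pcm_s16le", str(dst_wav)]


def run(cmd):
    """Run a command, failing early if the executable is not on PATH."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(cmd[0] + " not found on PATH")
    subprocess.run(cmd, check=True)
```

For a class A file, `run(mp3_encode_cmd("a.wav", "a.mp3"))` followed by `run(pcm_decode_cmd("a.mp3", "a_16k.wav"))` performs the encode-decode cycle; for a class B file only the decode step is needed. Note that the class B sample is joint stereo while the class A sample is mono, so the downmix to one channel in the decode step matters as well.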

However, the codec alone might not be the only unintended discriminator between your classes. Aside from double-checking the audio capture implementation from Jotform, you could also apply data augmentation techniques such as those available in the audiomentations project.
