
How to deal with different audio formats for audio classification?

I am working on an audio classification problem with two classes. I collected samples through Jotform, which provides an audio widget that is supposed to record .wav audio, but it turned out that the widget stores the data in .mp3 format.

In my problem, the two classes come in different formats:

class A : all the samples are in .wav format
class B : all the 100 samples are in .mp3 format (Jotform collection)

Here is a sample from each class:

Class A sample audio: it is in .wav format.

Details:

General
Complete name                            : count_class_1.wav
Format                                   : Wave
File size                                : 1.41 MiB
Duration                                 : 15 s 445 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 768 kb/s

Audio
Format                                   : PCM
Format settings                          : Little / Signed
Codec ID                                 : 1
Duration                                 : 15 s 445 ms
Bit rate mode                            : Constant
Bit rate                                 : 768 kb/s
Channel(s)                               : 1 channel
Sampling rate                            : 48.0 kHz
Bit depth                                : 16 bits
Stream size                              : 1.41 MiB (100%)

Class B sample audio: Jotform says it is .wav, but only the extension is .wav; the file itself is in .mp3 format.

Details:

General
Complete name                            : count.wav
Format                                   : MPEG Audio
File size                                : 183 KiB
Duration                                 : 9 s 360 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 160 kb/s
Writing library                          : LAME3.99.5
FileExtension_Invalid                    : m1a mpa mpa1 mp1 m2a mpa2 mp2 mp3

Audio
Format                                   : MPEG Audio
Format version                           : Version 1
Format profile                           : Layer 3
Format settings                          : Joint stereo / MS Stereo
Duration                                 : 9 s 360 ms
Bit rate mode                            : Constant
Bit rate                                 : 160 kb/s
Channel(s)                               : 2 channels
Sampling rate                            : 48.0 kHz
Frame rate                               : 41.667 FPS (1152 SPF)
Compression mode                         : Lossy
Stream size                              : 183 KiB (100%)
Writing library                          : LAME3.99.5
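
The mismatch between extension and content shown in the MediaInfo report above can also be checked programmatically. The following is a minimal sketch, not a full parser: it only looks at the first bytes of a file, and the function names `sniff_audio_format` and `sniff_file` are made up for illustration.

```python
def sniff_audio_format(header: bytes) -> str:
    """Guess the audio format from the first bytes of a file."""
    # RIFF/WAVE files start with b"RIFF", a 4-byte size, then b"WAVE".
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    # MP3 files often start with an ID3v2 tag ...
    if header[:3] == b"ID3":
        return "mp3"
    # ... or directly with an MPEG audio frame header:
    # 11 sync bits set, and layer bits 01 meaning Layer III.
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        layer_bits = (header[1] >> 1) & 0x03
        if layer_bits == 0x01:
            return "mp3"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 12 bytes of a file and guess its format."""
    with open(path, "rb") as f:
        return sniff_audio_format(f.read(12))
```

Running `sniff_file` over every file in the dataset, regardless of extension, would show which samples are really MP3 before any conversion is attempted.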

What I am doing before feeding it to the neural network:

  1. Downsampled to 16 kHz and normalized the signal level.
  2. Segmented into audio segments by removing the silences in the signal.
  3. High-pass filtered (pre-emphasis filter). The audio segments were then divided into non-overlapping Hamming-windowed frames of 25 ms.
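
For reference, the normalization, pre-emphasis and framing steps could look roughly like the sketch below. It uses only the standard library and assumes the signal is already a list of floats at 16 kHz; resampling and silence removal are not shown. The pre-emphasis coefficient of 0.97 is a common default and an assumption here, not a value from the question, and all function names are illustrative.

```python
import math


def peak_normalize(samples):
    """Scale the signal so its largest absolute value is 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [s / peak for s in samples]


def pre_emphasis(samples, coeff=0.97):
    """High-pass pre-emphasis filter: y[n] = x[n] - coeff * x[n-1]."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out


def hamming_window(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(samples, sample_rate=16000, frame_ms=25):
    """Split into non-overlapping Hamming-windowed frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    window = hamming_window(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

A typical call chain would be `frame_signal(pre_emphasis(peak_normalize(x)))`, which yields 40 frames of 400 samples for each second of 16 kHz audio.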

After this, I extract various features from each frame, including MFCCs, zero-crossing rate (ZCR), formants (the first 4), etc., and finally feed all these features to a simple dense neural network or a CNN (spectrogram format).
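
Of the features listed, the zero-crossing rate is the simplest to compute by hand. The sketch below is one common definition (the fraction of adjacent sample pairs whose signs differ) and is only an illustration. MFCCs and formants would normally come from a signal-processing library and are not shown.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)  # zero is treated as non-negative
    )
    return crossings / (len(frame) - 1)
```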

But the problem is that the two classes' audio files are in different formats: class A samples are in .wav and class B samples are in .mp3, so there is a high chance that the network will be biased towards the format or audio encoding.

Solutions I have thought of:

  1. Downsample all files to 16 kHz (but the format issue is still there).
  2. Or convert all files into one universal format, for example convert all .mp3 files to .wav so that all files have the same format. I could convert one into the other, but I am afraid I will lose quality in the converted files.

My doubt is: if I downsample both classes' audio samples (.wav and .mp3) to 16 kHz, will my neural network still be biased by format?

What would be a good strategy for audio classification when the audio files are in different formats?

Converting from MP3 to linear PCM alone won't remove the encoding artifacts, which may be "learned" by your neural network. Since MP3 is the lossy format in question, the natural approach would be to apply the same codec to your WAVE 16-bit linear PCM files and work with both classes after an MP3 encode-decode round trip.
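
One way to do that round trip is with the ffmpeg command-line tool. This is a sketch, assuming ffmpeg is installed and built with `libmp3lame`. The 160 kb/s bitrate mirrors what MediaInfo reports for the Jotform file, but the exact encoder settings Jotform uses are unknown, so treat the values as a starting point. The function names are illustrative.

```python
import subprocess


def mp3_roundtrip_commands(src_wav, tmp_mp3, dst_wav, bitrate="160k",
                           sample_rate=16000):
    """Build ffmpeg commands: WAV -> MP3 (LAME) -> mono 16-bit PCM WAV."""
    encode = ["ffmpeg", "-y", "-i", src_wav,
              "-codec:a", "libmp3lame", "-b:a", bitrate, tmp_mp3]
    decode = ["ffmpeg", "-y", "-i", tmp_mp3,
              "-ar", str(sample_rate), "-ac", "1",
              "-codec:a", "pcm_s16le", dst_wav]
    return [encode, decode]


def run_roundtrip(src_wav, tmp_mp3, dst_wav, **kwargs):
    """Run both steps; raises CalledProcessError if ffmpeg fails."""
    for cmd in mp3_roundtrip_commands(src_wav, tmp_mp3, dst_wav, **kwargs):
        subprocess.run(cmd, check=True)
```

The files that are already MP3 would only need the second (decode) command, so that both classes end up as mono PCM at the same sample rate after having passed through the same codec once.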

However, the codec might not be the only unintended discriminator between your classes. Aside from double-checking Jotform's audio capture implementation, you could also apply data augmentation techniques like the ones available in the audiomentations project.
