Which spectrogram best represents the features of an audio file for a CNN-based model?

Question

I am looking to understand the various spectrograms used for audio analysis. I want to split an audio file into 10-second chunks, generate a spectrogram for each chunk, and train a CNN model on those images to see whether they are good or bad.
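
The chunking step described here could be sketched roughly as follows. This is only an illustration under stated assumptions: `split_into_chunks` and its `keep_remainder` flag are hypothetical names, not part of Librosa, and a synthetic signal stands in for the loaded audio file.

```python
import numpy as np

def split_into_chunks(y, sr, chunk_seconds=10.0, keep_remainder=False):
    """Split a 1-D signal into consecutive fixed-length chunks (hypothetical helper)."""
    chunk_len = int(round(chunk_seconds * sr))
    if chunk_len <= 0:
        raise ValueError("chunk_seconds * sr must be at least one sample")
    n_full = len(y) // chunk_len
    chunks = [y[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]
    remainder = y[n_full * chunk_len:]
    if keep_remainder and len(remainder) > 0:
        # Zero-pad the tail so every chunk has the same length
        chunks.append(np.pad(remainder, (0, chunk_len - len(remainder))))
    return chunks

# Synthetic 25-second signal standing in for a loaded audio file (assumption)
sr = 22050
y = np.random.default_rng(0).standard_normal(25 * sr).astype(np.float32)
chunks = split_into_chunks(y, sr)
```

Each chunk can then be passed to the spectrogram function of your choice. Dropping the incomplete tail by default keeps all CNN inputs the same size without padding artifacts.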

I have looked at linear, log, mel, etc., and read somewhere that a mel-based spectrogram is the best choice for this, but I found no proper verifiable information. I used the following simple code to generate a mel spectrogram.

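import numpy as np
import librosa
import librosa.display
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)  # mel-scaled power spectrogram
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))  # plot in dB relative to the peak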

My question is: which spectrogram best represents the features of an audio file for training a CNN? I have used linear spectrograms, but for some audio files the linear spectrograms look the same.

Answer 1

Log-scaled mel-spectrograms are the current "standard" for use with Convolutional Neural Networks. They were the most commonly used representation in the Audio Event Detection and Audio Scene Classification literature between 2015 and 2018.

To be more invariant to amplitude changes, normalization is usually applied, either to entire clips or to the windows being classified. Mean/std normalization generally works fine.
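
A minimal sketch of per-clip mean/std normalization, assuming the input is a dB-scaled spectrogram array. The `standardize` helper and its `eps` guard are hypothetical names used for illustration, and random data stands in for a real spectrogram.

```python
import numpy as np

def standardize(S_db, eps=1e-8):
    """Zero-mean, unit-variance normalization over one spectrogram (hypothetical helper)."""
    mean = S_db.mean()
    std = S_db.std()
    return (S_db - mean) / (std + eps)  # eps guards against silent (constant) clips

# Random values standing in for a dB-scaled mel-spectrogram (assumption)
rng = np.random.default_rng(0)
S_db = rng.uniform(-80.0, 0.0, size=(128, 431))
S_loud = S_db + 6.0  # the same clip with a constant 6 dB offset

norm_a = standardize(S_db)
norm_b = standardize(S_loud)
```

Because a constant offset in dB is removed by the mean subtraction, `norm_a` and `norm_b` come out essentially identical. Computing the statistics per window instead of per clip only changes which array is passed in.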

But from the perspective of a CNN, there is relatively little difference between the different spectrogram variations. So switching to another variation is unlikely to fix your issue if two or more spectrograms are basically the same.

Answer 2

To add to what has been stated, I recommend reading through A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging by Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler.

For their data, they achieved nearly identical classification accuracy with simple STFTs and mel-spectrograms. So mel-spectrograms seem to be the clear winner for dimension reduction, if you don't mind the preprocessing. The authors also found, as jonner mentions, that log-scaling (essentially converting amplitude to a dB scale) improves accuracy. You can easily do this with Librosa (using your code) like this:

import librosa
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)  # mel-scaled power spectrogram
S_db = librosa.power_to_db(S)  # log scaling: power to decibels
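
To make the log-scaling step less of a black box, here is a rough NumPy sketch of what a power-to-dB conversion does. `power_to_db_sketch` is a hypothetical stand-in written for illustration, not Librosa's own code. The defaults mirror the ones documented for `librosa.power_to_db` (`ref=1.0`, `amin=1e-10`, `top_db=80.0`), and the tiny input array is made up.

```python
import numpy as np

def power_to_db_sketch(S, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to decibels (illustrative sketch)."""
    S = np.asarray(S, dtype=float)
    # 10 * log10(S / ref), with a floor at amin to avoid log(0)
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    if top_db is not None:
        # Limit the dynamic range to top_db below the peak
        log_spec = np.maximum(log_spec, log_spec.max() - top_db)
    return log_spec

# Made-up power values standing in for a spectrogram
S = np.array([[1.0, 10.0, 100.0], [1e-12, 0.1, 1.0]])
S_db = power_to_db_sketch(S)
```

Each factor of 10 in power becomes a step of 10 dB, which is what compresses the large dynamic range of audio into something a CNN can handle more easily.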

As for normalization after dB-scaling, that seems to be hit or miss depending on your data. In the paper above, the authors found nearly no difference between the various normalization techniques on their data.

One last thing that should be mentioned is a somewhat new method called Per-Channel Energy Normalization (PCEN). I recommend reading Per-Channel Energy Normalization: Why and How by Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Unfortunately, some parameters need adjusting depending on the data, but in many cases it seems to do as well as or better than log-mel-spectrograms. You can implement it in Librosa like this:

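import librosa
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
# librosa's PCEN docs use a magnitude spectrogram (power=1), scaled by 2**31 for the default parameters
S = librosa.feature.melspectrogram(y=y, sr=sr, power=1)
S_pcen = librosa.pcen(S * (2**31), sr=sr)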

Although, like I mentioned, there are parameters within pcen that need adjusting! See Librosa's documentation on PCEN to get started if you are interested.
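
To give a feel for what those parameters do, below is a simplified NumPy sketch of the PCEN formula: a first-order smoother along time, adaptive gain control, then root compression. It is not Librosa's implementation. `pcen_sketch` and the smoothing coefficient `s` are hypothetical (Librosa derives its smoothing coefficient from `time_constant`, `sr`, and `hop_length`), and the input is random data rather than a real spectrogram.

```python
import numpy as np

def pcen_sketch(E, s=0.025, gain=0.98, bias=2.0, power=0.5, eps=1e-6):
    """Simplified PCEN on a (n_bands, n_frames) magnitude spectrogram."""
    E = np.asarray(E, dtype=float)
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]  # initialize the smoother with the first frame
    for t in range(1, E.shape[1]):
        # First-order IIR low-pass filter along the time axis
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * E[:, t]
    # Adaptive gain control, then dynamic range compression
    return (E / (eps + M) ** gain + bias) ** power - bias ** power

# Random magnitudes standing in for a mel-spectrogram (assumption)
rng = np.random.default_rng(0)
E = rng.uniform(0.1, 1.0, size=(40, 100))
out = pcen_sketch(E)
```

In this sketch, `gain` controls how strongly each band is divided by its own smoothed energy, `s` sets how quickly that smoothed energy tracks the signal, and `bias` and `power` shape the compression. Those are the kinds of settings that usually need tuning per dataset.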

Answer 1: 1, 2019-04-05 21:11:10

Answer 2: 1, accepted, 2019-06-23 21:39:30
