
Slicing audio signal to detect pitch

I am using Librosa to transcribe monophonic guitar audio signals.

I thought it would be a good start to "slice" the signal at the onset times, so that note changes are detected at the correct time.

Librosa provides a function that detects the local minima before the onset times. I checked those timings and they are correct.

Here is the waveform of the original signal and the times of the minima (in samples):

[ 266240  552960  840704 1161728 1427968 1735680 1994752]

[waveform plot]

The melody played is E4, F4, F#4 ..., B4.

Therefore the results should ideally be: 330Hz, 350Hz, ..., 493Hz (approximately).
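
(For reference, the expected fundamentals can be computed with librosa.note_to_hz; this is just a sanity check, assuming equal temperament with A4 = 440 Hz.)

  import librosa

  # Expected fundamentals for the chromatic run E4 ... B4 (A4 = 440 Hz).
  for note in ['E4', 'F4', 'F#4', 'G4', 'G#4', 'A4', 'A#4', 'B4']:
    print(note, round(float(librosa.note_to_hz(note)), 1))
  # ~329.6, 349.2, 370.0, 392.0, 415.3, 440.0, 466.2, 493.9 Hz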

As you can see, the times in the minima array represent the moment just before each note is played.

However, on a sliced signal (10-12 seconds long, with only one note per slice), my frequency detection methods give really poor results. I am confused because I can't see any bugs in my code:

  import librosa
  import numpy as np

  y, sr = librosa.load(filename, sr=40000)

  # Detect onsets and backtrack each one to the preceding energy minimum.
  onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
  oenv = librosa.onset.onset_strength(y=y, sr=sr)
  onset_bt = librosa.onset.onset_backtrack(onset_frames, oenv)

  # Convert those times from frames to samples.
  new_onset_bt = librosa.frames_to_samples(onset_bt)

  # Split the signal into one slice per note.
  slices = np.split(y, new_onset_bt[1:])
  for s in slices:
    print(freq_from_hps(s, sr))
    print(freq_from_autocorr(s, sr))
    print(freq_from_fft(s, sr))

Where the freq_from functions are taken directly from here.
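
(For context, freq_from_autocorr is essentially an autocorrelation peak picker. A minimal sketch of that idea, written by me and not the linked code itself, would look roughly like this:)

  import numpy as np
  from scipy.signal import fftconvolve

  def f0_from_autocorr_sketch(sig, fs):
    # Illustrative sketch only: autocorrelation computed via FFT convolution,
    # then the strongest peak after the zero-lag peak gives the period.
    sig = sig - np.mean(sig)
    corr = fftconvolve(sig, sig[::-1], mode='full')
    corr = corr[len(corr) // 2:]              # keep non-negative lags
    rising = np.nonzero(np.diff(corr) > 0)[0]  # skip past the zero-lag peak
    if len(rising) == 0:
      return 0.0
    peak = np.argmax(corr[rising[0]:]) + rising[0]
    return fs / peak if peak > 0 else 0.0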

I would assume this is just bad precision from the methods, but I get some crazy results. Specifically, freq_from_hps returns:

1.33818658287
1.2078047577
0.802142642257
0.531096911977
0.987532329094
0.559638134414
0.953497587952
0.628980979055

These values are supposed to be the 8 pitches of the 8 corresponding slices (in Hz!).

freq_from_fft returns similar values, whereas freq_from_autocorr returns some more "normal" values but also some random values near 10000 Hz:

242.748000585
10650.0394232
275.25299319
145.552578747
154.725859019
7828.70876515
174.180627765
183.731497068

This is the spectrogram of the whole signal:

[spectrogram of the whole signal]

And this is, for example, the spectrogram of slice 1 (the E4 note):

[spectrogram of slice 1]

As you can see, the slicing has been done correctly. There are several issues, though. First, there is an octave issue in the spectrogram, which I was expecting. But the results I get from the three methods mentioned above are just very weird.

Is this an issue with my signal processing understanding or my code?

Is this an issue with my signal processing understanding or my code?

Your code looks fine to me.

The frequencies you want to detect are the fundamental frequencies of your pitches (the problem is also known as "f0 estimation").

So before using something like freq_from_fft, I'd bandpass filter the signal to get rid of garbage transients and low-frequency noise: the stuff that's in the signal but irrelevant to your problem.

Think about which range your fundamental frequencies are going to be in. For an acoustic guitar, that's E2 (82 Hz) to F6 (1,397 Hz). That means you can get rid of anything below ~80 Hz and above ~1,400 Hz (for a bandpass example, see here). After filtering, do your peak detection to find the pitches (assuming the fundamental actually has the most energy).
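
A minimal sketch of such a filter using SciPy (the 80-1400 Hz cutoffs just mirror the range above; the bandpass helper is my own, not taken from the linked example):

  from scipy.signal import butter, sosfiltfilt

  def bandpass(sig, fs, low=80.0, high=1400.0, order=4):
    # Butterworth bandpass keeping only the plausible guitar f0 range.
    sos = butter(order, [low, high], btype='bandpass', fs=fs, output='sos')
    return sosfiltfilt(sos, sig)

  # e.g. filtered_slices = [bandpass(s, sr) for s in slices]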

Another strategy might be to ignore the first X samples of each slice, as they tend to be percussive rather than harmonic in nature and won't give you much information anyway. So, for each slice, just look at the last ~90% of the samples, as sketched below.
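
For example, keeping only the last ~90% of each slice (the 10% cut is just an illustrative choice):

  # Drop the (mostly percussive) attack portion of each slice.
  trimmed_slices = [s[len(s) // 10:] for s in slices]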

That all said, there is a large body of work on f0 (fundamental frequency) estimation. ISMIR papers are a good starting point.

Last but not least, Librosa's piptrack function may do just what you want.
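
A rough sketch of how you might reduce piptrack's per-frame output to a single pitch per slice (the slice_pitch wrapper and the median reduction are my own illustrative choices, not part of Librosa):

  import numpy as np
  import librosa

  def slice_pitch(sig, fs):
    # For each frame, take the frequency of the strongest bin reported by
    # piptrack, then take the median over voiced frames as the slice pitch.
    pitches, mags = librosa.piptrack(y=sig, sr=fs, fmin=80.0, fmax=1400.0)
    best = pitches[mags.argmax(axis=0), np.arange(pitches.shape[1])]
    voiced = best[best > 0]
    return float(np.median(voiced)) if voiced.size else 0.0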
