
Librosa pitch tracking - STFT

I am using this algorithm to detect the pitch of this audio file. As you can hear, it is an E2 note played on a guitar with a bit of noise in the background.

I generated this spectrogram using STFT: [image: spectrogram]

And I am using the algorithm linked above like this:

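import sys
import librosa
import numpy as np

y, sr = librosa.load(filename, sr=40000)
pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)

# print the full array; newer NumPy rejects threshold=np.nan
np.set_printoptions(threshold=sys.maxsize)
print(pitches[np.nonzero(pitches)])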

As a result, I am getting pretty much every possible frequency between my fmin and fmax. What do I have to do with the output of the piptrack method to discover the fundamental frequency of a time frame?

UPDATE

I am still not sure what those 2D arrays represent, though. Let's say I want to find out how strong 82 Hz is in frame 5. I could do that using the STFT function, which simply returns a 2D matrix (the one used to plot the spectrogram).

However, piptrack does something additional which could be useful, and I don't really understand what. According to the documentation, pitches[f, t] contains instantaneous frequency at bin f, time t. Does that mean that, if I want to find the strongest frequency at time frame t, I have to:

  1. Go to the magnitudes[:, t] array and find the bin with the maximum magnitude.
  2. Assign the bin to a variable f.
  3. Look up pitches[f, t] to find the frequency that belongs to that bin?

Pitch detection is a tricky topic and is often counter-intuitive. I'm not wild about the way the source code is documented for this particular function: it almost seems like the developer is confusing a 'harmonic' with a 'pitch'.

When a single note (a 'pitch') is played on a guitar or piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different, mathematically related frequencies, called harmonics. Typical pitch tracking techniques include searching the results of an FFT for magnitudes in the bins that correspond to the expected frequencies of harmonics.

For instance, if we press the Middle C key on the piano, the individual frequencies of the composite's harmonics start at 261.6 Hz as the fundamental frequency, 523 Hz is the 2nd harmonic, 785 Hz is the 3rd harmonic, 1046 Hz is the 4th harmonic, and so on. The later harmonics are integer multiples of the fundamental frequency, 261.6 Hz (e.g. 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046). However, the frequencies of musical notes are logarithmically spaced, whereas the FFT uses linearly spaced bins. As a result, the FFT's frequency resolution is often too coarse to separate neighbouring notes at the lower frequencies.
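To put rough numbers on the spacing problem, here is a minimal sketch that compares the FFT bin width with the gap between neighbouring notes. The sample rate of 40000 Hz comes from the question; the FFT size of 2048 and the helper `semitone_gap` are assumptions made for illustration only.

```python
import numpy as np

# Harmonics of Middle C (C4) are integer multiples of the fundamental.
f0 = 261.6
harmonics = f0 * np.arange(1, 5)  # 261.6, 523.2, 784.8, 1046.4 Hz

# FFT bins are linearly spaced: each bin is sr / n_fft Hz wide.
sr = 40000     # sample rate used in the question
n_fft = 2048   # assumed FFT size, for illustration
bin_width = sr / n_fft

def semitone_gap(freq_hz):
    """Distance in Hz from freq_hz to the note one semitone above it."""
    return freq_hz * (2 ** (1 / 12) - 1)

gap_e2 = semitone_gap(82.41)    # gap around E2, the note in the question
gap_e5 = semitone_gap(659.26)   # gap around E5, three octaves higher

print("harmonics of C4:", harmonics)
print(f"FFT bin width:   {bin_width:.2f} Hz")
print(f"semitone at E2:  {gap_e2:.2f} Hz")
print(f"semitone at E5:  {gap_e5:.2f} Hz")
```

Under these assumptions a single bin is wider than the distance between E2 and F2, so adjacent low notes can land in the same bin, while three octaves up a semitone spans about two bins.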

For that reason, when I wrote a pitch detecting application (PitchScope Player), I chose to create a logarithmically spaced DFT, rather than an FFT, so I could focus on the precise frequencies of interest for music (see the attached diagram of my custom DFT from 3 seconds of a guitar solo). If you are serious about pursuing pitch detection, you should consider reading more about the topic, looking at other sample code (mine is linked below), and writing your own functions to measure frequency.
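As a rough illustration of the idea, and not the PitchScope Player implementation, a DFT can be evaluated directly at one frequency per semitone instead of at linearly spaced bins. The function name `log_spaced_dft`, its parameters, and the synthetic test tone below are all hypothetical.

```python
import numpy as np

def log_spaced_dft(frame, sr, f_min=82.41, n_notes=48):
    """Magnitude of `frame` at one frequency per semitone, starting at f_min."""
    # Target frequencies: equal-tempered semitones, i.e. logarithmically spaced
    freqs = f_min * 2.0 ** (np.arange(n_notes) / 12.0)
    t = np.arange(len(frame)) / sr
    window = np.hanning(len(frame))
    # One complex sinusoid per target frequency, shape (n_notes, len(frame))
    basis = np.exp(-2j * np.pi * freqs[:, None] * t[None, :])
    # Correlate the windowed frame with each sinusoid and normalise
    mags = np.abs(basis @ (frame * window)) / window.sum()
    return freqs, mags

# Synthetic stand-in for the recording: 0.2 s of a pure tone at E2
sr = 40000
t = np.arange(int(0.2 * sr)) / sr
tone = np.sin(2 * np.pi * 82.41 * t)

freqs, mags = log_spaced_dft(tone, sr)
strongest = freqs[np.argmax(mags)]
print(f"strongest analysis frequency: {strongest:.2f} Hz")
```

The cost is one dot product per analysed note rather than a single fast transform, which is the trade-off for placing the analysis frequencies exactly on the notes of interest. A real guitar note also contains harmonics, so the strongest analysis frequency is not guaranteed to be the fundamental.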

https://en.wikipedia.org/wiki/Transcription_(music)#Pitch_detection

https://github.com/CreativeDetectors/PitchScope_Player


It turns out that the way to pick the pitch at a certain frame t is simple:

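import librosa

def detect_pitch(y, sr, t):
    pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
    # bin with the largest magnitude in frame t
    index = magnitudes[:, t].argmax()
    # frequency (Hz) estimated for that bin
    pitch = pitches[index, t]

    return pitch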

This first gets the bin of the strongest frequency by looking at the magnitudes array, and then reads the pitch at pitches[index, t].

To find the pitch of the whole audio segment:

def detect_pitch(y, sr):
    pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr, fmin=75, fmax=1600)
    # get indexes of the maximum value in each time slice
    max_indexes = np.argmax(magnitudes, axis=0)
    # get the pitches of the max indexes per time slice
    pitches = pitches[max_indexes, range(magnitudes.shape[1])]
    return pitches
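The function above returns one value per frame. One possible way to reduce that to a single estimate for the segment is to take the median of the non-zero frames, treating zeros as frames where no pitch was found, and map the result to the nearest note name. This is only a sketch: `summarize_pitch` and `hz_to_note_name` are hypothetical helpers, and the per-frame values are made up for illustration.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz_to_note_name(freq_hz):
    """Nearest equal-tempered note name (A4 = 440 Hz) for a frequency in Hz."""
    midi = int(round(69 + 12 * np.log2(freq_hz / 440.0)))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

def summarize_pitch(frame_pitches):
    """Median of the non-zero per-frame pitches, or None if all are zero."""
    voiced = np.asarray(frame_pitches, dtype=float)
    voiced = voiced[voiced > 0]  # assumption: 0 means no pitch in that frame
    if voiced.size == 0:
        return None
    return float(np.median(voiced))

# Made-up per-frame values standing in for the output of detect_pitch(y, sr)
frame_pitches = [0.0, 82.1, 82.5, 164.9, 82.4, 0.0, 82.3]

estimate = summarize_pitch(frame_pitches)
print(estimate, hz_to_note_name(estimate))
```

The median is used here because a single frame that locks onto a harmonic (164.9 Hz in the made-up data) would pull a mean upward, but it does not help if most frames pick a harmonic instead of the fundamental.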
