How do I know which spectrogram frames belong to which audio samples?

Question

I've been using this script:

import torchaudio
spgram = torchaudio.transforms.Spectrogram(n_fft=512, hop_length=32)
spec = spgram(audio)  # audio: waveform tensor of shape [channels, samples]

to get the spectrogram of some stereo music audio. I expected the resulting spectrogram to have the shape [2, 257, audio.shape[1]/32]. However, that's not the case. For example, an audio clip of size [2, 199488] (with sr=24576) yields a spectrogram of size [2, 257, 6241] (note that 199488/32 = 6234). Why is that, and how can I convert from a frame location to a sample location?

Answer 1

See the center parameter:

center: whether to pad waveform on both sides so that the t-th frame is centered at time t × hop_length. (Default: True)

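So, by default, the signal is padded on both sides before it is framed. In torch.stft (which Spectrogram wraps) the padding is n_fft // 2 samples per side, and the default pad_mode is reflect rather than zeros. With an even n_fft this gives 1 + floor(num_samples / hop_length) frames, and frame t is centered at sample t * hop_length, which is the frame-to-sample mapping you asked about. Note that this predicts 6235 frames for a 199488-sample clip rather than the 6241 you report, so it is worth double-checking the clip length and the arguments actually passed (for example pad).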
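To make the frame arithmetic concrete, here is a minimal pure-Python sketch of the framing rule, assuming the center=True behavior of torch.stft (padding of n_fft // 2 samples per side). The helper names num_frames and frame_to_samples are made up for illustration and are not part of torchaudio.

```python
def num_frames(num_samples, n_fft, hop_length, center=True):
    """Number of STFT frames for a signal of num_samples samples."""
    if center:
        # Assumption: center=True pads n_fft // 2 samples on each side.
        num_samples += 2 * (n_fft // 2)
    return 1 + (num_samples - n_fft) // hop_length


def frame_to_samples(t, n_fft, hop_length, center=True):
    """Half-open range [start, stop) of original sample indices under frame t.

    With center=True the range can start below 0 or end past the signal,
    which means that part of the window sits on padding.
    """
    start = t * hop_length - (n_fft // 2 if center else 0)
    return start, start + n_fft


# Values taken from this thread.
print(num_frames(199488, n_fft=512, hop_length=32))   # clip from the question
print(num_frames(128000, n_fft=400, hop_length=160))  # signal from the follow-up
print(frame_to_samples(10, n_fft=512, hop_length=32))
```

With center=True the middle of frame t is simply t * hop_length; with center=False it shifts to t * hop_length + n_fft // 2.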

Answer 2

Thanks for your answers. I have a signal x of size [1, 128000], which should correspond to 800 frames, but torch.stft(x).size() is [1, 201, 801, 2]. I want to align the output of torch.stft(x) to 800 frames. Can I drop the last frame and keep only the first 800?
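A rough sketch of what dropping the last frame would discard, assuming center=True and hop_length=160 (inferred from 128000 / 800; the actual arguments are not shown in the post):

```python
num_samples = 128000
hop_length = 160  # assumption: 128000 samples / 800 expected frames

# With center=True, frame t is centered at sample t * hop_length, so there
# is one frame per hop boundary from 0 up to and including num_samples.
centers = list(range(0, num_samples + 1, hop_length))

kept = centers[:800]     # the first 800 frames
dropped = centers[800:]  # whatever falls off the end

print(len(centers), kept[-1], dropped)
```

Under these assumptions the single dropped frame is centered at sample 128000, one past the end of the signal, so half of its window covers padding. For a tensor shaped [1, 201, 801, 2], the frame axis is the third one, so slicing with [:, :, :800] would keep the first 800 frames.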

solution1
0 2021-10-04 20:30:12

solution2
-2 2023-01-15 03:52:35
