I've been using this script:
import torchaudio
spgram = torchaudio.transforms.Spectrogram(512, hop_length=32)
audio = spgram(audio)
to get the spectrogram of some stereo music audio. I expected the resulting spectrogram to have the shape [2, 257, audio.shape[1]/32]. However, that's not the case. For example, an audio clip with size [2, 199488] (with sr=24576) yields a spectrogram with size [2, 257, 6241] (note that 199488/32=6234). Why is that? And how can I convert from frame location to sample location?
See the center parameter: "whether to pad waveform on both sides so that the t-th frame is centered at time t × hop_length. (Default: True)"
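So, by default, the signal is padded on both sides before it is split into frames, which is why the result is longer than audio.shape[1] / hop_length (by 7 frames in your case). In current PyTorch, torch.stft with center=True pads n_fft // 2 samples on each side (using reflection by default, not zeros) and returns 1 + n_samples // hop_length frames. That formula predicts only one extra frame, so if you see more, check the installed version and the actual length of the tensor you pass in. As for converting frame location to sample location: with center=True, frame t is centered at sample t × hop_length of the original (unpadded) signal.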
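The frame arithmetic can be sketched in plain Python. This is only an illustration based on the frame-count formula given in the torch.stft documentation; the helper names num_frames and frame_to_sample are made up for this example and are not part of torch or torchaudio, and the numbers in the usage line are illustrative values.

```python
def num_frames(n_samples: int, hop_length: int, n_fft: int, center: bool = True) -> int:
    """Number of STFT frames, following the formula in the torch.stft docs."""
    if center:
        # The signal is padded by n_fft // 2 on each side, so every hop
        # position from 0 up to n_samples gets a frame.
        return 1 + n_samples // hop_length
    # Without padding, only windows that fit entirely in the signal count.
    return 1 + (n_samples - n_fft) // hop_length


def frame_to_sample(t: int, hop_length: int, n_fft: int, center: bool = True) -> int:
    """Index, in the original unpadded signal, of the sample at the center of frame t."""
    if center:
        return t * hop_length
    # Without centering, frame t starts at t * hop_length, so its middle
    # sits half a window later.
    return t * hop_length + n_fft // 2


# Illustrative values: 128000 samples, hop of 160, n_fft of 400.
print(num_frames(128000, hop_length=160, n_fft=400))  # 801
print(frame_to_sample(10, hop_length=160, n_fft=400))  # 1600
```

With center=True the last frame is centered on the very end of the signal, so roughly half of its window is padding rather than real audio.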
Thanks for your answers. If I have a signal x of size [1, 128000], I expect 800 frames, but torch.stft(x).size() = [1, 201, 801, 2]. I want to align the frames of torch.stft(x) to 800 frames. Can I drop the last frame and keep only the first 800?