
How can I find the audio format of the selected voice of the SpeechSynthesizer

In a C# text-to-speech application I use the SpeechSynthesizer class, which has an event named SpeakProgress that is fired for every spoken word. But for some voices the parameter e.AudioPosition is not synchronized with the output audio stream, and the output wave file is played faster than this position indicates (see this related question).

Anyway, I am trying to find exact information about the bit rate and other format details of the selected voice. In my experience, if I can initialize the wave file with this information, the synchronization problem is resolved. However, if I can't find such information in SupportedAudioFormats, I know of no other way to find it. For example, the "Microsoft David Desktop" voice provides no supported format in its VoiceInfo, but it seems to support a PCM 16000 Hz, 16-bit format.

How can I find the audio format of the selected voice of the SpeechSynthesizer?

 var formats = CurVoice.VoiceInfo.SupportedAudioFormats;

 if (formats.Count > 0)
 {
     var format = formats[0];
     reader.SetOutputToWaveFile(CurAudioFile, format);
 }
 else
 {
     var format = // How can I find it, if the voice hasn't provided it?
     reader.SetOutputToWaveFile(CurAudioFile, format);
 }
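Until the voice itself reveals its format, one stopgap is to fall back to an explicitly constructed SpeechAudioFormatInfo when the voice reports nothing. This is only a sketch: the 16 kHz / 16-bit / mono values are what "Microsoft David Desktop" appears to use empirically, not anything the API guarantees, and the output file name is illustrative.

```csharp
using System.Speech.AudioFormat;
using System.Speech.Synthesis;

// Fallback sketch: if the voice registers no supported formats, assume the
// format observed empirically (PCM 16 kHz, 16-bit, mono for "Microsoft David
// Desktop"). This assumption may be wrong for other voices.
var reader = new SpeechSynthesizer();
var formats = reader.Voice.SupportedAudioFormats;
var format = formats.Count > 0
    ? formats[0]
    : new SpeechAudioFormatInfo(16000, AudioBitsPerSample.Sixteen, AudioChannel.Mono);
reader.SetOutputToWaveFile("output.wav", format);
```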

Update: This answer has been edited following investigation. Initially I was suggesting from memory that SupportedAudioFormats is likely populated only from (possibly misconfigured) registry data; investigation has shown that, for me on Windows 7, this is definitely the case, and it is backed up anecdotally on Windows 8.

Issues with SupportedAudioFormats

System.Speech wraps the venerable COM Speech API (SAPI), and some voices are 32- vs 64-bit, or can be misconfigured (on a 64-bit machine's registry, HKLM/Software/Microsoft/Speech/Voices vs HKLM/Software/Wow6432Node/Microsoft/Speech/Voices).

I've pointed ILSpy at System.Speech and its VoiceInfo class, and I'm pretty convinced that SupportedAudioFormats comes solely from registry data. Hence it's possible to get zero results back when enumerating SupportedAudioFormats if either your TTS engine isn't properly registered for your application's platform target (x86, Any CPU, or 64-bit), or the vendor simply doesn't provide this information in the registry.

Voices may still support different, additional, or fewer formats, as that's up to the speech engine (code) rather than the registry (data). So it can be a shot in the dark. Standard Windows voices are often more consistent in this regard than third-party voices, but they still don't necessarily provide a useful SupportedAudioFormats.
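Since SupportedAudioFormats appears to be purely registry data, one way to sanity-check a voice is to read that registry value directly. A minimal sketch, using the 32-bit Microsoft Anna token path from the test later in this answer (other voices live under different token names):

```csharp
using System;
using Microsoft.Win32;

// Read the AudioFormats attribute that VoiceInfo.SupportedAudioFormats is
// built from. Token path is for 32-bit Microsoft Anna on a 64-bit machine.
const string tokenPath =
    @"SOFTWARE\Wow6432Node\Microsoft\Speech\Voices\Tokens\MS-Anna-1033-20-Dsk\Attributes";
using (var key = Registry.LocalMachine.OpenSubKey(tokenPath))
{
    // A missing key or value is exactly the case in which
    // SupportedAudioFormats enumerates empty.
    Console.WriteLine(key?.GetValue("AudioFormats") ?? "(no AudioFormats registered)");
}
```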

Finding this Information the Hard Way

I've found it's still possible to get the current format of the current voice, but this relies on reflection to access the internals of the System.Speech SAPI wrappers.

Consequently this is quite fragile code, and I wouldn't recommend using it in production.

Note: the code below requires you to have called Speak() once for setup; without a Speak() call, more reflection calls would be needed to force setup. However, calling Speak("") to say nothing works just fine.

Implementation:

[StructLayout(LayoutKind.Sequential)]
struct WAVEFORMATEX
{
    public ushort wFormatTag;
    public ushort nChannels;
    public uint nSamplesPerSec;
    public uint nAvgBytesPerSec;
    public ushort nBlockAlign;
    public ushort wBitsPerSample;
    public ushort cbSize;
}

WAVEFORMATEX GetCurrentWaveFormat(SpeechSynthesizer synthesizer)
{
    // Grab the internal VoiceSynthesizer instance behind the public synthesizer.
    var voiceSynthesis = synthesizer.GetType()
                                    .GetProperty("VoiceSynthesizer", BindingFlags.Instance | BindingFlags.NonPublic)
                                    .GetValue(synthesizer, null);

    // Ask it for the currently selected TTS voice object.
    var ttsVoice = voiceSynthesis.GetType()
                                 .GetMethod("CurrentVoice", BindingFlags.Instance | BindingFlags.NonPublic)
                                 .Invoke(voiceSynthesis, new object[] { false });

    // The voice caches its wave format as a raw byte[] in a private field.
    var waveFormat = (byte[])ttsVoice.GetType()
                                     .GetField("_waveFormat", BindingFlags.Instance | BindingFlags.NonPublic)
                                     .GetValue(ttsVoice);

    // Marshal the raw bytes into a WAVEFORMATEX structure.
    var pin = GCHandle.Alloc(waveFormat, GCHandleType.Pinned);
    var format = (WAVEFORMATEX)Marshal.PtrToStructure(pin.AddrOfPinnedObject(), typeof(WAVEFORMATEX));
    pin.Free();

    return format;
}

Usage:

SpeechSynthesizer s = new SpeechSynthesizer();
s.Speak("Hello");
var format = GetCurrentWaveFormat(s);
Debug.WriteLine($"{s.Voice.SupportedAudioFormats.Count} formats are claimed as supported.");
Debug.WriteLine($"Actual format: {format.nChannels} channel {format.nSamplesPerSec} Hz {format.wBitsPerSample} audio");

To test it, I renamed Microsoft Anna's AudioFormats registry key under HKLM/Software/Wow6432Node/Microsoft/Speech/Voices/Tokens/MS-Anna-1033-20-Dsk/Attributes, causing SpeechSynthesizer.Voice.SupportedAudioFormats to have no elements when queried. The output in this situation is:

0 formats are claimed as supported.
Actual format: 1 channel 16000 Hz 16 audio

You can't get this information from code. You can only listen to all formats (from a poor format like 8 kHz up to a high-quality format like 48 kHz) and observe where it stops getting better, which I think is what you did.
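The listening approach described above can be sketched as follows: render the same utterance at several candidate PCM rates and compare the resulting files by ear (the file names and test sentence are illustrative):

```csharp
using System.Speech.AudioFormat;
using System.Speech.Synthesis;

// Probe sketch: render one utterance at each candidate rate; quality stops
// improving once the requested rate exceeds the voice's native rate.
var rates = new[] { 8000, 11025, 16000, 22050, 44100, 48000 };
using (var synth = new SpeechSynthesizer())
{
    foreach (var rate in rates)
    {
        var fmt = new SpeechAudioFormatInfo(rate, AudioBitsPerSample.Sixteen, AudioChannel.Mono);
        synth.SetOutputToWaveFile($"probe_{rate}.wav", fmt);
        synth.Speak("The quick brown fox jumps over the lazy dog.");
    }
    synth.SetOutputToNull();
}
```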

Internally, the speech engine "asks" the voice for its original audio format only once. I believe this value is used only internally by the speech engine, and the engine does not expose it in any way.

For further information:

Let's say you are a voice company. You have recorded your computer voice at 16 kHz, 16 bit, mono.

The user can let your voice speak at 48 kHz, 32 bit, stereo; the speech engine does this conversion. It does not care whether the result really sounds better, it simply performs the format conversion.

Let's say the user wants your voice to speak something and requests that the file be saved as 48 kHz, 16 bit, stereo.

SAPI / System.Speech calls your voice with this method:

STDMETHODIMP SpeechEngine::GetOutputFormat(const GUID * pTargetFormatId, const WAVEFORMATEX * pTargetWaveFormatEx,
                                           GUID * pDesiredFormatId, WAVEFORMATEX ** ppCoMemDesiredWaveFormatEx)
{
    HRESULT hr = S_OK;

    // Here we need to return the format of the audio data that we will pass
    // to the speech engine. Our format (16 kHz, 16 bit, mono) will be
    // converted by the SAPI engine to the format that the user requested.

    // Tell the speech engine which format our data has, so it knows whether
    // to upsample or downsample our voice data to match the requested format.
    enum SPSTREAMFORMAT sample_rate_at_which_this_voice_was_recorded = SPSF_16kHz16BitMono;

    hr = SpConvertStreamFormatEnum(sample_rate_at_which_this_voice_was_recorded, pDesiredFormatId, ppCoMemDesiredWaveFormatEx);

    return hr;
}

This is the only place where you have to "reveal" what the recorded format of your voice is.

The "available formats" rather tell you which conversions your sound card / Windows can do.

I hope I explained it well. As a voice vendor, you don't support any formats; you just tell the speech engine what format your audio data is in, so that it can do any further conversions.
