
Returnn Switchboard data processing

Could anybody give me pointers on how to process the Switchboard dataset for training with RETURNN? I did see the BlissDataset class, which seems to be designed for Switchboard, but it's not clear to me what I should include in the paths given in the example:

Example:
    ./tools/dump-dataset.py "
      {'class':'BlissDataset',
       'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
       'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
       'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"

The Switchboard dataset has several folders with audio, i.e. swb1_d2/data/*.sph, and transcripts in swb1_LDC97S62/swb_ms98_transcriptions/**/*. I'm not quite sure how to proceed with this to get a dataset that can be used to train RETURNN.

At our group (RWTH Aachen University), we use the config as it was published on GitHub. As you see, this one uses ExternSprintDataset. That dataset implementation uses Sprint (publicly called RWTH ASR (RASR), see here) as an external tool (run in a subprocess) to handle the data (feature extraction, etc.). Sprint gets a Bliss XML file which describes all the segments, with paths to the audio, audio offsets, and transcriptions, and it also gets further configs for the feature extraction and maybe other things. There is an open source version of RASR which should work, but it might be a bit involved to get it running.
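For reference, a Bliss corpus XML roughly looks like the following. This is a simplified, hypothetical sketch: the recording name, audio path, segment timings, and transcription are all made up, and real corpora usually also include speaker definitions.

```xml
<?xml version="1.0" encoding="utf-8"?>
<corpus name="switchboard-1">
  <!-- one <recording> per audio file; path and timings here are invented -->
  <recording name="sw02001" audio="/path/to/swb1_d1/data/sw02001.sph">
    <!-- segments carry start/end offsets (in seconds) and the transcription -->
    <segment name="sw02001-A_000098-001156" start="0.98" end="11.56">
      <orth>hi um how are you doing</orth>
    </segment>
  </recording>
</corpus>
```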

The BlissDataset was planned to be a simpler replacement for that. However, the implementation is incomplete. Also, you would still need to generate the Bliss XML yourself in some way (we have used some of our own internal scripts to prepare that based on the official LDC data).

So, unfortunately, there is no simple way yet. Actually, I think the easiest way would be to come up with yet another custom format, which might be similar to the LibriSpeechDataset implementation, or maybe just the same, and then you could just reuse LibriSpeechDataset, or at least parts of it. That dataset implementation takes the data in a zip format which contains the transcripts in txt files and the audio in ogg or wav files. It uses librosa to do MFCC feature extraction (or other feature types as well). I planned to implement that for Switchboard and then reproduce the results, but I have not had time yet and am not sure when I will get to it. But if you want to try that on your own, I will be happy to help you however I can. The starting point would be to look at LibriSpeechDataset and understand what its format looks like.
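If you go the custom-format route, the packaging step could be sketched like this. This is only an illustration under assumptions: the exact zip layout that LibriSpeechDataset expects should be checked against its implementation, and the segment name, sample data, and file naming scheme here are made up. It just shows the general idea of pairing wav audio with txt transcripts inside a zip, using only the standard library.

```python
import io
import wave
import zipfile

def pack_corpus(zip_path, utterances):
    """Pack (segment_id, pcm_bytes, transcript) tuples into a zip.

    Layout is hypothetical (one .wav plus one .txt per segment);
    check the LibriSpeechDataset code for the structure it really expects.
    """
    with zipfile.ZipFile(zip_path, "w") as zf:
        for seg_id, pcm, text in utterances:
            buf = io.BytesIO()
            with wave.open(buf, "wb") as w:
                w.setnchannels(1)     # Switchboard sides are usually split per speaker
                w.setsampwidth(2)     # 16-bit PCM
                w.setframerate(8000)  # Switchboard is 8 kHz telephone speech
                w.writeframes(pcm)
            zf.writestr(seg_id + ".wav", buf.getvalue())
            zf.writestr(seg_id + ".txt", text)

# Example with one short segment of silence and a made-up transcript:
pack_corpus("train.zip", [("sw02001-A_000098-001156", b"\x00\x00" * 8000, "hi um yeah")])
```

From there, the remaining work would be cutting the sph recordings into per-segment PCM according to the transcription timestamps before packing them.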

