"Error tokenizing data" when training a dataset with DeepSpeech

Question

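I created a speech dataset for DeepSpeech training by following this tutorial ( https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf ).

However, I cannot train on my dataset with DeepSpeech.

The training command fails, for example: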

python DeepSpeech.py --train_files /mnt/c/wsl/teneke_out_bolum1/

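It throws an error:

pandas.errors.ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

I created the dataset after forced alignment with aeneas and fine-tuning with finetuneas.

This is the code I use for DeepSpeech training on Google Colab: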

https://gist.github.com/mustafaxfe/d20be114ca7cea5c47ea5cc85653c761

I found some suggested solutions on Google, such as

data = pd.read_csv('file1.csv', error_bad_lines=False)

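The error output also suggests that I could fix it by setting

engine='python'

However, I don't know where I should change this.

So, where should I edit to fix this issue?

Thanks.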

Answer 1

Your command needs to be revisited:

You are pointing to the training data folder. You should point to a .csv file.
Use Python 3.
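
A ParserError like this can appear when the path handed to the CSV reader is not a readable file, which is consistent with passing a folder. Before launching training, you could check the path with a small script like the one below. It is only a sketch: the function name check_train_file is made up for illustration, and the column names (wav_filename, wav_filesize, transcript) are what I assume your DeepSpeech version expects, so compare them with the documentation.

```python
import csv
import os

# Assumed column names for a DeepSpeech training CSV; verify them
# against the documentation of the version you are running.
EXPECTED_COLUMNS = ["wav_filename", "wav_filesize", "transcript"]


def check_train_file(path):
    """Return a list of problems found with a --train_files argument."""
    problems = []
    if os.path.isdir(path):
        # This is the situation in the question: a folder, not a CSV file.
        problems.append(f"{path} is a directory; pass the .csv file inside it")
        return problems
    if not os.path.isfile(path):
        problems.append(f"{path} does not exist")
        return problems
    # Read only the first row and compare it with the expected header.
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        problems.append(f"{path} is missing columns: {', '.join(missing)}")
    return problems
```

Calling check_train_file("/mnt/c/wsl/teneke_out_bolum1/") on the folder from the question should report that the path is a directory. An empty list means the basic checks passed.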

Your run command should look like the following. Check the documentation and adapt it to your needs.

    python3 -u DeepSpeech.py \
    --train_files /data/phonetic_speech_dta/train/train.csv \
    --dev_files /data/phonetic_speech_dta/dev/dev.csv \
    --test_files /data/phonetic_speech_dta/test/test.csv \
    --train_batch_size 64 \
    --dev_batch_size 32 \
    --test_batch_size 64 \
    --n_hidden 800 \
    --validation_step 1 \
    --display_step 1 \
    --epoch 100 \
    --log_level 1 \
    --dropout_rate 0.2 \
    --learning_rate 0.001 \
    --drop_count_weight 3.5 \
    --export_dir /speech2text/norwegian_model/results/model_export/ \
    --checkpoint_dir /speech2text/norwegian_model/results/checkpoint/ \
    --decoder_library_path /home/nvidia/tensorflow/bazel-bin/native_client/libctc_decoder_with_kenlm.so \
    --alphabet_config_path /speech2text/norwegian_model/alphabet.txt \
    --lm_binary_path /speech2text/norwegian_model/lm.binary \
    --lm_trie_path /speech2text/norwegian_model/trie
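
The command above expects separate train, dev and test CSV files. If your alignment step left you with a single CSV, one way to get three files is to shuffle the rows and split them. The sketch below uses only the standard library. The function names, the 10% dev and 10% test fractions and the file names in the usage note are illustrative assumptions, not something DeepSpeech requires.

```python
import csv
import random


def split_rows(rows, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle rows and split them into train, dev and test lists."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


def write_csv(path, header, rows):
    """Write rows to path, preceded by the given header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def split_csv(source, train_path, dev_path, test_path):
    """Read one CSV and write three CSVs that share its header."""
    with open(source, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = list(reader)
    train, dev, test = split_rows(rows)
    write_csv(train_path, header, train)
    write_csv(dev_path, header, dev)
    write_csv(test_path, header, test)
    return len(train), len(dev), len(test)
```

For example, split_csv("all.csv", "train.csv", "dev.csv", "test.csv") would write the three files next to your script, and you would then pass those paths to --train_files, --dev_files and --test_files.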
