Google Speech API + Go - Transcribing Audio Stream of Unknown Length
I have an RTMP stream of a video call and I want to transcribe it. I have created two services in Go, and I'm getting results, but they're not very accurate and a lot of data seems to get lost.

Let me explain.
I have a transcode service: I use ffmpeg to transcode the video to Linear16 audio and place the output bytes onto a PubSub queue for a transcribe service to handle. Obviously there is a limit to the size of a PubSub message, and I want to start transcribing before the end of the video call. So I chunk the transcoded data into roughly 3-second clips (not a fixed length, it just seems about right) and put them onto the queue.

The data is transcoded quite simply:
var stdout bytes.Buffer
cmd := exec.Command("ffmpeg", "-i", url, "-f", "s16le", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", "-")
cmd.Stdout = &stdout
if err := cmd.Start(); err != nil {
    log.Fatal(err)
}
ticker := time.NewTicker(3 * time.Second)
for {
    select {
    case <-ticker.C:
        bytesConverted := stdout.Len()
        log.Infof("Converted %d bytes", bytesConverted)
        // Send the data we converted, even if there are no bytes.
        topic.Publish(ctx, &pubsub.Message{
            Data: stdout.Bytes(),
        })
        stdout.Reset()
    }
}
The transcribe service pulls messages from the queue at a rate of one every 3 seconds, so it processes the audio data at about the same rate as it's being created. There are limits on the Speech API stream: it can't be longer than 60 seconds, so I stop the old stream and start a new one every 30 seconds, and we never hit the limit no matter how long the video call lasts.

This is how I'm transcribing it:
stream := prepareNewStream()
clipLengthTicker := time.NewTicker(30 * time.Second)
chunkLengthTicker := time.NewTicker(3 * time.Second)
cctx, cancel := context.WithCancel(context.TODO())
err := subscription.Receive(cctx, func(ctx context.Context, msg *pubsub.Message) {
    select {
    case <-clipLengthTicker.C:
        log.Infof("Clip length reached.")
        log.Infof("Closing stream and starting over")
        err := stream.CloseSend()
        if err != nil {
            log.Fatalf("Could not close stream: %v", err)
        }
        go getResult(stream)
        stream = prepareNewStream()
    case <-chunkLengthTicker.C:
        log.Infof("Chunk length reached.")
        bytesConverted := len(msg.Data)
        log.Infof("Received %d bytes\n", bytesConverted)
        if bytesConverted > 0 {
            if err := stream.Send(&speechpb.StreamingRecognizeRequest{
                StreamingRequest: &speechpb.StreamingRecognizeRequest_AudioContent{
                    AudioContent: msg.Data,
                },
            }); err != nil {
                resp, _ := stream.Recv()
                log.Errorf("Could not send audio: %v", resp.GetError())
            }
        }
        msg.Ack()
    }
})
I think the problem is that my 3-second chunks don't necessarily line up with the starts and ends of phrases or sentences. I suspect the Speech API is a recurrent neural network that has been trained on full sentences rather than individual words, so starting a clip in the middle of a sentence loses some data: it can't figure out the first few words, up to the natural end of a phrase. I also lose some data when changing from an old stream to a new one; some context is lost. I guess overlapping clips might help with this.
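For what it's worth, the overlap idea could be sketched like this (my own illustration, not tested against the Speech API; `overlap` and the tail length are made up): carry the tail of the previous chunk forward and prepend it to the chunk that opens a fresh stream, so words cut at the boundary appear whole in the new stream. At 16 kHz Linear16 mono, one second of audio is 16000 × 2 = 32000 bytes.

```go
package main

import "fmt"

// overlap returns the current chunk with the tail of the previous one
// prepended, plus the tail to carry into the next call. tailLen is in
// bytes; at 16 kHz Linear16 mono, 32000 bytes is about one second.
// Only useful when restarting the stream: overlapping within a single
// stream would make the recognizer hear duplicated audio.
func overlap(prevTail, chunk []byte, tailLen int) (out, nextTail []byte) {
	out = append(append([]byte{}, prevTail...), chunk...)
	if len(chunk) > tailLen {
		nextTail = append([]byte{}, chunk[len(chunk)-tailLen:]...)
	} else {
		nextTail = append([]byte{}, chunk...)
	}
	return out, nextTail
}

func main() {
	out, tail := overlap(nil, []byte("hello "), 3)
	fmt.Printf("%q %q\n", out, tail) // "hello " "lo "
	out, tail = overlap(tail, []byte("world"), 3)
	fmt.Printf("%q %q\n", out, tail) // "lo world" "rld"
}
```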
I have a couple of questions:

1) Does this architecture seem appropriate for my constraints (unknown length of audio stream, etc.)?

2) What can I do to improve accuracy and minimise lost data?

(Note: I've simplified the examples for readability. Point out if anything doesn't make sense, because I've been heavy-handed in cutting the examples down.)
I think you are right that splitting the audio into chunks causes many words to be chopped off.
I see another problem in the publishing. Between the calls to topic.Publish and stdout.Reset() some time will pass, and ffmpeg will probably have written some unpublished bytes to stdout, which then get cleared by the reset.
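One way to close that gap (a sketch of my own; the type and method names are made up) is to guard the buffer with a mutex and make the read-and-reset step a single atomic operation, so any bytes ffmpeg writes in between are kept for the next publish:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// lockedBuffer makes Write and the read-then-reset step mutually
// exclusive, so bytes written between reading and resetting are never
// dropped. Pass a *lockedBuffer as cmd.Stdout instead of a bare Buffer.
type lockedBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (b *lockedBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

// Take copies out everything buffered so far and resets the buffer in
// the same critical section, returning the copy.
func (b *lockedBuffer) Take() []byte {
	b.mu.Lock()
	defer b.mu.Unlock()
	data := make([]byte, b.buf.Len())
	copy(data, b.buf.Bytes())
	b.buf.Reset()
	return data
}

func main() {
	var b lockedBuffer
	b.Write([]byte("audio bytes"))
	fmt.Printf("%q\n", b.Take()) // "audio bytes"
	fmt.Println(len(b.Take()))   // 0
}
```

The ticker loop would then publish with `Data: b.Take()` instead of calling `stdout.Bytes()` followed by `stdout.Reset()`.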
I am afraid the architecture is not fitted to your problem. The constraint on message size causes many problems. The idea of a PubSub system is that a publisher notifies subscribers of events, not necessarily that it carries a large payload.
Do you really need two services? You could use two goroutines communicating via a channel, which would eliminate the PubSub system.
A strategy would be to make the chunks as large as possible. A possible solution: