Huggingface document summarization for long documents

Question

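I would expect summarization tasks to generally assume long documents. However, following the documentation here, every simple summarization call I make says my documents are too long: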

>>> from transformers import pipeline
>>> summarizer = pipeline("summarization")
>>> summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (5620 > 1024). Running this sequence through the model will result in indexing errors

>>> summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
>>> summary = summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (8084 > 1024). Running this sequence through the model will result in indexing errors

>>> summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base")
>>> summary = summarizer(fulltext)
Token indices sequence length is longer than the specified maximum sequence length for this model (5971 > 512). Running this sequence through the model will result in indexing errors

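What model or configuration choice makes this most automatic? I've read other questions suggesting manually chunking the data or truncating it, but the choice of boundaries and chunk length seems like it would affect the summaries. What is the best practice for an arbitrarily long document? (Unbounded would be great, but let's say 50,000 tokens at a minimum.)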

Answer 1

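I am assuming that a minimum length of 50k tokens means you are trying to summarize something as big as a novel. Unfortunately, we do not yet have a model that can process that much data at once. This is mostly because the memory footprint of such models would be too high for production use. That said, Pegasus (Google), Longformer, and Reformer are all viable options for summarizing long documents. Research is still ongoing into models that can process longer sequences without consuming a lot of resources. For example, Reformer itself is highly optimized to handle a large number of tokens: https://huggingface.co/blog/reformer. The best practice so far is a "divide and conquer" approach: chunk your data, using the model's maximum length as a reference. You can even do this iteratively, summarizing the summaries, until you reach the summary length you want. You can also explore different methods of summarization, such as extractive and abstractive summarization, and use your creativity in combining them, for example extractive summarization followed by abstractive summarization.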
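As a rough illustration of the "divide and conquer" idea, here is a minimal sketch, not a definitive implementation. The helper names (`split_sentences`, `chunk_text`, `summarize_long`, `make_hf_helpers`) and the parameters `max_tokens` and `target_tokens` are hypothetical and chosen only for this example. The sentence splitter is deliberately naive, and the Hugging Face wiring assumes a transformers version that provides `pipeline("summarization")`, as used in the question.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, count_tokens: Callable[[str], int],
               max_tokens: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens still becomes its own chunk,
    so the summarizer should truncate as a safety net.
    """
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long(text: str, summarize: Callable[[str], str],
                   count_tokens: Callable[[str], int], max_tokens: int,
                   target_tokens: int, max_rounds: int = 5) -> str:
    """Summarize chunk by chunk, then repeat on the joined summaries
    until the text fits in target_tokens or max_rounds is reached."""
    for _ in range(max_rounds):
        if count_tokens(text) <= target_tokens:
            break
        chunks = chunk_text(text, count_tokens, max_tokens)
        text = " ".join(summarize(chunk) for chunk in chunks)
    return text


def make_hf_helpers(model_name: str = "facebook/bart-large-cnn"):
    """Build summarize/count_tokens callables from a transformers pipeline."""
    from transformers import pipeline  # imported lazily; needs transformers

    summarizer = pipeline("summarization", model=model_name)
    tokenizer = summarizer.tokenizer

    def count_tokens(s: str) -> int:
        return len(tokenizer.encode(s, add_special_tokens=False))

    def summarize(chunk: str) -> str:
        # truncation=True guards against a chunk that is still too long
        return summarizer(chunk, truncation=True)[0]["summary_text"]

    return summarize, count_tokens


# Possible usage (downloads the model on first run):
# summarize, count_tokens = make_hf_helpers()
# summary = summarize_long(fulltext, summarize, count_tokens,
#                          max_tokens=900, target_tokens=300)
```

Keeping `max_tokens` somewhat below the model limit (1024 for facebook/bart-large-cnn, per the warning in the question) leaves headroom for special tokens and for small differences between per-sentence and per-chunk token counts. Splitting on sentence boundaries is only one possible choice; as the question notes, the boundaries and chunk length will influence the result.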

Solution 1
0 2021-12-20 09:05:49
