
Best approach for semantic similarity in large documents using BERT or LSTM models

I am trying to build a search application for resumes which are in .pdf format. For a given search query like "who is proficient in Java and worked in an MNC", the output should be the CV that is most similar. My plan is to read the PDF text and compute the cosine similarity between the text and the query.
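Concretely, the pipeline I have in mind looks something like this (a minimal sketch; using pypdf for extraction is just an assumption on my part, and embed() below is a placeholder for whatever sentence-embedding model ends up being used, not a real API):

import numpy as np
from pypdf import PdfReader

def extract_text(path):
    # concatenate the text of every page of a PDF resume
    reader = PdfReader(path)
    return " ".join(page.extract_text() or "" for page in reader.pages)

def cosine_similarity(u, v):
    # cosine similarity between two embedding vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# embed() is a hypothetical stand-in for the chosen embedding model:
# doc_vec = embed(extract_text("resume.pdf"))
# query_vec = embed("who is proficient in Java and worked in an MNC")
# score = cosine_similarity(doc_vec, query_vec)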

However, BERT has a problem with long documents. It supports a sequence length of only 512 tokens, but all my CVs have more than 1000 words. I am really stuck here. Methods like truncating the documents don't suit the purpose.

Is there any other model that can do this?

I could not find the right approach for this task with models like Longformer and XLNet. Here is my current attempt with the Universal Sentence Encoder:

import numpy as np
import tensorflow_hub as hub

def cosine(u, v):
    # cosine similarity (higher means more similar)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

corpus = list(documents.values())      # documents: {name: resume text}
doc_names = list(documents.keys())
sentence_embeddings = model(corpus)    # embed all documents in one batch

query = "who is proficient in C++ and has Rust"
query_vec = model([query.lower()])[0]

results = []
for i, emb in enumerate(sentence_embeddings):
    # reuse the precomputed document embeddings instead of re-encoding
    sim = cosine(query_vec, emb)
    results.append((i, sim))

results = sorted(results, key=lambda x: x[1], reverse=True)

for idx, score in results[:5]:
    print(doc_names[idx].strip(), "(Cosine Score: %.4f)" % score)

I advise you to read: Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The Long-Document Transformer." arXiv preprint arXiv:2004.05150 (2020).

The main point of this paper is that the model can receive long document token sequences as input and process long-range context across the whole document at a computational cost that scales linearly with sequence length.

Here, the sliding-window attention mechanism attends to a local window of n = 512 tokens around each position, rather than BERT's full self-attention, where N = 512 tokens is the maximum length of the entire input sequence.
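To make this concrete, here is a rough sketch of how a CV and a query could be embedded with the allenai/longformer-base-4096 checkpoint from Hugging Face transformers (the mean-pooling step and the variable cv_text are my own assumptions, not something the paper prescribes):

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

def embed(text):
    # Longformer accepts up to 4096 tokens, so a 1000+ word CV fits whole
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    # give the first (<s>/CLS) token global attention so it sees the whole document
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    with torch.no_grad():
        out = model(**inputs, global_attention_mask=global_attention_mask)
    # mean-pool the token states into one document vector (an assumed pooling choice)
    return out.last_hidden_state.mean(dim=1).squeeze(0)

doc_vec = embed(cv_text)   # cv_text: full resume text, assumed already extracted
query_vec = embed("who is proficient in Java and worked in an MNC")
score = torch.nn.functional.cosine_similarity(doc_vec, query_vec, dim=0)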


Longformer: The Long-Document Transformer

GitHub: https://github.com/allenai/longformer

Paper: https://arxiv.org/abs/2004.05150
