
Best approach for semantic similarity in large documents using BERT or LSTM models

I am trying to build a search application for resumes which are in .pdf format. For a given search query like "who is proficient in Java and worked in an MNC", the output should be the CV that is most similar. My plan is to read the PDF text and compute the cosine similarity between the text and the query.
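Concretely, the pipeline I have in mind looks something like this (a minimal sketch; using pypdf for extraction is just an assumption on my part, and embed() below is a placeholder for whatever sentence-embedding model ends up being used, not a real API):

import numpy as np
from pypdf import PdfReader

def extract_text(path):
    # concatenate the text of every page of a PDF resume
    reader = PdfReader(path)
    return " ".join(page.extract_text() or "" for page in reader.pages)

def cosine_similarity(u, v):
    # cosine similarity between two embedding vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# embed() is a hypothetical stand-in for the chosen embedding model:
# doc_vec = embed(extract_text("resume.pdf"))
# query_vec = embed("who is proficient in Java and worked in an MNC")
# score = cosine_similarity(doc_vec, query_vec)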

However, BERT has a problem with long documents. It supports a sequence length of only 512 tokens, but all my CVs have more than 1000 words. I am really stuck here. Methods like truncating the documents don't suit the purpose.

Is there any other model that can do this?

I could not find the right approach for this task with models like Longformer and XLNet. Here is my current attempt with the Universal Sentence Encoder:

import numpy as np
import tensorflow_hub as hub

def cosine(u, v):
    # cosine similarity (higher means more similar)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

corpus = list(documents.values())      # documents: {name: resume text}
doc_names = list(documents.keys())
sentence_embeddings = model(corpus)    # embed all documents in one batch

query = "who is proficient in C++ and has Rust"
query_vec = model([query.lower()])[0]

results = []
for i, emb in enumerate(sentence_embeddings):
    # reuse the precomputed document embeddings instead of re-encoding
    sim = cosine(query_vec, emb)
    results.append((i, sim))

results = sorted(results, key=lambda x: x[1], reverse=True)

for idx, score in results[:5]:
    print(doc_names[idx].strip(), "(Cosine Score: %.4f)" % score)

I advise you to read: Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The Long-Document Transformer." arXiv preprint arXiv:2004.05150 (2020).

The main point of this paper is that the model can receive long document token sequences as input and process long-range context across the whole document at a computational cost that scales linearly with sequence length.

Here, the sliding-window attention mechanism attends to a local window of n = 512 tokens around each position, rather than BERT's full self-attention, where N = 512 tokens is the maximum length of the entire input sequence.
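To make this concrete, here is a rough sketch of how a CV and a query could be embedded with the allenai/longformer-base-4096 checkpoint from Hugging Face transformers (the mean-pooling step and the variable cv_text are my own assumptions, not something the paper prescribes):

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

def embed(text):
    # Longformer accepts up to 4096 tokens, so a 1000+ word CV fits whole
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    # give the first (<s>/CLS) token global attention so it sees the whole document
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    with torch.no_grad():
        out = model(**inputs, global_attention_mask=global_attention_mask)
    # mean-pool the token states into one document vector (an assumed pooling choice)
    return out.last_hidden_state.mean(dim=1).squeeze(0)

doc_vec = embed(cv_text)   # cv_text: full resume text, assumed already extracted
query_vec = embed("who is proficient in Java and worked in an MNC")
score = torch.nn.functional.cosine_similarity(doc_vec, query_vec, dim=0)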


Longformer: The Long-Document Transformer

GitHub: https://github.com/allenai/longformer

Paper: https://arxiv.org/abs/2004.05150
