
Best approach for semantic similarity in large documents using BERT or LSTM models

I am trying to build a search application for resumes which are in .pdf format. For a given search query like "who is proficient in Java and worked in an MNC", the output should be the CV that is most similar. My plan is to read the PDF text and compute the cosine similarity between that text and the query.

However, BERT has a problem with long documents: it supports a maximum sequence length of only 512 tokens, but all my CVs have more than 1000 words. I am really stuck here. Methods like truncating the documents don't suit the purpose.

Is there any other model that can do this?

I could not figure out the right way to use models like Longformer and XLNet for this task.

import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# `documents` maps each CV file name to its extracted text
doc_names = list(documents.keys())
corpus = list(documents.values())

# Embed all CVs in one batch, and the query separately
sentence_embeddings = model(corpus).numpy()
query = "who is proficient in C++ and has Rust"
query_vec = model([query.lower()])[0].numpy()

results = []
for i, doc_vec in enumerate(sentence_embeddings):
    sim = cosine_similarity(query_vec, doc_vec)
    results.append((i, sim))
    # print("Document =", doc_names[i], "; similarity =", sim)

# Rank documents by descending cosine similarity and show the top 5
results = sorted(results, key=lambda x: x[1], reverse=True)

for idx, score in results[:5]:
    print(doc_names[idx].strip(), "(Cosine Score: %.4f)" % score)
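
One workaround for the 512-token limit is to split each CV into overlapping word chunks, embed every chunk with the same Universal Sentence Encoder, and score a CV by its best-matching chunk. The sketch below assumes `documents` maps file names to raw CV text, and the chunk size and overlap are arbitrary choices, not values from the question:

import numpy as np
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def chunk_text(text, chunk_words=100, overlap=20):
    # Split a long document into overlapping windows of words
    words = text.split()
    step = chunk_words - overlap
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = model(["who is proficient in java and worked in an mnc"])[0].numpy()

scores = {}
for name, text in documents.items():
    chunk_vecs = model(chunk_text(text)).numpy()
    # A CV's score is the similarity of its best-matching chunk
    scores[name] = max(cosine_similarity(query_vec, v) for v in chunk_vecs)

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(name, "(Cosine Score: %.4f)" % score)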

I advise you to read: Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv preprint arXiv:2004.05150 (2020).

The main contribution of this paper is a model that can take long token sequences as input and process long-range context across the whole document at a computational cost that scales linearly with sequence length.

Here, a sliding-window attention mechanism attends to a window of w = 512 tokens around each position, instead of BERT's full self-attention, which is limited to an input sequence of N = 512 tokens and scales quadratically with its length. This lets Longformer handle inputs of up to 4096 tokens.
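
As a rough sketch of how this could be applied to the CV ranking task (this is an assumed setup, not something from the paper): load the pretrained allenai/longformer-base-4096 checkpoint from Hugging Face transformers, mean-pool its final hidden states into one vector per CV, and rank CVs by cosine similarity against the query. The mean-pooling choice and the `documents` dict mapping file names to CV text are assumptions here:

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
longformer = LongformerModel.from_pretrained("allenai/longformer-base-4096")
longformer.eval()

def embed(text):
    # Tokenize up to Longformer's 4096-token limit
    inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        outputs = longformer(**inputs)
    # Mean-pool the token embeddings into a single document vector (an assumed pooling choice)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

query_vec = embed("who is proficient in Java and worked in an MNC")
scores = {}
for name, text in documents.items():  # documents: CV file name -> extracted text
    scores[name] = torch.nn.functional.cosine_similarity(query_vec, embed(text), dim=0).item()

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(name, "(Cosine Score: %.4f)" % score)

Note that the base checkpoint is not fine-tuned for retrieval, so these embeddings are a starting point rather than a tuned similarity model.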


Longformer: The Long-Document Transformer

GitHub: https://github.com/allenai/longformer

Paper: https://arxiv.org/abs/2004.05150
