
Python, NLP - finding the top document containing a given list of words

I am new to NLP. As an exercise, I am trying to find the best-matching resume.

For example, I have a list of skills that I am looking for, such as ['java', 'python', 'SQL', 'API', ...], and a set of documents. I want to create a model that finds the document that best matches these skills, similar to resume matching.

I started with this tutorial, Extracting words from pdf, as a reference.

I was able to extract the text from the PDFs, remove stop words, perform lemmatization, and compute the number of times these keywords appear in each document, but I am not sure how to go ahead from here.
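For reference, the counting step described above could look roughly like the sketch below. The function name `keyword_counts` and the sample documents are hypothetical, and the tokens are assumed to be already extracted, stripped of stop words, and lemmatized.

```python
from collections import Counter


def keyword_counts(tokens, keywords):
    """Count how often each keyword occurs in a token list (case-insensitive)."""
    counts = Counter(token.lower() for token in tokens)
    # A Counter returns 0 for missing keys, so absent keywords get a count of 0.
    return {kw: counts[kw.lower()] for kw in keywords}


# Hypothetical, already preprocessed documents
documents = {
    "resume_a": ["python", "developer", "build", "API", "python", "SQL"],
    "resume_b": ["java", "engineer", "design", "API"],
}
skills = ["java", "python", "SQL", "API"]

for name, tokens in documents.items():
    print(name, keyword_counts(tokens, skills))
```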

Could anyone tell me what the next steps should be? Any tutorials or references would be helpful as well.

If you assume that the "best match" is the resume with the largest intersection with the set of skills, then you could do the following (Python):

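# Each document is a list of tokens
D = [["I", "know", "python"], ["I", "know", "java"]]  # list of documents
skills = ["java"]  # list of skills to look for

# Number of distinct skills that appear in each document
I = [len(set(skills) & set(d)) for d in D]

# Document indices ranked by intersection size, largest first
R = sorted(range(len(I)), key=lambda k: I[k], reverse=True)
best_resume = R[0]  # index of the best-matching document
print(R)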
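The same idea can be wrapped in a small helper. The following is only a sketch: the name `rank_by_skill_overlap` is hypothetical, and it assumes that case should be ignored when comparing tokens with skills, which the snippet above does not do. It returns the score alongside each document index so that ties are visible.

```python
def rank_by_skill_overlap(documents, skills):
    """Return (document index, score) pairs, best match first.

    The score is the number of distinct skills found in the document.
    """
    wanted = {skill.lower() for skill in skills}
    scores = [len(wanted & {token.lower() for token in doc}) for doc in documents]
    # sorted() is stable, so documents with equal scores keep their original order.
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


docs = [
    ["I", "know", "python"],
    ["I", "know", "Java", "and", "SQL"],
    ["I", "cook"],
]
ranking = rank_by_skill_overlap(docs, ["java", "python", "SQL", "API"])
print(ranking)
```

A set intersection only records whether a skill is present, not how often it is mentioned. If frequency should matter, the per-document keyword counts could be used as the score instead.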

I hope this is useful. Good luck.
