
How does keyword extraction work?

I tested the keyword extraction feature of IBM's Natural Language Understanding service with the following text:

Desarrollo PDA. Ajustes PDA. Nuevo modulo PDA. Ajustes modulo PDA. No sincroniza PDA. Error modulo PDA.

And I got the following response:

  • modulo pda with 98.31% relevance
  • ajustes modulo pda with 64.44% relevance
  • nuevo modulo pda with 64.34% relevance

Now my question is: why does the keyword "modulo pda" get 98.31% relevance, rather than plain "PDA" getting an even higher score? I've searched everywhere for how IBM's service works, to no avail.

The actual algorithm used to extract and score keywords is most likely a proprietary recipe, and I wouldn't expect IBM to make it public. You can find a lot of research papers on the topic, but commercial products usually combine several techniques to get the best results.

You can run the same text through the NLU services of different providers, such as IBM, Google, and Amazon, and compare the results.

Specifically for your query: you are trying to extract keywords or topics from a single document, and PDA occurs in every sentence of it. If we apply a simple technique like TF-IDF, treating each sentence as a document and using the plain IDF = log(N / df), then the TF-IDF of the word PDA is 0. It occurs in all six sentences, so its IDF is log(6/6) = 0, and it becomes irrelevant because it adds no information that distinguishes one sentence from another. "modulo", by contrast, appears in only three of the six sentences, so it keeps a positive weight.
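To make that concrete, here is a minimal sketch of the TF-IDF calculation on the text from the question. It only illustrates the simple technique described above and is not a claim about what IBM's service does internally. The function name `tf_idf`, the whitespace tokenizer, and the choice of the unsmoothed natural-log IDF are my own assumptions.

```python
import math
from collections import Counter

# The six "sentences" from the question, each treated as its own document.
sentences = [
    "Desarrollo PDA",
    "Ajustes PDA",
    "Nuevo modulo PDA",
    "Ajustes modulo PDA",
    "No sincroniza PDA",
    "Error modulo PDA",
]


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumption: term frequency is count / document length and
    IDF is the unsmoothed ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(sentences)
for sentence, row in zip(sentences, scores):
    print(sentence, {term: round(score, 3) for term, score in row.items()})
```

With this definition "pda" scores 0 in every sentence, while "modulo" (three of six sentences) and the rarer words keep positive scores. Smoothed IDF variants, such as scikit-learn's default, would give "pda" a non-zero weight, though still the lowest of any term here.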


 