
Algorithm to get the topic / focus of a sentence out of the words in the sentence

Are there any well-known or successful algorithms for obtaining the topic and/or focus of a sentence (a question) from the words in it?

If not, how would I go about getting the topic / focus of the question? It seems that the topic / focus of a question is usually a noun or a noun phrase.

So the first thing I would do is determine the nouns by part-of-speech tagging the question. But then how do I know whether I should take just the nouns, or the noun(s) and an adjective before them, or the noun and the adverb before it, or the noun(s) and the verb?

For example:

In 'did the quick brown fox jump over the lazy dog', get 'quick brown fox', 'jump', and 'lazy dog'.

In 'what is the population of japan', get 'population' and 'japan'.

In 'what color is milk', get 'color' and 'milk'.

In 'What is the height of Mt. Everest', get 'Mt. Everest' and 'Height'.

While writing these, I guess the easiest way is removing stop words.
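A minimal sketch of the stop-word idea, for illustration only. The `STOP_WORDS` set and the `content_words` helper are hypothetical names, and the list is deliberately tiny; a real stop-word list would be much longer.

```python
# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "did", "do", "does", "is", "of", "over", "the", "what"}


def content_words(question):
    """Return the words of `question` that are not stop words, in original order."""
    words = question.lower().replace("?", " ").split()
    return [w for w in words if w not in STOP_WORDS]


print(content_words("what is the population of japan"))
print(content_words("did the quick brown fox jump over the lazy dog"))
```

Note that this keeps 'quick', 'brown' and 'fox' as separate words; grouping them into 'quick brown fox' needs more than a stop-word filter.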

First of all, I think the problem is language-dependent.

Secondly, I think that if you have a set of words, you could check their popularity/frequency in the language; e.g. the word "the" occurs much more often than the word "euphoric", so "euphoric" has more chance of being a proper keyword.
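The frequency check could look roughly like this. The numbers in `WORD_FREQ` are made up for illustration (they are not real corpus counts), and `rank_by_rarity` is a hypothetical helper; in practice the table would come from a large corpus.

```python
# Hypothetical relative frequencies (occurrences per million words).
# These values are invented for the example, not measured.
WORD_FREQ = {
    "the": 50000.0,
    "of": 25000.0,
    "is": 10000.0,
    "what": 2000.0,
    "population": 80.0,
    "euphoric": 1.0,
}


def rank_by_rarity(words, freq=WORD_FREQ, unseen=0.5):
    """Sort words from rarest to most common.

    Words missing from the table get the `unseen` frequency, so they
    are treated as very rare (an assumption of this sketch).
    """
    return sorted(words, key=lambda w: freq.get(w.lower(), unseen))


print(rank_by_rarity(["the", "euphoric", "population"]))
```

The rarest words come first, so the head of the list holds the keyword candidates.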

Here, however, spelling is crucial. How to deal with this? One idea is to apply distance algorithms such as Levenshtein to words that do not occur often (or do a Google search with the word and check whether you get results or a "did you mean" notification).
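For the Levenshtein idea, here is a small sketch using the standard dynamic-programming formulation. `closest_word` and its `max_distance` threshold are assumptions made for the example, not part of any library.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def closest_word(word, vocabulary, max_distance=2):
    """Return the nearest vocabulary entry, or None if nothing is close enough."""
    if not vocabulary:
        return None
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else None


print(closest_word("everst", ["everest", "japan", "milk"]))
```

A rare word that sits within a small distance of a common dictionary word is probably a misspelling of it, which is the case the 'Everst' typo in the question illustrates.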

Some languages are, though, more structured than others. In English, to find nouns you can first check for words preceded by "a/an", and then for words that end in "s", to find possible candidates for nouns. Then compare them against a dictionary.

With adjectives you can perhaps assume that a possible adjective will be located right before the noun. Then just compare the possible adjective with the dictionary.
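The noun and adjective heuristics above could be sketched as follows. The `NOUNS` and `ADJECTIVES` sets are hypothetical stand-ins for a real dictionary, the input is assumed to be lowercase, and stripping a trailing "s" is only a crude stand-in for proper singularization.

```python
# Hypothetical toy dictionaries; a real system would load a full lexicon.
NOUNS = {"apple", "dog", "fox"}
ADJECTIVES = {"brown", "lazy", "quick"}


def find_nouns(words):
    """Return (index, word) pairs for noun candidates confirmed by NOUNS.

    A word is a candidate if it follows "a"/"an" or ends in "s".
    """
    found = []
    for i, word in enumerate(words):
        follows_article = i > 0 and words[i - 1] in ("a", "an")
        ends_in_s = word.endswith("s")
        if not (follows_article or ends_in_s):
            continue  # not a candidate under either heuristic
        singular = word[:-1] if ends_in_s else word
        if word in NOUNS or singular in NOUNS:
            found.append((i, word))
    return found


def adjectives_before(words, noun_index):
    """Collect the run of dictionary adjectives directly preceding a noun."""
    adjectives = []
    i = noun_index - 1
    while i >= 0 and words[i] in ADJECTIVES:
        adjectives.insert(0, words[i])
        i -= 1
    return adjectives


words = "an apple beats two lazy dogs".split()
for index, noun in find_nouns(words):
    print(adjectives_before(words, index), noun)
```

Here "beats" is picked up as a candidate because it ends in "s", and the dictionary check is what rejects it. The heuristics also miss nouns such as "fox" in "the quick brown fox", which neither follows "a/an" nor ends in "s".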

Then you could of course keep a blacklist of words that are never allowed as keywords.

The best solution would perhaps be a self-learning neural system, but I'm not familiar enough with those to give any suggestions.

This could be thought of as a parsing problem, and I personally find the Stanford NLP tool very effective.

Here is the link to the demo of the Stanford parser.

For the example "did the quick brown fox jump over the lazy dog", the output you get is:

did/VBD
the/DT
quick/JJ
brown/JJ
fox/NN
jump/VB
over/RP
the/DT
lazy/JJ
dog/NN

From this output you can write an extractor that pulls out the nouns (and adjectives and adverbs if need be) and thus obtains the topics of the sentence.

Moreover, the parse tree looks like:
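Such an extractor might look like the sketch below. It works on the `word/TAG` text shown above rather than calling the parser itself, and `extract_by_tag` is a hypothetical helper name. The tags are Penn Treebank tags, where noun tags start with `NN`, adjective tags with `JJ` and adverb tags with `RB`.

```python
# The tagger output from above, joined onto one line.
TAGGED = ("did/VBD the/DT quick/JJ brown/JJ fox/NN jump/VB "
          "over/RP the/DT lazy/JJ dog/NN")


def extract_by_tag(tagged_text, prefixes=("NN",)):
    """Return the words whose POS tag starts with one of `prefixes`."""
    result = []
    for token in tagged_text.split():
        # Split on the last "/" so words that contain "/" still work.
        word, _, tag = token.rpartition("/")
        if tag.startswith(tuple(prefixes)):
            result.append(word)
    return result


print(extract_by_tag(TAGGED))                # nouns only
print(extract_by_tag(TAGGED, ("NN", "JJ")))  # nouns and adjectives
```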

(ROOT
  (SINV (VBD did)
    (NP (DT the) (JJ quick) (JJ brown) (NN fox))
    (VP (VB jump)
      (PRT (RP over))
      (NP (DT the) (JJ lazy) (NN dog)))))

If you take a closer look at the parse tree, the output you are expecting is the two NPs (noun phrases): "the quick brown fox" and "the lazy dog".

I hope this helps!
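As a rough sketch, the NPs can be pulled out of the bracketed tree text with a few lines of plain Python. The `parse`, `leaves` and `phrases` functions here are hypothetical helpers written for this example, not part of the Stanford tools, and they assume a well-formed tree like the one above.

```python
import re

TREE = """(ROOT
  (SINV (VBD did)
    (NP (DT the) (JJ quick) (JJ brown) (NN fox))
    (VP (VB jump)
      (PRT (RP over))
      (NP (DT the) (JJ lazy) (NN dog)))))"""


def parse(text):
    """Parse a bracketed tree into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token          # a leaf word
        node = [tokens[pos]]      # the node label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                  # consume the closing ")"
        return node

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def phrases(node, label="NP"):
    """Return the text of every subtree whose label equals `label`."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        found.extend(phrases(child, label))
    return found


print(phrases(parse(TREE)))
```

Dropping the leading determiner ("the") from each phrase would give the 'quick brown fox' and 'lazy dog' forms asked for in the question.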
