
Can an inverted index have multiple words in one entry?

In information retrieval, the entries of an inverted index are the words of the corpus, and each word has a postings list, which is the list of documents it appears in.

If stemming is applied, the index entry is a stem, so multiple words may end up mapping to the same entry if they share the same stem. For example:

Without stemming:

(slowing) --> [D1, D5, D9,...]

(slower) --> [D9, D10, D20,...]

(slow) --> [D2,...]

With stemming:

(slow) --> [D1, D2, D5, D9, D10, D20...]

I want to avoid stemming, and instead would like each entry in my inverted index to be a bag of words (inflections) such as (slow, slows, slowing, slowed, slower, slowest). For example:

(slow, slows, slowing, slowed, slower, slowest) --> [D1, D2, D5, D9, D10, D20...]

Would that be possible and feasible or not?

Thanks in advance.

Short Answer: Simply skip stemming if you do not want slow and slows to be treated as a match.

Long Answer:

Question: I want to avoid stemming, and instead would like to make each entry in my inverted index as a bag of words (inflections) such as (slow, slows, slowing, slowed, slower, slowest).

Let me try to clear up some confusion you have about inverted lists. It is the documents that are stored in the postings for each term (not the terms themselves).

The words are typically stored in an in-memory dictionary (implemented with a hash table or a trie) containing pointers to the postings (the list of documents which contain that particular term), which are stored on secondary storage and loaded on the fly.
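To make the dictionary-plus-pointers layout concrete, here is a minimal sketch. It assumes integer document IDs, a fixed 4-byte encoding, and uses `io.BytesIO` as a stand-in for a file on secondary storage; the function names `write_postings` and `read_postings` are made up for illustration and are not from any library.

```python
import io
import struct

def write_postings(index, storage):
    """Write each postings list to storage; return term -> (offset, count)."""
    dictionary = {}
    for term, doc_ids in index.items():
        offset = storage.tell()
        # Each document ID is stored as a little-endian unsigned 32-bit int.
        storage.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
        dictionary[term] = (offset, len(doc_ids))
    return dictionary

def read_postings(term, dictionary, storage):
    """Follow the pointer kept in the in-memory dictionary and load the list."""
    if term not in dictionary:
        return []
    offset, count = dictionary[term]
    storage.seek(offset)
    return list(struct.unpack(f"<{count}I", storage.read(4 * count)))

storage = io.BytesIO()  # assumption: stands in for a postings file on disk
dictionary = write_postings(
    {"information": [1, 5, 9], "retrieval": [1, 9, 17]}, storage
)
print(read_postings("retrieval", dictionary, storage))
```

Only the small `dictionary` needs to stay in memory; a postings list is read when a query asks for its term.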

A simple example (without showing document weights):

(information) --> [D1, D5, D9,...]

(informative) --> [D9, D10, D20,...]

(retrieval) --> [D1, D9, D17,...]

..
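As a rough illustration of how such an unstemmed index could be built, the sketch below assumes whitespace tokenization, lowercasing, and integer document IDs; the toy corpus and the name `build_index` are invented for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each surface form, unchanged, to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Toy corpus (an assumption for illustration only).
docs = {
    1: "information retrieval",
    9: "informative information retrieval",
    10: "informative",
}
index = build_index(docs)
print(index["information"])  # [1, 9]
print(index["informative"])  # [9, 10]
```

Because no normalization is applied, information and informative keep separate postings lists.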

So, if you don't want to apply stemming, that's fine. In fact, the above example shows an unstemmed index, where the words information and informative appear in their non-conflated forms. In a conflated term index (built with a stemmer or a lemmatizer), you would substitute the different forms with an equivalent representation (say inform). In that case, the index will be:

(inform) --> [D1, D5, D9, D10, D20...] --- union of the different forms

(retrieval) --> [D1, D9, D17,...]

..

So, this conflated representation matches all possible forms of the word information, e.g. informative, informational etc.
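A conflated index of this kind can be sketched as follows. The `CONFLATE` table is a hypothetical hand-written stand-in for a real stemmer or lemmatizer, and the corpus is a toy one; the point is only that the same normalization is applied at indexing time and at query time.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a stemmer or lemmatizer.
CONFLATE = {
    "information": "inform",
    "informative": "inform",
    "informational": "inform",
}

def normalize(word):
    """Return the shared representation of a word, or the word itself."""
    return CONFLATE.get(word, word)

def build_conflated_index(docs):
    """Index documents under the normalized form of each word."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[normalize(word)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, word):
    """Normalize the query word the same way before the lookup."""
    return index.get(normalize(word.lower()), [])

docs = {
    1: "information retrieval",
    9: "informative retrieval",
    10: "informative",
    20: "informational",
}
index = build_conflated_index(docs)
print(search(index, "informational"))  # [1, 9, 10, 20]
```

A table like this is effectively the bag of inflections from the question: every listed surface form leads to one shared entry, without running a stemming algorithm.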

Longer Answer

Now let's say you want the best of both worlds, i.e. a representation that lets the user control this conflation, e.g. wrapping a word in quotes to require an exact match ("slow" vs. slow in the query), or some indicator to include synonyms of a query term for semantic search (e.g. syn(slow) to include synonyms of the word slow).

For this, you need to maintain separate postings for the non-conflated words, plus additional equivalence-indicating pointers between the terms of each set of equivalent (stem relation, synonym relation, semantic relation etc.) terms.

Coming back to our example, you would have something like:

(E1)-->(information) --> [D1, D5, D9,...]
 |---->(informative) --> [D9, D10, D20,...]
 |---->(data) --> [D20, D23, D25,...]


(E2)-->(retrieval) --> [D1, D9, D17,...]
 |---->(search) --> [D20, D30, D31,...]

..

Here, I have shown two examples of equivalence classes (concept representations) for two sets of terms: information, data... and retrieval, search... Depending on the query syntax, it would then be possible at retrieval time to support exact search or relaxed search (based on inflections, synonyms etc.).
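One way to sketch this structure is shown below, using the example postings with integer document IDs. It assumes a deliberately simple query convention: a term wrapped in double quotes means exact match, and a bare term means relaxed match over its whole equivalence class. The names `postings`, `classes` and `lookup` are illustrative only.

```python
# Separate postings are kept for every non-conflated term.
postings = {
    "information": [1, 5, 9],
    "informative": [9, 10, 20],
    "data": [20, 23, 25],
    "retrieval": [1, 9, 17],
    "search": [20, 30, 31],
}

# Equivalence classes: each class ID points at its member terms.
classes = {
    "E1": ["information", "informative", "data"],
    "E2": ["retrieval", "search"],
}
term_to_class = {t: c for c, terms in classes.items() for t in terms}

def lookup(query_term):
    """Exact match for a quoted term, union over the class for a bare term."""
    if len(query_term) >= 2 and query_term[0] == query_term[-1] == '"':
        return postings.get(query_term[1:-1], [])
    class_id = term_to_class.get(query_term)
    if class_id is None:
        return postings.get(query_term, [])
    merged = set()
    for term in classes[class_id]:
        merged.update(postings.get(term, []))
    return sorted(merged)

print(lookup('"information"'))  # [1, 5, 9]
print(lookup("information"))    # [1, 5, 9, 10, 20, 23, 25]
```

Since the union is computed at query time, the exact postings are never lost. Separate classes per relation type (inflections vs. synonyms) could be kept in the same way if the query syntax needs to distinguish them.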
