

How to build a simple inverted index?

I want to build a simple indexing function for a search engine without using any library such as Lucene. In the inverted index, I just need to record basic information about each word, e.g. docID, position, and frequency.

Now, I have several questions:

  1. What kind of data structure is often used for building an inverted index? A multidimensional list?

  2. After building the index, how do I write it to files? What format should the file have? Something like a table, as if drawing an index table on paper?

You can see a very simple implementation of an inverted index and search in TinySearchEngine.

For your first question, if you want to build a simple (in-memory) inverted index, the straightforward data structure is a hash map like this:

val invertedIndex = new collection.mutable.HashMap[String, List[Posting]]

or, in Java-esque form:

HashMap<String, List<Posting>> invertedIndex = new HashMap<String, List<Posting>>();

The hash map maps each term/word/token to a list of Postings. A Posting is just an object that represents the occurrences of a word inside a document:

case class Posting(docId:Int, var termFrequency:Int)

Indexing a new document is just a matter of tokenizing it (splitting it into tokens/words) and, for each token, inserting a new Posting into the correct list of the hash map. Of course, if a Posting already exists for that term in that specific docId, you increase its termFrequency instead. There are other ways of doing this. For in-memory inverted indexes this is OK, but for on-disk indexes you'd probably want to insert each Posting once with the correct termFrequency instead of updating it every time.
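The indexing step described above could look roughly like this in Java. It is only a sketch under a few assumptions that are not part of the answer: the tokenizer is a naive lowercase-and-split, each docId is passed to `addDocument` exactly once, and the class name `InMemoryIndex` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** One (docId, termFrequency) entry in a term's posting list. */
final class Posting {
    final int docId;
    int termFrequency;

    Posting(int docId, int termFrequency) {
        this.docId = docId;
        this.termFrequency = termFrequency;
    }
}

/** Hypothetical minimal in-memory index; assumes each docId is added exactly once. */
final class InMemoryIndex {
    final Map<String, List<Posting>> invertedIndex = new HashMap<>();

    /** Tokenizes the text and keeps one Posting per (term, docId) pair. */
    void addDocument(int docId, String text) {
        // Naive tokenizer (an assumption): lowercase, split on runs of non-letters/digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first element if the text starts with a delimiter
            }
            List<Posting> postings =
                    invertedIndex.computeIfAbsent(token, t -> new ArrayList<>());
            Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
            if (last != null && last.docId == docId) {
                last.termFrequency++; // term already seen in this document
            } else {
                postings.add(new Posting(docId, 1)); // first occurrence in this document
            }
        }
    }
}
```

Because a whole document is processed in one call, a Posting for the current docId, if it exists, is always the last element of the term's list, so no search through the list is needed.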

Regarding your second question, there are normally two cases:

(1) You have an (almost) immutable index. You index all your data once, and if you have new data you can just reindex. There is no need for real-time indexing or for indexing many times an hour, for example.

(2) New documents arrive all the time, and you need to be able to search the newly arrived documents as soon as possible.

For case (1), you can have at least 2 files:

1 - The inverted index file. For each term it lists all Postings (docId/termFrequency pairs). Here represented in plain text, but normally stored as binary data.


2 - The offset file. For each term it stores the offset at which its inverted list starts in the inverted index file. Here I'm representing the offset in characters, but you'll normally store binary data, so the offset will be in bytes. This file can be loaded into memory at startup. When you need to look up a term's inverted list, you look up its offset and read the inverted list from the file.

Term1 -> 0
Term2 -> 126
Term3 -> 222
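As a rough illustration of the two-file idea, the sketch below writes postings in a simple binary layout and returns the term-to-offset table. The layout (an int count followed by docId/termFrequency int pairs), the class name `OnDiskIndex`, and the use of `int[]{docId, termFrequency}` for a posting are all assumptions made for this example, not a format prescribed by the answer. Persisting the offset table itself is left out.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

/** Hypothetical binary layout per term: [count][docId, termFrequency]... (all 4-byte ints). */
final class OnDiskIndex {

    /** Writes every posting list and returns term -> byte offset (the offset table). */
    static Map<String, Long> write(Path indexFile, SortedMap<String, List<int[]>> index) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        try (RandomAccessFile out = new RandomAccessFile(indexFile.toFile(), "rw")) {
            out.setLength(0); // start from an empty file
            for (Map.Entry<String, List<int[]>> entry : index.entrySet()) {
                offsets.put(entry.getKey(), out.getFilePointer());
                out.writeInt(entry.getValue().size());
                for (int[] posting : entry.getValue()) {
                    out.writeInt(posting[0]); // docId
                    out.writeInt(posting[1]); // termFrequency
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }

    /** Seeks to the term's offset and reads back its posting list (empty if unknown). */
    static List<int[]> read(Path indexFile, Map<String, Long> offsets, String term) {
        List<int[]> postings = new ArrayList<>();
        Long offset = offsets.get(term);
        if (offset == null) {
            return postings;
        }
        try (RandomAccessFile in = new RandomAccessFile(indexFile.toFile(), "r")) {
            in.seek(offset);
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int docId = in.readInt();
                int termFrequency = in.readInt();
                postings.add(new int[] {docId, termFrequency});
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return postings;
    }
}
```

A real implementation would buffer the writes and usually compress docIds; the point here is only that a lookup is one map access, one seek, and one sequential read.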

Along with these two files you can (and generally will) have one or more files storing each term's IDF and each document's norm.
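For reference, here is one way these two values might be computed from the in-memory index before being written out. The formulas are assumptions chosen for illustration: IDF as ln(N / df), and the norm as the Euclidean length of the document's tf × idf vector. Real engines use several variants. The class name `Scoring` and the `int[]{docId, termFrequency}` posting representation are likewise hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; postings are int[]{docId, termFrequency}. */
final class Scoring {

    /** idf(term) = ln(totalDocs / documentFrequency); one common variant among several. */
    static Map<String, Double> idf(Map<String, List<int[]>> index, int totalDocs) {
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, List<int[]>> entry : index.entrySet()) {
            int documentFrequency = entry.getValue().size();
            if (documentFrequency == 0) {
                continue; // term with no postings: nothing to score
            }
            idf.put(entry.getKey(), Math.log((double) totalDocs / documentFrequency));
        }
        return idf;
    }

    /** norm(doc) = sqrt(sum over the doc's terms of (tf * idf)^2). */
    static Map<Integer, Double> norms(Map<String, List<int[]>> index, Map<String, Double> idf) {
        Map<Integer, Double> sumOfSquares = new HashMap<>();
        for (Map.Entry<String, List<int[]>> entry : index.entrySet()) {
            double termIdf = idf.getOrDefault(entry.getKey(), 0.0);
            for (int[] posting : entry.getValue()) {
                double weight = posting[1] * termIdf;
                sumOfSquares.merge(posting[0], weight * weight, Double::sum);
            }
        }
        Map<Integer, Double> norms = new HashMap<>();
        for (Map.Entry<Integer, Double> entry : sumOfSquares.entrySet()) {
            norms.put(entry.getKey(), Math.sqrt(entry.getValue()));
        }
        return norms;
    }
}
```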

For case (2), I'll try to briefly explain how Lucene (and consequently Solr and Elasticsearch) do it.

The file format can be the same as explained above. The main difference is that when you index new documents, systems like Lucene do not rebuild the index from scratch; they just create a new index containing only the new documents. So every time you have to index something, you do it in a new, separate index.

To perform a query on this "split" index, you can run the query against each of the separate indexes (in parallel) and merge the results together before returning them to the user.
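A toy version of that query step might look like the following. It is a sketch, not how Lucene implements it: each segment is assumed to be a plain in-memory map from term to `int[]{docId, termFrequency}` postings, each docId is assumed to live in exactly one segment, and hits are ranked by raw term frequency for simplicity. The class name `SegmentSearch` is hypothetical.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical single-term search across several small indexes ("segments"). */
final class SegmentSearch {

    /** Returns {docId, termFrequency} hits from all segments, highest frequency first. */
    static List<int[]> search(List<Map<String, List<int[]>>> segments, String term) {
        return segments.parallelStream() // each segment is queried independently
                .flatMap(segment ->
                        segment.getOrDefault(term, Collections.emptyList()).stream())
                // Merge step: order by termFrequency descending, then docId ascending.
                .sorted(Comparator.comparingInt((int[] p) -> p[1]).reversed()
                        .thenComparingInt(p -> p[0]))
                .collect(Collectors.toList());
    }
}
```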

Lucene calls these "little" indexes segments.

The obvious concern here is that you'll get a lot of little segments very quickly. To avoid this, you'll need a policy for merging segments into larger ones. For example, if you have more than N segments, you can decide to merge all segments smaller than 10 KB together.
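The example policy in the last sentence can be sketched as below. Everything here is illustrative: the `Segment` and `MergePolicy` classes are made up, postings are `int[]{docId, termFrequency}`, docIds are assumed not to overlap between segments, and the size of the merged segment is approximated as the sum of its inputs. Lucene's real merge policies are considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical segment: its on-disk size plus its term -> postings map. */
final class Segment {
    final long sizeBytes;
    final Map<String, List<int[]>> postings;

    Segment(long sizeBytes, Map<String, List<int[]>> postings) {
        this.sizeBytes = sizeBytes;
        this.postings = postings;
    }
}

final class MergePolicy {

    /**
     * If there are more than maxSegments segments, merges every segment smaller
     * than smallBytes into one new segment; larger segments are kept untouched.
     */
    static List<Segment> maybeMerge(List<Segment> segments, int maxSegments, long smallBytes) {
        if (segments.size() <= maxSegments) {
            return segments; // nothing to do yet
        }
        List<Segment> result = new ArrayList<>();
        Map<String, List<int[]>> merged = new TreeMap<>();
        long mergedSize = 0;
        int mergedCount = 0;
        for (Segment segment : segments) {
            if (segment.sizeBytes >= smallBytes) {
                result.add(segment);
                continue;
            }
            // Append this small segment's postings to the merged posting lists.
            for (Map.Entry<String, List<int[]>> entry : segment.postings.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), t -> new ArrayList<>())
                        .addAll(entry.getValue());
            }
            mergedSize += segment.sizeBytes; // approximation of the merged size
            mergedCount++;
        }
        if (mergedCount > 0) {
            result.add(new Segment(mergedSize, merged));
        }
        return result;
    }
}
```

Calling `maybeMerge(segments, N, 10 * 1024)` after each new segment is written would correspond to the "more than N segments, merge everything under 10 KB" rule.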

