
What is the best way to build an inverted index?

I'm building a small web search engine for searching about 1 million web pages, and I want to know the best way to build the inverted index. Should I use a DBMS, or something else? I'm interested in several angles: storage cost, performance, and the speed of indexing and querying. I don't want to use any open-source project for this; I want to make my own!

Most of the current closed-source database managers have some sort of full-text indexing capability. Given its popularity, I'd guess most also have pre-written filters for HTML, so searching for something like <p> won't give 1000 hits for every web page.

If you want to do the job entirely on your own, filtering the HTML is probably the single hardest part. From there, building an inverted index takes a lot of text processing and produces a large result, but it's basically pretty simple: you just scan through all the documents and build a list of words and their locations (usually after filtering out extremely common words like "a", "an", "and", etc., which won't be meaningful search terms), then put those all together into one big index.
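
For concreteness, here is a minimal sketch of that scan-and-collect step in Python. It assumes the HTML has already been stripped to plain text; the tiny STOP_WORDS set and the regex tokenizer are illustrative placeholders, not a recommendation:

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}

def tokenize(text):
    # Lowercase and keep runs of letters; assumes HTML is already stripped.
    return re.findall(r"[a-z]+", text.lower())

def build_index(documents):
    """documents: dict mapping doc_id -> plain text."""
    index = defaultdict(list)  # word -> list of (doc_id, position) postings
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word not in STOP_WORDS:
                index[word].append((doc_id, position))
    return index

docs = {1: "The cat sat on the mat", 2: "A cat and a dog"}
print(build_index(docs)["cat"])  # [(1, 1), (2, 1)]
```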

Given the size of the full index, it's often useful to add a second-level index that's small enough that you can be sure it'll easily fit into real memory (e.g., restrict it to a few hundred entries or so). A really small (but somewhat ineffective) version just goes by the first letters of words, so the "A" words start at 0, "B" at 12345, "C" at 34567, and so on. That isn't very effective, though: you get a lot more words that start with "A" than with "X", for example. It's more effective to build your index and then pick a few hundred (or whatever) words that are evenly spaced throughout it, and use that as your first-level index. In theory you could get considerably more elaborate, with something like a B+ tree, but that's usually overkill: out of a million documents, chances are you'll end up with fewer than a hundred thousand words used often enough to make much difference to the index size. Even at that, quite a few of the entries will be things like typos, not real words...
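
To make that concrete, here's a rough sketch of the evenly-spaced first-level index (the toy vocabulary and function names are invented for illustration): pick every n-th word from the sorted vocabulary, then binary-search that small list to decide which block of the full index could contain a query term:

```python
import bisect

def build_first_level(sorted_words, fanout=300):
    # Keep roughly `fanout` entries, evenly spaced through the vocabulary,
    # so the first-level index comfortably fits in memory.
    step = max(1, len(sorted_words) // fanout)
    return sorted_words[::step]

def block_for(first_level, term):
    # Index of the block whose starting word is the largest one <= term;
    # only that block of the full index then needs to be searched.
    return max(0, bisect.bisect_right(first_level, term) - 1)

vocabulary = sorted(["apple", "banana", "cat", "dog",
                     "fig", "grape", "kiwi", "mango"])
first_level = build_first_level(vocabulary, fanout=4)  # ['apple', 'cat', 'fig', 'kiwi']
print(block_for(first_level, "dog"))  # 1 -> the block starting at 'cat'
```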

Perhaps you might want to elaborate on why you don't want to use F/OSS tools like Lucene or Sphinx.

I think this book has your answer, if you're still looking for it:

http://nlp.stanford.edu/IR-book/information-retrieval-book.html

You may want to start with Hadoop. It will distribute your index building effectively over a cluster, and you can use any language for it; Java and Python are recommended. Using Hadoop/MapReduce, you can easily index your web pages, but they will need to be cached/stored on disk first, and you'll need a parser/tokenizer to extract the text. There are some freely available parsers on the net. You can start from here if you want to do it manually. Once you have an index, storing it is another task.
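
As a rough sketch of how index building maps onto MapReduce, the toy example below simulates the two phases in plain Python; on an actual Hadoop cluster the same pair of functions could run as, say, streaming jobs, and the whitespace split stands in for a real parser/tokenizer:

```python
from itertools import groupby
from operator import itemgetter

def mapper(doc_id, text):
    # Emit one (word, doc_id) pair per occurrence.
    for word in text.lower().split():
        yield word, doc_id

def reducer(word, doc_ids):
    # Collapse all pairs for one word into a single posting list.
    return word, sorted(set(doc_ids))

docs = {1: "cat sat", 2: "cat and dog"}
# The framework's shuffle/sort phase, simulated with a plain sort + groupby.
pairs = sorted(p for d, t in docs.items() for p in mapper(d, t))
postings = dict(reducer(w, [d for _, d in grp])
                for w, grp in groupby(pairs, key=itemgetter(0)))
print(postings["cat"])  # [1, 2]
```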
