
Optimize inverted index Java

I am trying to create an inverted index for Wikipedia pages, however I keep running out of memory. I am not sure what else I can do to ensure it doesn't run out of memory. We are talking about roughly 3.9 million words.

indexer.java

public void index() {
    ArrayList<Page> pages = parse(); // Parse XML pages
    HashMap<String, ArrayList<Integer>> postings = getPostings(pages);
}

public HashMap<String, ArrayList<Integer>> getPostings(ArrayList<Page> pages) {
    assert pages != null;

    englishStemmer stemmer = new englishStemmer();
    HashSet<String> stopWords = getStopWords();
    HashMap<String, ArrayList<Integer>> postings = new HashMap<>();
    int count = 0;
    int artCount = 0;

    for (Page page : pages) {

        if (!page.isRedirect()) { // Skip pages that are redirects.

            StringBuilder sb = new StringBuilder();
            artCount = count; // All the words until now
            boolean ignore = false;

            for (char c : page.getText().toCharArray()) {

                if (c == '<') // Ignore words inside <> tags.
                    ignore = true;

                if (!ignore) {
                    if (c != 39) { // Skip apostrophes (ASCII 39) so contractions collapse, e.g. "don't" -> "dont".
                        if (c > 47 && c < 58 || c > 96 && c < 123) // Character c is a number 0-9 or a lower case letter a-z.
                            sb.append(c);

                        else if (c > 64 && c < 91) // Character c is an uppercase letter A-Z.
                            sb.append(Character.toLowerCase(c));

                        else if (sb.length() > 0) { // Check if there is a word up until now.

                            if (sb.length() > 1) { // Ignore single character "words"

                                if (!stopWords.contains(sb.toString())) { // Check if the word is not a stop word.

                                    stemmer.setCurrent(sb.toString());
                                    stemmer.stem(); // Stem the word

                                    String s = stemmer.getCurrent(); // Retrieve the stemmed word

                                    if (!postings.containsKey(s)) // Check if the word already exists in the words map.
                                        postings.put(s, new ArrayList<>()); // If the word is not in the map then create an array list for that word.
                                    postings.get(s).add(page.getId()); // Place the id of the page in the word array list.
                                    count++; // Increase the overall word count for the pages
                                }
                            }
                            sb = new StringBuilder();
                        }
                    }
                }

                if (c == '>')
                    ignore = false;
            }
        }
        page.setCount(count - artCount);
    }
    System.out.println("Word count:" + count);
    return postings;
}

Advantages

Some advantages of this approach are:

  • You can get the number of occurrences of a given word simply by getting the size of the associated ArrayList.
  • Looking up the number of times a given word occurs in a page is relatively easy.

Optimizations

Current optimizations:

  • Ignoring common words (stop words).
  • Stemming words to their roots and storing those.
  • Ignoring common Wikipedia tags that aren't English words (included in the stop word list, such as lt, gt, ref, etc.).
  • Ignoring text within < > tags, such as <pre>, <div>.

Limitations

Array lists become incredibly large as the number of occurrences of a word grows, and the major disadvantage of this approach appears when an array list has to grow: a new backing array is allocated and the items from the previous one have to be copied over. This could be a performance bottleneck. Would a linked list make more sense here, since we are only appending occurrences and never reading them back during indexing? It would also mean that, because linked lists do not rely on an array as their underlying data structure, they can grow without bounds and never need to be reallocated when they become too large.

Alternative approaches

I have considered dumping the counts for each word into a database like MongoDB after each page has been processed, and then appending the new occurrences. The documents would look like {word : [occurrences]}, and the GC could then clean up the postings HashMap after each page has been processed.
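A minimal sketch of what that per-page flush could look like with the MongoDB Java driver. The database/collection names, the flushPage helper and the {word, occurrences} document layout are my own assumptions, not something from the original post:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import java.util.List;

public class MongoPostings {

    // Append one page's occurrences of a word to its document,
    // creating the document the first time the word is seen.
    static void flushWord(MongoCollection<Document> postings, String word, List<Integer> pageIds) {
        postings.updateOne(
                Filters.eq("word", word),             // match the {word: ...} document
                Updates.pushEach("occurrences", pageIds), // append this page's ids
                new UpdateOptions().upsert(true));    // insert if the word is new
    }

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> coll = client.getDatabase("wiki").getCollection("postings");
            flushWord(coll, "inverted", List.of(42)); // e.g. word "inverted" seen on page 42
        }
    }
}

One update per word per page keeps the heap small but generates millions of tiny writes; in practice they would need to be batched (for example with bulkWrite) to be fast enough.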

I've also considered moving the pages loop into the index() method so that the GC can clean up getPostings() before a new page is processed, then merging the new postings after each page, but I don't think that will alleviate the memory burden.

As for the hash maps, would a tree map be a better fit for this situation?

Execution

On my machine this program runs on all 4 cores at 90-100% utilization and takes about 2-2.5 GB of RAM. It runs for over an hour and a half and then fails with a GC out-of-memory error.

I have also considered increasing the available memory for this program, but it needs to run on my instructor's machine as well, so it needs to operate as standard without any "hacks".

I need help making considerable optimizations; I'm not sure what else would help.

TL;DR Most likely your data structure won't fit in memory, no matter what you do.

Side note: you should actually explain what your task is and what your approach is. You don't do that, and instead expect us to read through and poke around in your code.

What you're basically doing is building a multimap of word -> ids of Wikipedia articles. For this, you parse each non-redirect page, divide it into single words, and build the multimap by adding word -> page id mappings.

Let's roughly estimate how big that structure would be. Your assumption is around 4 million words. There are around 5 million articles in the English Wikipedia. The average word length in English is around 5 characters, so let's assume 10 bytes per word and 4 bytes per article id. That gives around 40 MB for the words (keys in the map) and 20 MB for the article ids (values in the map). Assuming a multi-hashmap-like structure, you could estimate the hash map size at around 32*size + 4*capacity bytes.

So far this seems manageable: a few dozen MB.

But there will be around 4 million collections to store the ids of articles, each of them around 8*size bytes (if you use array lists), where size is the number of articles the word is encountered in. According to http://www.wordfrequency.info/ , the top 5,000 words are mentioned in COCA over 300 million times, so I'd expect Wikipedia to be in this range.
That would be around 2.5 GB just for the article ids of the top 5,000 words. This is a good hint that your inverted index structure will probably take too much memory to fit on a single machine.
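A quick back-of-the-envelope check of those numbers; the per-word and per-entry byte sizes are rough assumptions, not measurements of the actual JVM layout:

public class SizeEstimate {
    public static void main(String[] args) {
        long words = 4_000_000L, bytesPerWord = 10;      // ~4M distinct words, ~10 bytes each
        long articles = 5_000_000L, bytesPerId = 4;      // ~5M article ids, 4 bytes each
        System.out.printf("keys   ~ %d MB%n", words * bytesPerWord / 1_000_000);    // ~40 MB
        System.out.printf("values ~ %d MB%n", articles * bytesPerId / 1_000_000);   // ~20 MB

        long topWordOccurrences = 300_000_000L;          // occurrences of the top 5,000 words
        long bytesPerListEntry = 8;                      // rough cost of one ArrayList slot
        System.out.printf("top-5k posting lists ~ %.1f GB%n",
                topWordOccurrences * bytesPerListEntry / 1e9);                       // ~2.4 GB
    }
}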

However, I don't think your problem is the size of the resulting structure alone. Your code indicates that you load all pages into memory first and process them later on. And that definitely won't work.

You'll most probably need to process pages in a stream-like fashion and use some kind of database to store the results. There are basically a thousand ways to do that; I'd personally go with a Hadoop job on AWS with PostgreSQL as the database, utilizing its UPSERT feature.
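For the PostgreSQL route, here is a rough sketch of a per-page flush using a JDBC batch and an UPSERT. The table definition postings(word text primary key, page_ids integer[]), the flushPage helper and the batching are my own assumptions, not part of the answer above:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.Map;

public class PostgresPostings {

    // Insert a new row per word, or append the new page ids to the existing array.
    private static final String UPSERT =
            "INSERT INTO postings (word, page_ids) VALUES (?, ?) " +
            "ON CONFLICT (word) DO UPDATE SET page_ids = postings.page_ids || EXCLUDED.page_ids";

    static void flushPage(Connection conn, Map<String, int[]> pagePostings) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(UPSERT)) {
            for (Map.Entry<String, int[]> e : pagePostings.entrySet()) {
                Integer[] ids = Arrays.stream(e.getValue()).boxed().toArray(Integer[]::new);
                ps.setString(1, e.getKey());
                ps.setArray(2, conn.createArrayOf("integer", ids)); // maps to an integer[] column
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip per page instead of one per word
        }
    }
}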

ArrayList is a candidate for replacement by a class Index you'll have to write. It should use an int[] for storing index values and a reallocation strategy whose increment is based on the overall growth rate of the word it belongs to. (ArrayList grows by 50% of the old capacity, and this may not be optimal for rare words.) Also, it should leave room for optimizing the storage of ranges by storing the first index and the negative count of the following numbers, e.g.,

..., 100, -3,...   is index values for 100, 101, 102, 103

This may result in saving entries for frequently occurring words at the cost of a few cycles.
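A rough sketch of such an Index class under those assumptions; the class and method names, the ~50% growth step and the details of the run encoding are my own choices, not a finished design:

import java.util.Arrays;

// Page ids are kept in a primitive int[] (no Integer boxing). A run of consecutive
// ids is stored as the pair (firstId, -n), where n is the number of ids following
// firstId, so 100, -3 stands for 100, 101, 102, 103. Page ids are assumed non-negative.
final class Index {
    private int[] data = new int[2]; // start tiny: most words are rare
    private int size = 0;

    void add(int pageId) {
        if (size > 0) {
            int last = data[size - 1];
            if (last < 0) {                                   // last entry is a run (first, -n)
                if (pageId == data[size - 2] - last + 1) {    // next consecutive id: extend the run
                    data[size - 1]--;
                    return;
                }
            } else if (pageId == last + 1) {                  // consecutive to a single id: start a run
                append(-1);
                return;
            }
        }
        append(pageId);                                       // plain single id
    }

    int occurrences() {                                       // total ids represented, runs expanded
        int n = 0;
        for (int i = 0; i < size; i++) n += data[i] < 0 ? -data[i] : 1;
        return n;
    }

    private void append(int value) {
        if (size == data.length)
            data = Arrays.copyOf(data, size + Math.max(2, size / 2)); // grow by ~50%
        data[size++] = value;
    }
}

The postings map would then become a HashMap<String, Index>, and the inner loop would call postings.computeIfAbsent(s, k -> new Index()).add(page.getId()); there is no Integer boxing, and runs of consecutive page ids collapse into two ints.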

Consider dumping the postings HashMap to a file after a certain number of index values has been entered, then continuing with an empty map. If the files are sorted by key, this permits a relatively simple merge of two or more files.
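A minimal sketch of that spill-and-merge idea, assuming a simple "word&lt;TAB&gt;id,id,..." line format sorted by word; the file format, helper names and I/O handling are my own assumptions:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PostingsSpill {

    // Write the current in-memory postings to a spill file sorted by word,
    // so the caller can clear the map and keep indexing.
    static void spill(Map<String, List<Integer>> postings, Path file) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, List<Integer>> e : new TreeMap<>(postings).entrySet()) {
                out.write(e.getKey());
                out.write('\t');
                out.write(e.getValue().toString().replaceAll("[\\[\\] ]", "")); // "1,2,3"
                out.newLine();
            }
        }
    }

    // Merge two sorted spill files, joining the id lists when a word appears in both.
    static void merge(Path a, Path b, Path out) throws IOException {
        try (BufferedReader ra = Files.newBufferedReader(a);
             BufferedReader rb = Files.newBufferedReader(b);
             BufferedWriter w = Files.newBufferedWriter(out)) {
            String la = ra.readLine(), lb = rb.readLine();
            while (la != null || lb != null) {
                int cmp = la == null ? 1 : lb == null ? -1
                        : la.substring(0, la.indexOf('\t'))
                            .compareTo(lb.substring(0, lb.indexOf('\t')));
                if (cmp < 0)      { w.write(la); w.newLine(); la = ra.readLine(); }
                else if (cmp > 0) { w.write(lb); w.newLine(); lb = rb.readLine(); }
                else {            // same word in both files: concatenate the id lists
                    w.write(la + "," + lb.substring(lb.indexOf('\t') + 1));
                    w.newLine();
                    la = ra.readLine();
                    lb = rb.readLine();
                }
            }
        }
    }
}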
