Searching multiple HashMaps at the same time

Question

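tl;dr: How can I search for an entry in multiple (read-only) Java HashMaps at the same time?

Long version:

I have several dictionaries of different sizes, each stored as a HashMap<String, String>. Once they have been read in, they are never changed (strictly read-only). I want to check whether, and in which dictionary, an entry is stored under my key.

My code originally looked up the key like this: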

public DictionaryEntry getEntry(String key) {
    for (int i = 0; i < _numDictionaries; i++) {
        HashMap<String, String> map = getDictionary(i);
        if (map.containsKey(key))
             return new DictionaryEntry(map.get(key), i);
    }
    return null;
}

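Then it got more complicated: my search string may contain typos, or be a variant of the stored entry. For example, if the stored key is "banana", I might look up "bannana" or "a banana" and still want the entry for "banana" returned. So now I iterate over all dictionaries and over every entry in them, using the Levenshtein distance: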

public DictionaryEntry getEntry(String key) {
    for (int i = 0; i < _numDictionaries; i++) {
        HashMap<String, String> map = getDictionary(i);
        for (Map.Entry<String, String> entry : map.entrySet()) {
            // Calculate Levenshtein distance, store closest match etc.
        }
    }
    // return closest match or null.
}

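So far everything works and I get the entries I want. Unfortunately, I have to look up about 7,000 strings in five dictionaries of different sizes (roughly 30-70k entries each), and that takes a while. From my processing output, I have the strong impression that these lookups dominate the overall runtime.

My first idea for improving the runtime was to search all dictionaries in parallel. Since none of the dictionaries is ever changed and no more than one thread accesses a dictionary at a time, I don't see any safety concerns.

The question is: how do I do this? I have never used multithreading before. My searches only turned up ConcurrentHashMap (which, as far as I understand, I don't need) and the Runnable class, where I would have to put my processing into the run() method. I think I could rewrite my current class to fit Runnable, but I was wondering whether there is a simpler way to do this (or how I could do it simply with Runnable; right now my limited understanding is that I would have to restructure a lot).

Since I was asked to share the Levenshtein logic: it's really nothing fancy, but here you go: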

private int _maxLSDistance = 10;
public Map.Entry<String, String> getClosestMatch(String key) {
    Map.Entry<String, String> closestMatch = null;
    int closestDistance = Integer.MAX_VALUE;

    if (key == null) {
        return null;
    }

    for (Map.Entry<String, String> entry : _dictionary.entrySet()) {
        // Perfect match
        if (entry.getKey().equals(key)) {
            return entry;
        }

        // Similar match (StringUtils is presumably Apache Commons Lang)
        int dist = StringUtils.getLevenshteinDistance(entry.getKey(), key);

        // Keep the entry if "dist" is below the threshold and smaller than
        // the distance of the best entry stored so far
        if (dist < _maxLSDistance && dist < closestDistance) {
            closestMatch = entry;
            closestDistance = dist;
        }
    }
    return closestMatch;
}

Answer 1

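To use multithreading in your case, it could look something like this:

A "monitor" class that basically stores the results and coordinates the threads: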

import java.util.ArrayList;

public class Results {

    private int nrOfDictionaries = 4; // number of dictionary searches still running

    private ArrayList<String> results = new ArrayList<String>();

    public void prepare() {
        nrOfDictionaries = 4;
        results = new ArrayList<String>();
    }

    public synchronized void oneDictionaryFinished() {
        nrOfDictionaries--;
        System.out.println("one dictionary finished");
        notifyAll();
    }

    public synchronized boolean isReady() throws InterruptedException {

        while (nrOfDictionaries != 0) {
            wait();
        }

        return true;
    }

    public synchronized void addResult(String result) {
        results.add(result);
    }

    public ArrayList<String> getAllResults() {
        return results;
    }
}

The thread itself, which can be set up to search a specific dictionary:

public class ThreadDictionarySearch extends Thread {

    // the actual dictionary
    private String dictionary;
    private Results results;

    public ThreadDictionarySearch(Results results, String dictionary) {
        this.dictionary = dictionary;
        this.results = results;
    }

    @Override
    public void run() {

        for (int i = 0; i < 4; i++) {
            // search dictionary;
            results.addResult("result of " + dictionary);
            System.out.println("adding result from " + dictionary);
        }

        results.oneDictionaryFinished();
    }

}

And the main method for the demo:

public static void main(String[] args) throws Exception {

    Results results = new Results();

    ThreadDictionarySearch threadA = new ThreadDictionarySearch(results, "dictionary A");
    ThreadDictionarySearch threadB = new ThreadDictionarySearch(results, "dictionary B");
    ThreadDictionarySearch threadC = new ThreadDictionarySearch(results, "dictionary C");
    ThreadDictionarySearch threadD = new ThreadDictionarySearch(results, "dictionary D");

    threadA.start();
    threadB.start();
    threadC.start();
    threadD.start();

    // isReady() blocks here until all dictionaries have been searched,
    // because Results tells the caller to wait() while not finished.
    if (results.isReady()) {
        for (String string : results.getAllResults()) {
            System.out.println("RESULT: " + string);
        }
    }
}

Answer 2

I think the simplest approach is to use a stream over the entry set:

public DictionaryEntry getEntry(String key) {
  for (int i = 0; i < _numDictionaries; i++) {
    HashMap<String, String> map = getDictionary(i);

    map.entrySet().parallelStream().forEach(entry -> {
      // Calculate Levenshtein distance, store closest match etc.
      // (any shared state updated here must be thread-safe)
    });
  }
  // return closest match or null.
}

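This assumes you are using Java 8. You could also wrap the outer loop in an IntStream. Alternatively, you could use Stream.reduce directly to get the entry with the smallest distance.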
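As a rough sketch of the reduction idea, the following uses Stream.min (a special case of a reduction) on a parallel stream per dictionary, so no shared mutable state is needed. The Match class, the method names and the hand-written levenshtein helper are assumptions made for this illustration; they are not part of the original code, which uses StringUtils instead.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ParallelClosestMatch {

    /** Hypothetical result holder: matched key, its dictionary index, and the distance. */
    static final class Match {
        final String key;
        final int dictionaryIndex;
        final int distance;

        Match(String key, int dictionaryIndex, int distance) {
            this.key = key;
            this.dictionaryIndex = dictionaryIndex;
            this.distance = distance;
        }
    }

    /** Plain dynamic-programming Levenshtein distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Closest stored key across all dictionaries with a distance strictly below maxDistance. */
    static Optional<Match> findClosest(List<Map<String, String>> dictionaries, String query, int maxDistance) {
        Match best = null;
        for (int i = 0; i < dictionaries.size(); i++) {
            final int index = i;
            // The keys of one dictionary are scored in parallel; min() does the reduction.
            Optional<Match> candidate = dictionaries.get(i).keySet().parallelStream()
                    .map(k -> new Match(k, index, levenshtein(k, query)))
                    .filter(m -> m.distance < maxDistance)
                    .min(Comparator.comparingInt(m -> m.distance));
            if (candidate.isPresent() && (best == null || candidate.get().distance < best.distance)) {
                best = candidate.get();
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Calling findClosest(dictionaries, "bannana", 10) would then yield the match for "banana" together with the index of the dictionary it came from, provided no other key is closer.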

Answer 3

Maybe try a thread pool:

ExecutorService es = Executors.newFixedThreadPool(_numDictionaries);
for (int i = 0; i < _numDictionaries; i++) {
    //prepare a Runnable implementation that contains a logic of your search
    es.submit(prepared_runnable);
}
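To make the pool idea a bit more concrete, here is a minimal sketch that submits one Callable per dictionary via invokeAll and collects the results in dictionary order. The class and method names are made up for this example, and containsKey merely stands in for the real (fuzzy) search logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PooledDictionarySearch {

    /** Returns the index of the first dictionary containing the key, or -1 if none does. */
    static int findDictionary(List<Map<String, String>> dictionaries, String key) {
        // Assumes at least one dictionary; newFixedThreadPool rejects a size of 0.
        ExecutorService pool = Executors.newFixedThreadPool(dictionaries.size());
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (Map<String, String> dictionary : dictionaries) {
                // Placeholder search; the Levenshtein scan would go here instead.
                tasks.add(() -> dictionary.containsKey(key));
            }
            // invokeAll blocks until every task is done and keeps the task order.
            List<Future<Boolean>> futures = pool.invokeAll(tasks);
            for (int i = 0; i < futures.size(); i++) {
                if (futures.get(i).get()) {
                    return i;
                }
            }
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Search was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Search task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For thousands of lookups it would probably be better to create the pool once and reuse it, rather than creating and shutting it down for every key as this sketch does.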

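I also believe you could try a quick estimate to spot strings that cannot match at all (for example, because their lengths differ significantly), and use it to finish the computation early and move on to the next candidate.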
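One way to sketch such a quick estimate: the Levenshtein distance between two strings is never smaller than the difference of their lengths, so a candidate whose length difference already reaches the current best distance can be skipped without computing anything. The names below (LengthPrefilter, closestKey) and the hand-written distance function are assumptions for illustration only.

```java
import java.util.Collection;

class LengthPrefilter {

    /** Plain dynamic-programming Levenshtein distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the key closest to the query with distance below maxDistance, or null. */
    static String closestKey(Collection<String> keys, String query, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance; // only distances strictly below this value count
        for (String candidate : keys) {
            // Cheap lower bound: the distance is at least the length difference,
            // so this candidate cannot beat the best match found so far.
            if (Math.abs(candidate.length() - query.length()) >= bestDistance) {
                continue;
            }
            int distance = levenshtein(candidate, query);
            if (distance < bestDistance) {
                best = candidate;
                bestDistance = distance;
            }
        }
        return best;
    }
}
```

Because bestDistance shrinks as better matches are found, the filter gets stricter over the course of the scan.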

Answer 4

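I strongly doubt that HashMaps are a good fit here, especially if you want fuzzy matching of words. You should use a proper full-text search solution such as Elasticsearch or Apache Solr, or at least an available engine such as Apache Lucene.

That being said, you can use a poor man's version: create an array of your maps and a SortedMap, iterate over the array, take the keys of the current HashMap and store them in the SortedMap together with the index of their HashMap. To retrieve a key, you first look it up in the SortedMap, use the stored index to get the corresponding HashMap from the array, and then look up the key in only that one HashMap. This should be fast enough without needing multiple threads to dig through the HashMaps. However, you could turn the code below into a Runnable and perform several lookups in parallel.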

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class Search {

    public static void main(String[] arg) {

        if (arg.length == 0) {
            System.out.println("Must give a search word!");
            System.exit(1);
        }

        String searchString = arg[0].toLowerCase();

        /*
         * Populating our HashMaps.
         */
        HashMap<String, String> english = new HashMap<String, String>();
        english.put("banana", "fruit");
        english.put("tomato", "vegetable");

        HashMap<String, String> german = new HashMap<String, String>();
        german.put("Banane", "Frucht");
        german.put("Tomate", "Gemüse");

        /*
         * Now we create our ArrayList of HashMaps for fast retrieval
         */

        List<HashMap<String, String>> maps = new ArrayList<HashMap<String, String>>();
        maps.add(english);
        maps.add(german);


        /*
         * This is our index
         */
        SortedMap<String, Integer> index = new TreeMap<String, Integer>(String.CASE_INSENSITIVE_ORDER);


        /*
         * Populating the index:
         */
        for (int i = 0; i < maps.size(); i++) {
            // We iterate through our HashMaps...
            HashMap<String, String> currentMap = maps.get(i);

            for (String key : currentMap.keySet()) {
                /* ...and populate our index with lowercase versions of the keys,
                 * referencing the array from which the key originates.
                 */ 
                index.put(key.toLowerCase(), i);
            }

        }


         // In case our index contains our search string...
        if (index.containsKey(searchString)) {

            /* 
             * ... we find out in which map of the ones stored in maps
             * the word in the index originated from.
             */
            Integer mapIndex = index.get(searchString);

            /*
             * Next, we look up said map.
             */
            HashMap<String, String> origin = maps.get(mapIndex);

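            /*
             * Last, we retrieve the value from the origin map.
             * Note: this get() is case-sensitive, so it only finds keys that are
             * stored in lowercase ("banana"); for "Banane" it returns null.
             */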

            String result = origin.get(searchString);

            /*
             * The above steps can be shortened to
             *  String result = maps.get(index.get(searchString).intValue()).get(searchString);
             */

            System.out.println(result);
        } else {
            System.out.println("\"" + searchString + "\" is not in the index!");
        }
    }

}

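Note that this is only a naive implementation for illustration purposes. It leaves several problems unaddressed (for example, you cannot have duplicate index entries).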
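One possible way around the duplicate-entry limitation, sketched under my own assumptions, is to let each index entry hold a list of locations instead of a single map index. Storing the original spelling of the key alongside the index also keeps the final lookup working when the stored key is not lowercase. The Location class and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class MultiDictionaryIndex {

    /** Hypothetical index entry: which dictionary holds the key, and how it is spelled there. */
    static final class Location {
        final int dictionaryIndex;
        final String originalKey;

        Location(int dictionaryIndex, String originalKey) {
            this.dictionaryIndex = dictionaryIndex;
            this.originalKey = originalKey;
        }
    }

    /** Builds a case-insensitive index from every key to all places where it occurs. */
    static SortedMap<String, List<Location>> buildIndex(List<Map<String, String>> maps) {
        SortedMap<String, List<Location>> index = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (int i = 0; i < maps.size(); i++) {
            for (String key : maps.get(i).keySet()) {
                // Keys that differ only in case end up in the same list.
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(new Location(i, key));
            }
        }
        return index;
    }

    /** Returns the values stored under the query in every dictionary that has it. */
    static List<String> lookup(SortedMap<String, List<Location>> index,
                               List<Map<String, String>> maps, String query) {
        List<String> values = new ArrayList<>();
        List<Location> locations = index.get(query);
        if (locations != null) {
            for (Location location : locations) {
                values.add(maps.get(location.dictionaryIndex).get(location.originalKey));
            }
        }
        return values;
    }
}
```

This still only covers exact (case-insensitive) matches; fuzzy matching would need a scan or a dedicated search engine as mentioned above.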

With this solution, you are basically trading startup speed for query speed.

Answer 5

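Okay!

Since your concern is getting a faster response,

I would suggest dividing the work between threads.

Say you have 5 dictionaries: you could keep three of them in one thread and let another thread take care of the remaining two. Whichever thread finds the match then stops or terminates the other one.

You may need some extra logic to divide the work, but that should not affect your runtime.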
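A minimal sketch of this split, assuming exact lookups and using ExecutorService.invokeAny, which returns the result of the first task that completes without an exception and cancels the rest. The class and method names are invented for this example; a task that finds nothing throws, so that it does not count as a successful result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SplitSearch {

    /** Searches one group of dictionaries; throws if the key is in none of them. */
    static String searchAll(List<Map<String, String>> group, String key) {
        for (Map<String, String> dictionary : group) {
            if (Thread.currentThread().isInterrupted()) {
                break; // the other thread already found a match
            }
            String value = dictionary.get(key);
            if (value != null) {
                return value;
            }
        }
        throw new NoSuchElementException(key);
    }

    /** Returns the value found by whichever half finishes first, or null if neither has the key. */
    static String lookup(List<Map<String, String>> dictionaries, String key) {
        int mid = (dictionaries.size() + 1) / 2; // e.g. 5 dictionaries -> 3 + 2
        List<Callable<String>> tasks = new ArrayList<>();
        tasks.add(() -> searchAll(dictionaries.subList(0, mid), key));
        tasks.add(() -> searchAll(dictionaries.subList(mid, dictionaries.size()), key));

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return pool.invokeAny(tasks);
        } catch (ExecutionException e) {
            return null; // neither half contained the key
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            pool.shutdownNow(); // interrupt whatever is still running
        }
    }
}
```

With fuzzy matching, stopping early like this only makes sense for a perfect match; otherwise both halves have to finish before the closest entry is known.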

You may also need a few more changes in your code to get the closest match:

for (Map.Entry entry : _dictionary.entrySet()) {

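You are using entrySet, but you are not using the values anyway, and getting the entry set seems to be a bit expensive. I would suggest just using keySet, since you are not really interested in the values of that map:

 for (String storedKey : _dictionary.keySet()) {

For more details on map performance, please read the linked article, "Map Performances".

Iteration over the collection views of a LinkedHashMap requires time proportional to the size of the map, regardless of its capacity. Iteration over a HashMap is likely to be more expensive, requiring time proportional to its capacity.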

Answer 1: score 2, accepted, 2015-07-30 11:27:00
Answer 2: score 0, 2015-07-30 11:27:54
Answer 3: score 0, 2015-07-30 11:28:26
Answer 4: score 0, 2015-07-30 13:32:17
Answer 5: score 0, 2015-08-01 16:03:32
