Using Scala's ParHashMap in a Java project instead of ConcurrentHashMap

Question

I've got a fairly complicated project that makes heavy use of Java's multithreading. In an answer to one of my previous questions I described an ugly hack that is supposed to work around the inherent inability to iterate over Java's ConcurrentHashMap in parallel. Although it works, I don't like ugly hacks, and I've had a lot of trouble trying to introduce the proposed proof of concept into the real system. While looking for an alternative I came across Scala's ParHashMap, which claims to implement a foreach method that seems to operate in parallel. Before I start learning a new language to implement a single feature, I'd like to ask the following:

1) Is the foreach method of Scala's ParHashMap scalable?

2) Is it simple and straightforward to call Java code from Scala and vice versa? Keep in mind that the code is concurrent and uses generics.

3) Is there going to be a performance penalty for switching part of the codebase to Scala?

For reference, this is my previous question about parallel iteration of ConcurrentHashMap:

Scalable way to access every element of ConcurrentHashMap<Element, Boolean> exactly once

EDIT

I have implemented the proof of concept, in probably very non-idiomatic Scala, but it works just fine. AFAIK it is IMPOSSIBLE to implement a corresponding solution in Java given the current state of its standard library and any available third-party libraries.

import scala.collection.parallel.mutable.ParHashMap

class Node(value: Int, id: Int) {
    var v = value
    var i = id
    override def toString(): String = v.toString
}

object testParHashMap {
    // Called once per entry; foreach on a ParHashMap may run this on several threads.
    def visit(entry: (Int, Node)): Unit = {
        entry._2.v += 1
    }

    def main(args: Array[String]): Unit = {
        val hm = new ParHashMap[Int, Node]()
        for (i <- 1 to 10) {
            val node = new Node(0, i)
            hm.put(node.i, node)
        }

        println("========== BEFORE ==========")
        hm.foreach(println)

        // Increment every node's value in parallel.
        hm.foreach(visit)

        println("========== AFTER ==========")
        hm.foreach(println)
    }
}

Answer 1

I come to this with some caveats:

Though I can do some things, I consider myself relatively new to Scala.
I have only read about, but never used, the par stuff described here.
I have never tried to accomplish what you are trying to accomplish.

If you still care what I have to say, read on.

First, here is an academic paper describing how the parallel collections work.

On to your questions.

1) When it comes to multi-threading, Scala makes life so much easier than Java. The abstractions are just awesome. The ParHashMap you get from a par call will distribute the work across multiple threads. I can't say how that will scale for you without a better understanding of your machine, configuration, and use case, but done right (particularly with regard to side effects) it should be at least as good as a Java implementation. However, you might also want to look at Akka to have more control over everything. It sounds like that might suit your use case better than ParHashMap alone.

2) It is generally simple to convert between Java and Scala collections using JavaConverters and its asJava and asScala methods. I would suggest, though, making sure that the public API of your methods "looks Java", since Java is the lowest common denominator. Besides, in this scenario Scala is an implementation detail, and you never want to leak those anyway. So keep the abstraction at the Java level.

3) I would guess there will actually be a performance gain with Scala at runtime. However, you will find compile times much slower (which can be worked around, ish). This Stack Overflow post by the author of Scala is old but still relevant.

Hope that helps. That's quite a problem you've got there.

Answer 2

Since Scala compiles to the same bytecode as Java, doing the same thing in both languages is perfectly possible, no matter the task. Some things are easier to solve in Scala, but whether that makes it worth learning a new language is a different question. Especially since Java 8 will include exactly what you ask for: simple parallel execution of functions on lists.
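For readers already on Java 8 or later, the parallel iteration mentioned above could look roughly like the sketch below. It relies on ConcurrentHashMap.forEach(parallelismThreshold, action), which was added in Java 8. The CounterNode class and the buildMap and visitAll methods are hypothetical names that mirror the question's Scala example; they are not taken from the original project. The sketch also assumes the map is not modified while it is being traversed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the question's Node class (not from the original project).
final class CounterNode {
    final int id;
    final AtomicInteger value = new AtomicInteger(0); // atomic, so concurrent visits are safe

    CounterNode(int id) {
        this.id = id;
    }
}

final class ParallelVisitSketch {
    // Builds a map with keys 1..n, mirroring the setup loop in the question.
    static ConcurrentHashMap<Integer, CounterNode> buildMap(int n) {
        ConcurrentHashMap<Integer, CounterNode> map = new ConcurrentHashMap<>();
        for (int i = 1; i <= n; i++) {
            map.put(i, new CounterNode(i));
        }
        return map;
    }

    // Applies the visit action to every entry; the work may be split across
    // the common ForkJoinPool. A threshold of 1 requests maximal parallelism.
    static void visitAll(ConcurrentHashMap<Integer, CounterNode> map) {
        map.forEach(1L, (id, node) -> node.value.incrementAndGet());
    }

    public static void main(String[] args) {
        ConcurrentHashMap<Integer, CounterNode> map = buildMap(10);
        visitAll(map);
        // Sequential print of the result; iteration order is not guaranteed.
        map.forEach((id, node) -> System.out.println(id + " -> " + node.value.get()));
    }
}
```

Whether this scales better than a hand-rolled thread pool depends on how expensive the per-entry work is; for very cheap actions a larger threshold may be the better choice.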

Even before Java 8 you can do this in Java; you just need to write yourself what Scala already provides.

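import java.util.Map.Entry;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
//...
final int threads = Runtime.getRuntime().availableProcessors();
final ExecutorService executor = Executors.newFixedThreadPool(threads);
//...
// Snapshot the entries. toArray() without an argument returns Object[],
// which cannot be cast to Entry[], so pass a typed array instead.
@SuppressWarnings("unchecked")
final Entry<String, String>[] elements = myMap.entrySet().toArray(new Entry[0]);
final AtomicInteger index = new AtomicInteger(elements.length);

for (int i = threads; i > 0; --i) {
  executor.submit(new Runnable() {
    @Override
    public void run() {
      int myIndex;
      // Each worker claims the next unprocessed index until none are left.
      while ((myIndex = index.decrementAndGet()) >= 0) {
        process(elements[myIndex]);
      }
    }
  });
}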

The trick is to pull the elements into a temporary array, so that threads can take elements out in a thread-safe way. Obviously, some caching is encouraged here instead of re-creating the Runnables and the array each time, because creating the Runnables might already take longer than the actual task.

It is also possible to copy the elements into a (reusable) LinkedBlockingQueue and have the threads poll or take from it instead. This adds more overhead, however, and is only reasonable for tasks that require at least some calculation time.
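The queue-based variant could be sketched as follows. This is only an illustration under assumptions: QueueDrainSketch and processAll are hypothetical names, the action parameter stands in for the process call used above, and the lambdas require Java 8 or later (on older versions an anonymous Runnable would do the same job). The queue is built fresh on each call here rather than reused.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class QueueDrainSketch {
    // Copies the entries into a queue and lets `threads` workers drain it.
    // Returns true if all workers finished within the (arbitrary) one-minute timeout.
    static <K, V> boolean processAll(Map<K, V> map, int threads,
                                     Consumer<Map.Entry<K, V>> action) {
        LinkedBlockingQueue<Map.Entry<K, V>> queue =
                new LinkedBlockingQueue<>(map.entrySet());
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            executor.submit(() -> {
                Map.Entry<K, V> entry;
                // poll() returns null once the queue is empty, which ends the worker.
                while ((entry = queue.poll()) != null) {
                    action.accept(entry);
                }
            });
        }
        executor.shutdown(); // no new tasks; already submitted workers keep running
        try {
            return executor.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> map = new ConcurrentHashMap<>();
        for (int i = 0; i < 10; i++) {
            map.put("key" + i, "value" + i);
        }
        boolean finished = processAll(map, 4,
                e -> System.out.println(e.getKey() + " = " + e.getValue()));
        System.out.println("finished: " + finished);
    }
}
```

Because every entry is removed from the queue by exactly one poll() call, each entry in the snapshot is handed to exactly one worker.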

I don't know how Scala actually works, but given that it needs to run on the same JVM, it will do something similar in the background; it just happens to be easily accessible in the standard library.
