
ConcurrentHashMap implementation and limitations

I have quite a large project to accomplish and I'm running into some dead ends. I wanted to see if the great community here had any suggestions.

I have a large data set and I'm attempting to build a social graph. The data contains over 9.5 million mappings of coordinates to a Short value. For the keys in the ConcurrentHashMap I am using a String: the two coordinates concatenated with a ',' in between.

Essentially, I'm finding the number of groups in common between users. I have an initial hashmap, built quite easily, that maps a GroupID to a Vector of AvatarIDs. This part runs fine. Then I have 12 threads, each responsible for its own set of GroupIDs, which do the processing (adding +1 to the count between each pair of users in a groupID); all of this access goes through the ConcurrentHashMap.
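
For illustration, a rough sketch of what that counting step looks like (the class name and the "x,y" key format here are illustrative, not the exact code):

class PairCounter {
  // key = "smallerId,largerId", value = number of groups the two users share so far
  static final java.util.concurrent.ConcurrentHashMap<String, Short> counts =
      new java.util.concurrent.ConcurrentHashMap<String, Short>();

  // Each worker thread calls this for every group in its own set of GroupIDs.
  static void countGroup(java.util.Vector<Integer> members) {
    for (int i = 0; i < members.size(); i++) {
      for (int j = i + 1; j < members.size(); j++) {
        int a = Math.min(members.get(i), members.get(j));
        int b = Math.max(members.get(i), members.get(j));
        String key = a + "," + b;
        Short old = counts.get(key);                      // read the current count
        counts.put(key, old == null ? (short) 1
                                    : (short) (old + 1)); // then write it back
      }
    }
  }
}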

After about 8000 groups have been processed, access becomes a problem. Only one thread at a time seems to be active, and I'm unsure whether this is because of the massive size or some other factor. This is a problem because I have 300,000 groups that need to be processed in total (and in a timely manner).

Is there any advice as to how I'm implementing this, and any shortcuts I can use? Reads and writes are, I believe, equally important, since I have to read a coordinate's value if it exists (or create it if not), add one to it, and write it back.
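
As a side note, that get-then-add-then-write sequence is not atomic on its own. One way to make the increment atomic on a pre-Java-8 ConcurrentHashMap is a putIfAbsent/replace loop; a minimal sketch (the method name is illustrative):

static void increment(java.util.concurrent.ConcurrentHashMap<String, Short> counts, String key) {
  for (;;) {
    Short old = counts.putIfAbsent(key, (short) 1);
    if (old == null)
      return;                                     // no previous value: entry created
    if (counts.replace(key, old, (short) (old + 1)))
      return;                                     // increment applied without interference
    // another thread updated the value in between; re-read and retry
  }
}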

I am willing to provide code as needed; I just don't know which portions will be relevant to the discussion yet.

Thanks for your time, -mojavestorm

Further explanation:

Two implementations and their limits:

1) I have a HashMap<Integer, Vector<Integer>> preMap that contains a GroupID as key and a Vector of userIDs as value. The threads split the GroupIDs up between themselves, and using each Vector<Integer> returned, each thread stores a short value keyed by a coordinate (saying UserID x and UserID y belong to (short) n groups together) into a TLongShortHashMap threadMap; each thread owns its own threadMap. The coordinates are mapped to long values. After each thread completes, the values of corresponding keys in each of the threadMaps are added to the same key in a combinedMap, which shows how many groups UserID x and UserID y belong to together in the whole system.
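
Roughly, the packing and merging described above can be sketched like this (assuming Trove 3 package names; everything apart from TLongShortHashMap and adjustOrPutValue is illustrative):

class PackAndMerge {
  // Pack two non-negative int user IDs into one long key (smaller ID in the high 32 bits).
  static long key(int userX, int userY) {
    int a = Math.min(userX, userY), b = Math.max(userX, userY);
    return ((long) a << 32) | (b & 0xFFFFFFFFL);
  }

  // Each worker updates only its own threadMap, so no locking is needed here.
  static void count(gnu.trove.map.hash.TLongShortHashMap threadMap, int userX, int userY) {
    threadMap.adjustOrPutValue(key(userX, userY), (short) 1, (short) 1);
  }

  // After the workers finish, fold each threadMap into the combined map.
  static void merge(gnu.trove.map.hash.TLongShortHashMap combined,
                    gnu.trove.map.hash.TLongShortHashMap threadMap) {
    for (long k : threadMap.keys()) {
      short n = threadMap.get(k);
      combined.adjustOrPutValue(k, n, n);
    }
  }
}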

The problem with this implementation is that there is high overlap between threads, so excessive short values are created. For example, User 1 and User 2 belong to various groups together. Thread A and Thread B are each responsible for their own range of groups, including ones both User 1 and User 2 belong to, so both Thread A and Thread B store in their own threadMap a long key for coordinate (1, 2) and a short value. If excessive overlap occurs, the memory requirement becomes enormous. In my case, all 46 GB of RAM I allocate to Java gets used up, and quite quickly too.

2) Using the same preMap, in this implementation each thread is given a range of user coordinates it is responsible for. Each thread runs, takes each coordinate it has, and iterates through preMap, checking each groupID and seeing whether UserID x and UserID y both belong to the vector returned from the preMap. This implementation eliminates the overlap that occurs between threadMaps.
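
In other words, for each user pair the second approach scans every group, roughly like this (a sketch; names are assumed), which is why the running time blows up to pairs x groups:

// Count how many groups contain both users by scanning the whole preMap.
static short commonGroups(java.util.Map<Integer, java.util.Vector<Integer>> preMap,
                          int userX, int userY) {
  short n = 0;
  for (java.util.Vector<Integer> members : preMap.values()) {
    if (members.contains(userX) && members.contains(userY))
      n++;
  }
  return n;
}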

The problem with this one is time. Right now the program is on pace to take a stunning 1400 years to complete. Memory use wavers between about 4 GB and 15 GB but seems to stay 'low'; I'm not completely sure it will stay within the limit, but I imagine it will. No improvements are apparent to me.

Hopefully these descriptions are clear and will help give insight into my problem. Thanks.

I would have each thread process its own Map. This means each thread can work independently. Once the threads have finished you can combine all the results. (Or possibly combine the results as they complete, but this may add complexity with not much advantage.)
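
A minimal sketch of that structure, assuming an ExecutorService with 12 workers and a simplified HashMap<Long, Integer> in place of the real per-thread maps:

import java.util.*;
import java.util.concurrent.*;

public class PerThreadMaps {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(12);
    List<Future<Map<Long, Integer>>> futures = new ArrayList<Future<Map<Long, Integer>>>();
    for (int t = 0; t < 12; t++) {
      futures.add(pool.submit(new Callable<Map<Long, Integer>>() {
        public Map<Long, Integer> call() {
          Map<Long, Integer> local = new HashMap<Long, Integer>();
          // ... each worker counts its own share of the groups into 'local' ...
          return local;
        }
      }));
    }
    // Combine the per-thread maps only after every task has finished.
    Map<Long, Integer> combined = new HashMap<Long, Integer>();
    for (Future<Map<Long, Integer>> f : futures) {
      for (Map.Entry<Long, Integer> e : f.get().entrySet()) {
        Integer prev = combined.get(e.getKey());
        combined.put(e.getKey(), prev == null ? e.getValue() : prev + e.getValue());
      }
    }
    pool.shutdown();
  }
}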

If you are using a short, I would use a collection like TObjectIntHashMap, which is more efficient for handling primitives.

In the simple case you have short co-ordinates:

public static void main(String... args) throws IOException {
  int length = 10 * 1000 * 1000;
  int[] x = new int[length];
  int[] y = new int[length];

  Random rand = new Random();
  for (int i = 0; i < length; i++) {
    x[i] = rand.nextInt(10000) - rand.nextInt(10000);
    y[i] = rand.nextInt(10000) - rand.nextInt(10000);
  }

  countPointsWithLongIntMap(x, y);
  countPointsWithMap(x, y);

}

private static Map<String, Short> countPointsWithMap(int[] x, int[] y) {
  long start = System.nanoTime();
  Map<String, Short> counts = new LinkedHashMap<String, Short>();
  for (int i = 0; i < x.length; i++) {
    String key = x[i] + "," + y[i];
    Short s = counts.get(key);
    if (s == null)
      counts.put(key, (short) 1);
    else
      counts.put(key, (short) (s + 1));
  }
  long time = System.nanoTime() - start;
  System.out.printf("Took %.3f seconds to use Map<String, Short>%n", time/1e9);

  return counts;
}

private static TIntIntHashMap countPointsWithLongIntMap(int[] x, int[] y) {
  long start = System.nanoTime();
  TIntIntHashMap counts = new TIntIntHashMap();
  for (int i = 0; i < x.length; i++) {
    int key = (x[i] << 16) | (y[i] & 0xFFFF); // pack x and y (each fits in 16 bits here) into one int
    counts.adjustOrPutValue(key, 1, 1);
  }
  long time = System.nanoTime() - start;
  System.out.printf("Took %.3f seconds to use TIntIntHashMap%n", time/1e9);
  return counts;
}

prints

Took 1.592 seconds to use TIntIntHashMap
Took 4.889 seconds to use Map<String, Short>

If you have double co-ordinates, you need to use a two-tier map.

public static void main(String... args) throws IOException {
  int length = 10 * 1000 * 1000;
  double[] x = new double[length];
  double[] y = new double[length];

  Random rand = new Random();
  for (int i = 0; i < length; i++) {
    x[i] = (rand.nextInt(10000) - rand.nextInt(10000)) / 1e4;
    y[i] = (rand.nextInt(10000) - rand.nextInt(10000)) / 1e4;
  }

  countPointsWithLongIntMap(x, y);
  countPointsWithMap(x, y);

}

private static Map<String, Short> countPointsWithMap(double[] x, double[] y) {
  long start = System.nanoTime();
  Map<String, Short> counts = new LinkedHashMap<String, Short>();
  for (int i = 0; i < x.length; i++) {
    String key = x[i] + "," + y[i];
    Short s = counts.get(key);
    if (s == null)
      counts.put(key, (short) 1);
    else
      counts.put(key, (short) (s + 1));
  }
  long time = System.nanoTime() - start;
  System.out.printf("Took %.3f seconds to use Map<String, Short>%n", time / 1e9);

  return counts;
}

private static TDoubleObjectHashMap<TDoubleIntHashMap> countPointsWithLongIntMap(double[] x, double[] y) {
  long start = System.nanoTime();
  TDoubleObjectHashMap<TDoubleIntHashMap> counts = new TDoubleObjectHashMap<TDoubleIntHashMap>();
  for (int i = 0; i < x.length; i++) {
    TDoubleIntHashMap map = counts.get(x[i]);
    if (map == null)
      counts.put(x[i], map = new TDoubleIntHashMap());
    map.adjustOrPutValue(y[i], 1, 1);
  }
  long time = System.nanoTime() - start;
  System.out.printf("Took %.3f seconds to use TDoubleObjectHashMap<TDoubleIntHashMap>%n", time / 1e9);
  return counts;
}

prints

Took 3.023 seconds to use TDoubleObjectHashMap<TDoubleIntHashMap>
Took 7.970 seconds to use Map<String, Short>
