Compare large lists and extract missing data

I have two very large ArrayLists, each containing millions of elements. I want to find the elements of List1 that are not present in List2, and/or vice versa.

I've tried Apache CollectionUtils and the Java 8 Stream API without any success.

A Java 8 parallel stream consumes all the CPU, and CollectionUtils keeps comparing the data sets without producing any output.

POJO sample:

public class DataVO {
 private String id;
 private String value;
 ...
 // getters / setters

 @Override
 public int hashCode() {
  final int prime = 31;
  int result = 1;
  result = (prime * result) + ((id == null) ? 0 : id.hashCode());
  return result;
 }

 @Override
 public boolean equals(final Object obj) {
  ...
  ...
  final DataVO other = (DataVO) obj;
  if (id == null) {
   if (other.id != null) {
    return false;
   }
  }
  else if (!id.equals(other.id)) {
   return false;
  }
  return true;
 }
}

hashCode() and equals() can include more fields; for now I've kept them simple.

I also tried splitting List1 into smaller chunks and comparing each chunk against List2, again without any result. I've looked at other questions, but none of them consider extremely large volumes.

Please let me know if you have any pointers.

You could read big chunks of the ArrayList into a HashSet, say 10k elements at a time. Make sure you pass an initial capacity to the HashSet constructor. Then, for each chunk, call HashSet#removeAll with the other ArrayList. The remaining entries are your answer. You might even parallelise this with a ThreadPoolExecutor.

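List<DataVO> missing = new ArrayList<>(); // answer: elements of list1 absent from list2

for (int i = 0; i < list1.size(); ) {
    int offset = i;
    i += 16 * 1024; // chunk size
    if (i > list1.size()) i = list1.size();
    // Copy the current chunk of list1 into a hash set so removals are cheap.
    Set<DataVO> chunk = new HashSet<>(list1.subList(offset, i));

    // Scan list2 once per chunk and drop every element it contains.
    for (int j = list2.size(); --j >= 0; ) chunk.remove(list2.get(j));
    missing.addAll(chunk); // whatever survives is missing from list2
}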
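
The ThreadPoolExecutor idea mentioned above could look roughly like the sketch below. It is only an illustration under a few assumptions: the class name ChunkedDiff, the method missingFrom and its chunkSize and threads parameters are made up for this example, both lists are assumed not to be modified while the tasks run, and the elements are assumed to have consistent equals()/hashCode(). The pool comes from Executors.newFixedThreadPool, which is backed by a ThreadPoolExecutor.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper for illustration; not part of the original answer.
class ChunkedDiff {

    // Returns the elements of source that do not occur in other.
    // Duplicates inside one chunk collapse, and result order within a chunk is unspecified.
    static <T> List<T> missingFrom(List<T> source, List<T> other, int chunkSize, int threads) {
        if (chunkSize <= 0 || threads <= 0) {
            throw new IllegalArgumentException("chunkSize and threads must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Set<T>>> pending = new ArrayList<>();
            for (int start = 0; start < source.size(); start += chunkSize) {
                final int from = start;
                final int to = Math.min(start + chunkSize, source.size());
                // One task per chunk: copy the chunk into a set, then strip
                // everything that also appears in the other list.
                Callable<Set<T>> task = () -> {
                    Set<T> chunk = new HashSet<>(source.subList(from, to));
                    for (T candidate : other) {
                        chunk.remove(candidate);
                    }
                    return chunk;
                };
                pending.add(pool.submit(task));
            }
            // Collect the survivors of every chunk in submission order.
            List<T> missing = new ArrayList<>();
            for (Future<Set<T>> result : pending) {
                missing.addAll(result.get());
            }
            return missing;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while comparing lists", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("chunk comparison failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as ChunkedDiff.missingFrom(list1, list2, 16 * 1024, 4) would give the entries of list1 that list2 lacks; swapping the two list arguments gives the other direction. Every task still scans the whole of the other list, so the total work is unchanged and only spread across threads.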
