Java: How to filter a big collection of objects with a big collection of predicates?

In Java, I have a big collection of objects (~10,000 objects), say Set<Person> cityInhabitants. I also have a big collection of predicates (~1,000 predicates) that should be used to filter out any Person matching any of these predicates. Predicates could be, for example:

  • person.getName().equals("ugly name1")
  • person.getName().equals("ugly name2")
  • person.getAge() < 18

This requirement raises the following challenges:

  • the filtering shall be fast
  • the predicates are "business defined", so it must be easy to add and remove them. That means the predicates probably shouldn't be hard-coded in source code but rather maintained in a database (?)

What are solutions to these challenges? Are there any libraries that can help here?

I would suggest you sort the predicates by speed of execution. You can then run them in that order, fastest first, which generally means the slower predicates will have to run over a smaller set.

However, this assumption is not completely correct: for each predicate you would need to work out the percentage of objects it removes relative to its speed of execution. Then you can see which fast predicates remove the highest percentage of objects, and execute the predicates in that order to be most optimal.
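The ordering idea above can be sketched as follows. This is only an illustration under stated assumptions: the Rule class, its costNanos and removalFraction fields, and the "cost per removed object" ranking are hypothetical names and a common heuristic, not something the answer prescribes, and the two statistics are assumed to have been measured elsewhere (for example on a sample of the data).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class PredicateOrderingSketch {

    /** A predicate plus two assumed, externally measured statistics. */
    static final class Rule<T> {
        final Predicate<T> test;
        final double costNanos;       // assumed average cost of one evaluation
        final double removalFraction; // assumed fraction of objects it matches (0..1)

        Rule(Predicate<T> test, double costNanos, double removalFraction) {
            this.test = test;
            this.costNanos = costNanos;
            this.removalFraction = removalFraction;
        }

        /** Lower is better: time spent per object removed. */
        double rank() {
            return costNanos / Math.max(removalFraction, 1e-9);
        }
    }

    /** Returns a copy of the rules, cheapest per removed object first. */
    static <T> List<Rule<T>> order(List<Rule<T>> rules) {
        List<Rule<T>> ordered = new ArrayList<>(rules);
        ordered.sort((a, b) -> Double.compare(a.rank(), b.rank()));
        return ordered;
    }

    /** Keeps only the items that no rule matches, trying rules in ranked order. */
    static <T> List<T> filter(List<T> items, List<Rule<T>> rules) {
        List<Rule<T>> ordered = order(rules);
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            boolean rejected = false;
            for (Rule<T> rule : ordered) {
                if (rule.test.test(item)) {
                    rejected = true; // first match is enough, skip the rest
                    break;
                }
            }
            if (!rejected) {
                kept.add(item);
            }
        }
        return kept;
    }
}
```

The ranking only changes how quickly a matching object is rejected; the set of surviving objects is the same for any order.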

You can easily implement your own predicate interface:

public interface Predicate<T> {

    boolean filter(T object);

}

You would then need to create a predicate object for each of the rules. You can write some more dynamic classes for age and name checking, which will also reduce the amount of code you need.

public class AgeCheck implements Predicate<Person> {

    private final int min;
    private final int max;

    public AgeCheck(int min, int max) {
        this.min = min;
        this.max = max;
    }

    // Returns true when the person's age lies strictly between min and max.
    @Override
    public boolean filter(Person person) {
        return person.getAge() > min && person.getAge() < max;
    }

}
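As a rough sketch of how such predicate objects might be applied to the set from the question: the Person class, NameCheck, UnderAgeCheck, and removeMatching below are hypothetical names introduced for illustration, and the nested Predicate interface simply mirrors the one defined above. Collapsing all name-equality rules into a single hash lookup is an assumption about what "more dynamic classes" could look like, not the answer's stated design.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PredicateFilterSketch {

    /** Same shape as the interface above: true means "filter this object away". */
    interface Predicate<T> {
        boolean filter(T object);
    }

    /** Hypothetical minimal Person; the real class is not shown in the question. */
    static final class Person {
        private final String name;
        private final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }
    }

    /** Replaces any number of name-equality rules with one hash lookup. */
    static final class NameCheck implements Predicate<Person> {
        private final Set<String> rejectedNames;

        NameCheck(Set<String> rejectedNames) {
            this.rejectedNames = new HashSet<>(rejectedNames);
        }

        @Override
        public boolean filter(Person person) {
            return rejectedNames.contains(person.getName());
        }
    }

    /** Matches people younger than the given limit, e.g. getAge() < 18. */
    static final class UnderAgeCheck implements Predicate<Person> {
        private final int limit;

        UnderAgeCheck(int limit) { this.limit = limit; }

        @Override
        public boolean filter(Person person) { return person.getAge() < limit; }
    }

    /** Removes, in place, every person matched by at least one predicate. */
    static void removeMatching(Set<Person> people, List<Predicate<Person>> predicates) {
        people.removeIf(person -> {
            for (Predicate<Person> predicate : predicates) {
                if (predicate.filter(person)) {
                    return true; // short-circuit on the first match
                }
            }
            return false;
        });
    }
}
```

With this shape, a thousand "name equals X" rules loaded from a database would become one NameCheck, so the cost per person no longer grows with the number of name rules.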

In this situation there is not much you can do about the complexity of the operation itself. If there are many entries, many predicates, and the predicates are expensive, you can optimize as much as you like, but you will not get below a certain threshold, because each single operation may be expensive.

You should test different approaches and see which performs better in your specific situation:

  • sort predicates so that the widest ones are checked first (so that the first predicates filter out as many entries as possible)
  • sort predicates according to their complexity (so that the faster ones are executed first and the slower ones on fewer entries)
  • don't update the original data structure but keep a parallel set that contains the filtered elements, versus
  • always update the data structure so that you loop over fewer people every time
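The last two options in the list can be compared with a sketch like the one below. It uses the standard java.util.function.Predicate for brevity; the class and method names are made up for this example, and which variant is faster would have to be measured on the real data.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class FilterStrategiesSketch {

    /** Variant A: leave the input untouched and collect survivors in a parallel set. */
    static <T> Set<T> filterIntoNewSet(Set<T> items, List<Predicate<T>> rejectRules) {
        Set<T> kept = new HashSet<>();
        for (T item : items) {
            boolean rejected = false;
            for (Predicate<T> rule : rejectRules) {
                if (rule.test(item)) {
                    rejected = true;
                    break;
                }
            }
            if (!rejected) {
                kept.add(item);
            }
        }
        return kept;
    }

    /** Variant B: shrink the set in place, so each later rule sees fewer items. */
    static <T> void filterInPlace(Set<T> items, List<Predicate<T>> rejectRules) {
        for (Predicate<T> rule : rejectRules) {
            items.removeIf(rule);
            if (items.isEmpty()) {
                return; // nothing left to check
            }
        }
    }
}
```

Both variants produce the same survivors; they differ in whether the original set is preserved and in the loop order (per item versus per rule).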

Here's an alternative: identify all possible attributes that an instance of a class might have. In your example, you have a Person class with two attributes: name and age. Because you have getters for these attributes, it's likely that a person can have at most two attributes (unless there are other getters that you didn't mention). You could instead have implemented Person so that the attributes are held in a collection, so that there is really no limit on the number of attributes. Regardless of how it is implemented, identify all the attributes.

Now associate a unique prime number with each attribute, and for each instance of Person maintain the product of the prime numbers corresponding to the attributes assigned to that person. For example, assume a person can be young or old, male or female, good looking or bad looking. That's 6 attributes; let's assign prime numbers as follows:

02: young
03: old
05: male
07: female
11: good looking
13: bad looking

Continuing the example, assume a person is a good looking, young female. The product of the prime numbers would be 2 × 7 × 11, or 154.

Now you want to find all good looking young people, regardless of gender. The product of primes associated with this predicate is 2 × 11, or 22.

So you can now iterate through all your people, and if the product of primes associated with a person can be divided by 22 without any remainder (as it can when the person's product of primes is 154), then you have a match.

You might want to use the BigInteger class to perform the multiplication, division, and storing of the product of primes.
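A minimal sketch of the prime-product check, using java.math.BigInteger and the prime assignments from the table above. The class, constant, and method names are invented for this example. The hasAny method is an addition not described in the answer: since the question asks to reject a person matching any predicate, a gcd test is one way to express "shares at least one attribute".

```java
import java.math.BigInteger;

class PrimeSignatureSketch {

    // Prime assignments taken from the table above.
    static final BigInteger YOUNG = BigInteger.valueOf(2);
    static final BigInteger OLD = BigInteger.valueOf(3);
    static final BigInteger MALE = BigInteger.valueOf(5);
    static final BigInteger FEMALE = BigInteger.valueOf(7);
    static final BigInteger GOOD_LOOKING = BigInteger.valueOf(11);
    static final BigInteger BAD_LOOKING = BigInteger.valueOf(13);

    /** Multiplies the primes of all given attributes into one signature. */
    static BigInteger signature(BigInteger... attributePrimes) {
        BigInteger product = BigInteger.ONE;
        for (BigInteger prime : attributePrimes) {
            product = product.multiply(prime);
        }
        return product;
    }

    /** True when the person has every attribute in the query (divides without remainder). */
    static boolean hasAll(BigInteger personSignature, BigInteger querySignature) {
        return personSignature.mod(querySignature).signum() == 0;
    }

    /** True when the person shares at least one attribute with the query (gcd > 1). */
    static boolean hasAny(BigInteger personSignature, BigInteger querySignature) {
        return !personSignature.gcd(querySignature).equals(BigInteger.ONE);
    }
}
```

For the good looking, young female from the example, signature(YOUNG, FEMALE, GOOD_LOOKING) is 154, and hasAll(154, 22) holds because 154 = 22 × 7.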

This solution is very fast if you are given a person and asked whether it matches all the predicates (again, the predicates have been reduced to unique prime numbers, and the collection of predicates is now represented by the product of those prime numbers).

This solution may not be so fast if you have to iterate over your entire collection of people looking for a match.

(I hadn't realized this question was 2 years old. I'm so late to this party! It would be good to know what solution the author ended up using.)

Are there any libraries that can help here? Well, sure there are!

Your data collection isn't very large, but you have a disproportionately large number of predicates. Plus, you want these predicates to be managed by your users, stored centrally, etc. This sounds like a good fit for Drools, which is a rules engine and comes with additional tooling to author, validate, and store such rules.

But Drools can be large and involved. Perhaps you need something simpler? Your code sample, and your first requirement of speed, made me think of CQEngine, which is a library for indexing objects. It indexes fields (such as your name field) and can search these fields in various ways (equals, starts with, contains, etc.). It is fast and it is simple, but it can only index. You would have to come up with rule definitions etc. yourself. On the other hand, CQEngine supports logical predicates, so you can chain your predicates together with and/or.

And there are other libraries for rules engines or object indexing. I'm sure other people will list them in their answers.
