Complex custom Collector with Java 8

I have a stream of objects that I would like to collect in the following way.

Let's say we are handling forum posts:

class Post {
    private Date time;
    private Data data;
}

I want to create a list that groups posts by period. If there were no posts for X minutes, a new group is created.

class PostsGroup{
    List<Post> posts = new ArrayList<> ();
}

I want to get a List<PostsGroup> containing the posts grouped by the interval.

Example: interval of 10 minutes.

Posts:

[{time:x, data:{}}, {time:x + 3, data:{}}, {time:x + 12, data:{}}, {time:x + 45, data:{}}]

I want to get a list of post groups:

[
 {posts : [{time:x, data:{}}, {time:x + 3, data:{}}, {time:x + 12, data:{}}]},
 {posts : [{time:x + 45, data:{}}]}
]
  • Notice that the first group lasted until X + 22. Then a new post was received at X + 45.

Is this possible?

This problem can be easily solved using the groupRuns method of my StreamEx library:

long MAX_INTERVAL = TimeUnit.MINUTES.toMillis(10);
StreamEx.of(posts)
        .groupRuns((p1, p2) -> p2.time.getTime() - p1.time.getTime() <= MAX_INTERVAL)
        .map(PostsGroup::new)
        .toList();

I assume that you have a constructor:

class PostsGroup {
    private List<Post> posts;

    public PostsGroup(List<Post> posts) {
        this.posts = posts;
    }
}

The StreamEx.groupRuns method takes a BiPredicate which is applied to two adjacent input elements and returns true if they must be grouped together. It produces a stream of lists where each list represents one group. This method is lazy and works fine with parallel streams.

You need to retain state between stream entries and write yourself a grouping classifier. Something like this would be a good start.
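
For readers who cannot add a dependency, the idea of grouping adjacent runs can be sketched with a plain loop. This is only an illustration of the technique, not the StreamEx implementation: the class RunGrouping and its helper groupRuns are hypothetical names, the input is assumed to be already ordered by time, and the sketch is eager and sequential (it has neither the laziness nor the parallel support of the library method).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

class RunGrouping {

    // Hypothetical helper, not part of StreamEx: splits an already ordered list into runs.
    static <T> List<List<T>> groupRuns(List<T> items, BiPredicate<T, T> sameGroup) {
        List<List<T>> groups = new ArrayList<>();
        List<T> current = null;
        T previous = null;
        for (T item : items) {
            // Start a new run for the first element, or when the adjacent pair fails the predicate.
            if (current == null || !sameGroup.test(previous, item)) {
                current = new ArrayList<>();
                groups.add(current);
            }
            current.add(item);
            previous = item;
        }
        return groups;
    }

    public static void main(String[] args) {
        // Post times in minutes, taken from the question's example with x = 0.
        List<Long> minutes = Arrays.asList(0L, 3L, 12L, 45L);
        long maxGap = 10;
        System.out.println(groupRuns(minutes, (a, b) -> b - a <= maxGap)); // [[0, 3, 12], [45]]
    }
}
```

The predicate compares each element only with its immediate predecessor, which is why the post at x + 12 still joins the first group even though it is more than 10 minutes after the first post.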

class Post {

    private final long time;
    private final String data;

    public Post(long time, String data) {
        this.time = time;
        this.data = data;
    }

    @Override
    public String toString() {
        return "Post{" + "time=" + time + ", data=" + data + '}';
    }

}

public void test() {
    System.out.println("Hello");
    long t = 0;
    List<Post> posts = Arrays.asList(
            new Post(t, "One"),
            new Post(t + 1000, "Two"),
            new Post(t + 10000, "Three")
    );
    // Group every 5 seconds.
    Map<Long, List<Post>> grouped = posts
            .stream()
            .collect(Collectors.groupingBy(new ClassifyByTimeBetween(5000)));
    grouped.forEach((key, group) -> System.out.println(key + " -> " + group));

}

class ClassifyByTimeBetween implements Function<Post, Long> {

    final long delay;
    long currentGroupBy = -1;
    long lastDateSeen = -1;

    public ClassifyByTimeBetween(long delay) {
        this.delay = delay;
    }

    @Override
    public Long apply(Post p) {
        if (lastDateSeen >= 0) {
            if (p.time > lastDateSeen + delay) {
                // Grab this one.
                currentGroupBy = p.time;
            }
        } else {
            // First time - start there.
            currentGroupBy = p.time;
        }
        lastDateSeen = p.time;
        return currentGroupBy;
    }

}
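
One caveat with the approach above: the single-argument groupingBy collects into a map that makes no promise about iteration order, so the groups may not print chronologically. A minimal sketch of one way around this, assuming a sequential stream that is already sorted by time, is to pass LinkedHashMap::new as the map factory. The class OrderedGroupingSketch and the helper byGap are hypothetical names, and raw Long timestamps stand in for Post to keep the example short.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class OrderedGroupingSketch {

    // Stateful classifier in the spirit of ClassifyByTimeBetween, reduced to raw timestamps.
    // Assumption: it is only used on a sequential stream sorted by time.
    static Function<Long, Long> byGap(long delay) {
        return new Function<Long, Long>() {
            private long groupKey = -1;
            private long lastSeen = -1;

            @Override
            public Long apply(Long time) {
                if (lastSeen < 0 || time > lastSeen + delay) {
                    groupKey = time; // first element, or the gap was too long: open a new group
                }
                lastSeen = time;
                return groupKey;
            }
        };
    }

    static Map<Long, List<Long>> group(List<Long> times, long delay) {
        // LinkedHashMap keeps the groups in the order in which their keys first appeared.
        return times.stream()
                .collect(Collectors.groupingBy(byGap(delay), LinkedHashMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList(0L, 1000L, 10000L), 5000)); // {0=[0, 1000], 10000=[10000]}
    }
}
```

Because the classifier carries mutable state, it should not be shared between streams or used with parallel() in this form.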

Since no one has provided a solution with a custom collector, as required in the original problem statement, here is a collector implementation that groups Post objects based on the provided time interval.

The Date class mentioned in the question has been obsolete since Java 8 and is not recommended for use in new projects. Hence, LocalDateTime is used instead.

Post & PostsGroup

For testing purposes, I've implemented Post as a Java 16 record (if you substitute it with a class, the overall solution will be fully compliant with Java 8):

public record Post(LocalDateTime dateTime) {}

Also, I've enhanced the PostsGroup object. My idea is that it should be capable of deciding whether the offered Post should be added to the list of posts or rejected, as the Information Expert principle suggests (in short: all manipulations with the data should happen only inside the class to which that data belongs).

To facilitate this functionality, two extra fields were added. The first is interval, of type Duration from the java.time package, which represents the maximum interval between the earliest post and the latest post in a group. The second is intervalBound, of type LocalDateTime, which gets initialized when the first post is added and is later used internally by the method isWithinInterval() to check whether the offered post fits into the interval.

public class PostsGroup {
    private Duration interval;
    private LocalDateTime intervalBound;
    private List<Post> posts = new ArrayList<>();
    
    public PostsGroup(Duration interval) {
        this.interval = interval;
    }
    
    public boolean tryAdd(Post post) {
        if (posts.isEmpty()) {
            intervalBound = post.dateTime().plus(interval);
            return posts.add(post);
        } else if (isWithinInterval(post)) {
            return posts.add(post);
        }
        return false;
    }
    
    public boolean isWithinInterval(Post post) {
        return post.dateTime().isBefore(intervalBound);
    }
    
    @Override
    public String toString() {
        return "PostsGroup{" + posts + '}';
    }
}

I'm making two assumptions:

  • All posts in the source are sorted by time (if that is not the case, you should introduce a sorted() operation in the pipeline before collecting the results);
  • Posts need to be collected into the minimum number of groups; as a consequence, it's not possible to split this task and execute the stream in parallel.
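
If the first assumption does not hold, the sorting step could look roughly like the sketch below. The class SortBeforeCollecting is a hypothetical wrapper, and its nested Post is a plain-class stand-in for the record used in this answer so that the snippet compiles on Java 8; in real use the terminal collect(Collectors.toList()) would be replaced by the grouping collector.

```java
import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class SortBeforeCollecting {

    // Stand-in for the Post type of this answer, written as a plain class (assumption).
    static final class Post {
        private final LocalDateTime dateTime;

        Post(LocalDateTime dateTime) {
            this.dateTime = dateTime;
        }

        LocalDateTime dateTime() {
            return dateTime;
        }
    }

    static List<Post> sortedByTime(List<Post> posts) {
        return posts.stream()
                .sorted(Comparator.comparing(Post::dateTime)) // oldest post first
                .collect(Collectors.toList());                // swap in the grouping collector here
    }

    public static void main(String[] args) {
        List<Post> unsorted = Arrays.asList(
                new Post(LocalDateTime.of(2022, 4, 28, 15, 12)),
                new Post(LocalDateTime.of(2022, 4, 28, 15, 0)),
                new Post(LocalDateTime.of(2022, 4, 28, 15, 3)));
        sortedByTime(unsorted).forEach(p -> System.out.println(p.dateTime()));
    }
}
```

Note that sorted() is a stateful intermediate operation that buffers the whole stream, so it gives up any benefit of lazy processing for large inputs.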

Building a Custom Collector

We can create a custom collector either inline, by using one of the versions of the static method Collector.of(), or by defining a class that implements the Collector interface.

These parameters have to be provided while creating a custom collector:

  • Supplier Supplier<A> is meant to provide a mutable container which stores the elements of the stream. In this case, ArrayDeque (as an implementation of the Deque interface) is handy as a container because it gives convenient access to the most recently added element, i.e. the latest PostsGroup.

  • Accumulator BiConsumer<A,T> defines how to add elements into the container provided by the supplier. For this task, we need to provide the logic that determines whether the next element from the stream (i.e. the next Post) should go into the last PostsGroup in the Deque, or whether a new PostsGroup needs to be allocated for it.

  • Combiner BinaryOperator<A> combiner() establishes a rule on how to merge two containers obtained while executing the stream in parallel. Since this operation is treated as not parallelizable, the combiner is implemented to throw an AssertionError in case of parallel execution.

  • Finisher Function<A,R> is meant to produce the final result by transforming the mutable container. The finisher function in the code below turns the container, a deque containing the result, into an immutable list.

Note: the Java 16 method Stream.toList() is used inside the finisher function. On Java 10 and later it can be replaced with collect(Collectors.toUnmodifiableList()); on Java 8, use collect(Collectors.toList()), which does not guarantee an immutable result.

  • Characteristics allow providing additional information. For instance, Collector.Characteristics.UNORDERED denotes that the collector does not commit to preserving the encounter order of the elements. In this case, the collector doesn't require any characteristics.

The method below is responsible for generating the collector based on the provided interval.

public static Collector<Post, ?, List<PostsGroup>> groupPostsByInterval(Duration interval) {
    
    return Collector.of(
        ArrayDeque::new,
        (Deque<PostsGroup> deque, Post post) -> {
            if (deque.isEmpty() || !deque.getLast().tryAdd(post)) { // if no groups have been created yet or if adding the post into the most recent group fails
                PostsGroup postsGroup = new PostsGroup(interval);
                postsGroup.tryAdd(post);
                deque.addLast(postsGroup);
            }
        },
        (Deque<PostsGroup> left, Deque<PostsGroup> right) -> { throw new AssertionError("should not be used in parallel"); },
        (Deque<PostsGroup> deque) -> deque.stream().toList()); // Java 16+: Stream.toList() returns an unmodifiable list
}
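
The other option mentioned earlier, a class that implements the Collector interface, might look like the sketch below. It is an illustration under stated assumptions rather than a drop-in replacement: the class name IntervalGroupingCollector is hypothetical, it groups bare LocalDateTime values into lists instead of Post objects into PostsGroup instances to stay self-contained, and it uses only Java 8 APIs. The grouping rule mirrors tryAdd(): a value joins the latest group only if it is strictly before the first element of that group plus the interval.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;
import java.util.stream.Stream;

// Hypothetical class name; expects a sequential stream sorted by time.
class IntervalGroupingCollector
        implements Collector<LocalDateTime, Deque<List<LocalDateTime>>, List<List<LocalDateTime>>> {

    private final Duration interval;

    IntervalGroupingCollector(Duration interval) {
        this.interval = interval;
    }

    @Override
    public Supplier<Deque<List<LocalDateTime>>> supplier() {
        return ArrayDeque::new;
    }

    @Override
    public BiConsumer<Deque<List<LocalDateTime>>, LocalDateTime> accumulator() {
        return (deque, time) -> {
            // Open a new group when there is none yet, or when `time` is not before
            // the bound (first element of the latest group + interval).
            if (deque.isEmpty() || !time.isBefore(deque.getLast().get(0).plus(interval))) {
                deque.addLast(new ArrayList<>());
            }
            deque.getLast().add(time);
        };
    }

    @Override
    public BinaryOperator<Deque<List<LocalDateTime>>> combiner() {
        return (left, right) -> {
            throw new UnsupportedOperationException("sequential streams only");
        };
    }

    @Override
    public Function<Deque<List<LocalDateTime>>, List<List<LocalDateTime>>> finisher() {
        return deque -> Collections.unmodifiableList(new ArrayList<>(deque));
    }

    @Override
    public Set<Characteristics> characteristics() {
        return Collections.emptySet(); // no special characteristics, as in the inline version
    }

    public static void main(String[] args) {
        LocalDateTime start = LocalDateTime.of(2022, 4, 28, 15, 0);
        List<List<LocalDateTime>> groups = Stream.of(0L, 3L, 8L, 12L, 27L)
                .map(start::plusMinutes)
                .collect(new IntervalGroupingCollector(Duration.ofMinutes(10)));
        groups.forEach(System.out::println); // prints three groups, one per line
    }
}
```

The class form is more verbose than Collector.of(), but it gives the collector a name and makes each of the components described above explicit.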

main() - demo

public static void main(String[] args) {
    List<Post> posts =
        List.of(new Post(LocalDateTime.of(2022,4,28,15,0)),
                new Post(LocalDateTime.of(2022,4,28,15,3)),
                new Post(LocalDateTime.of(2022,4,28,15,5)),
                new Post(LocalDateTime.of(2022,4,28,15,8)),
                new Post(LocalDateTime.of(2022,4,28,15,12)),
                new Post(LocalDateTime.of(2022,4,28,15,15)),
                new Post(LocalDateTime.of(2022,4,28,15,18)),
                new Post(LocalDateTime.of(2022,4,28,15,27)),
                new Post(LocalDateTime.of(2022,4,28,15,48)),
                new Post(LocalDateTime.of(2022,4,28,15,54)));
    
    Duration interval = Duration.ofMinutes(10);

    List<PostsGroup> postsGroups = posts.stream()
        .collect(groupPostsByInterval(interval));
    
    postsGroups.forEach(System.out::println);
}

Output:

PostsGroup{[Post[dateTime=2022-04-28T15:00], Post[dateTime=2022-04-28T15:03], Post[dateTime=2022-04-28T15:05], Post[dateTime=2022-04-28T15:08]]}
PostsGroup{[Post[dateTime=2022-04-28T15:12], Post[dateTime=2022-04-28T15:15], Post[dateTime=2022-04-28T15:18]]}
PostsGroup{[Post[dateTime=2022-04-28T15:27]]}
PostsGroup{[Post[dateTime=2022-04-28T15:48], Post[dateTime=2022-04-28T15:54]]}

You can also play around with this Online Demo.
