Splitting a java.util.stream.Stream

Question

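I have a text file that contains URLs and emails. I need to extract all of them from the file. Each URL and email can occur more than once, but the result should not contain duplicates. I can extract all URLs with the following code: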

Files.lines(filePath)
    .map(urlPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();

I can extract all emails with the following code:

Files.lines(filePath)
    .map(emailPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();

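Can I extract all URLs and emails while reading the stream returned by Files.lines(filePath) only once? Something like splitting the stream of lines into a stream of URLs and a stream of emails.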

Answer 1

You can use the partitioningBy collector, though it is still not a very elegant solution.

Map<Boolean, List<String>> map = Files.lines(filePath)
        .filter(str -> urlPattern.matcher(str).matches() ||
                       emailPattern.matcher(str).matches())
        .distinct()
        .collect(Collectors.partitioningBy(str -> urlPattern.matcher(str).matches()));
List<String> urls = map.get(true);
List<String> emails = map.get(false);
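
To make the partitioning idea easy to try without a file, here is a minimal sketch that runs it over an in-memory list. The class name PartitionDemo and both regular expressions are assumptions made for illustration; the patterns are deliberately simplistic and are not the asker's urlPattern or emailPattern.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class PartitionDemo {
    // Simplified patterns, assumed for illustration only.
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // Keeps only lines that are entirely a URL or an email, drops duplicates,
    // then splits them: key true holds URLs, key false holds emails.
    static Map<Boolean, List<String>> partition(List<String> lines) {
        return lines.stream()
                .filter(s -> URL.matcher(s).matches() || EMAIL.matcher(s).matches())
                .distinct()
                .collect(Collectors.partitioningBy(s -> URL.matcher(s).matches()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "http://example.com", "alice@example.com",
                "http://example.com", "not a match");
        Map<Boolean, List<String>> result = partition(lines);
        System.out.println("URLs:   " + result.get(true));
        System.out.println("Emails: " + result.get(false));
    }
}
```

Note that matches() requires the whole line to be a URL or an email, so this variant only suits files with one item per line.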

If you don't want to apply the regexp twice, you can use an intermediate pair object (for example, SimpleEntry):

public static String classify(String str) {
    return urlPattern.matcher(str).matches() ? "url" : 
        emailPattern.matcher(str).matches() ? "email" : null;
}

Map<String, Set<String>> map = Files.lines(filePath)
        .map(str -> new AbstractMap.SimpleEntry<>(classify(str), str))
        .filter(e -> e.getKey() != null)
        .collect(Collectors.groupingBy(e -> e.getKey(),
            Collectors.mapping(e -> e.getValue(), Collectors.toSet())));

Using my free StreamEx library, the last step would be shorter:

Map<String, Set<String>> map = StreamEx.of(Files.lines(filePath))
        .mapToEntry(str -> classify(str), Function.identity())
        .nonNullKeys()
        .grouping(Collectors.toSet());

Answer 2

You can perform the matching within a Collector:

Map<String,Set<String>> map=Files.lines(filePath)
    .collect(HashMap::new,
        (hm,line)-> {
            Matcher m=emailPattern.matcher(line);
            if(m.matches())
              hm.computeIfAbsent("mail", x->new HashSet<>()).add(line);
            else if(m.usePattern(urlPattern).matches())
              hm.computeIfAbsent("url", x->new HashSet<>()).add(line);
        },
        (m1,m2)-> m2.forEach((k,v)->m1.merge(k, v,
                                     (s1,s2)->{s1.addAll(s2); return s1;}))
    );
Set<String> mail=map.get("mail"), url=map.get("url");

Note that this can easily be adapted to find multiple matches within a line:

Map<String,Set<String>> map=Files.lines(filePath)
    .collect(HashMap::new,
        (hm,line)-> {
            Matcher m=emailPattern.matcher(line);
            while(m.find())
              hm.computeIfAbsent("mail", x->new HashSet<>()).add(m.group());
            m.usePattern(urlPattern).reset();
            while(m.find())
              hm.computeIfAbsent("url", x->new HashSet<>()).add(m.group());
        },
        (m1,m2)-> m2.forEach((k,v)->m1.merge(k, v,
                                     (s1,s2)->{s1.addAll(s2); return s1;}))
    );
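
For anyone who wants to run the multi-match variant, the sketch below wraps it in a method that accepts any Stream<String>, so it can be tried without a file. The class name MultiMatchDemo, the method name extract, and both patterns are assumptions for illustration; TreeSet is used in place of HashSet only to get a stable iteration order.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

class MultiMatchDemo {
    // Simplified patterns, assumed for illustration only.
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // Consumes the stream once and collects every email and URL found on each line.
    static Map<String, Set<String>> extract(Stream<String> lines) {
        return lines.collect(HashMap::new,
            (hm, line) -> {
                Matcher m = EMAIL.matcher(line);
                while (m.find())
                    hm.computeIfAbsent("mail", k -> new TreeSet<>()).add(m.group());
                // Reuse the matcher with the other pattern; reset() rewinds to the line start.
                m.usePattern(URL).reset();
                while (m.find())
                    hm.computeIfAbsent("url", k -> new TreeSet<>()).add(m.group());
            },
            // Combiner, only used for parallel streams: merge the per-key sets.
            (m1, m2) -> m2.forEach((k, v) ->
                m1.merge(k, v, (s1, s2) -> { s1.addAll(s2); return s1; })));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> result = extract(Stream.of(
                "contact bob@example.org or see http://example.org/a and http://example.org/b",
                "bob@example.org again"));
        System.out.println(result.get("mail"));
        System.out.println(result.get("url"));
    }
}
```

If a key never matches, map.get returns null rather than an empty set, so callers may prefer getOrDefault with an empty set.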

Answer 3

Since you can't reuse a Stream, I think the only option is to "do it manually".

Files.lines(filePath).forEach(s -> { /* match and sort into two lists */ });

If there is another solution, though, I'd be happy to learn about it!
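
One way the "manual" approach might look is sketched below. The class name ManualSplitDemo, its fields, and both patterns are hypothetical, and sets are used rather than lists so that duplicates are dropped. Like the question's code, it takes only the first match per line.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ManualSplitDemo {
    // Simplified patterns, assumed for illustration only.
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // LinkedHashSet removes duplicates while keeping first-seen order.
    final Set<String> urls = new LinkedHashSet<>();
    final Set<String> emails = new LinkedHashSet<>();

    // Sorts the first URL and the first email found on the line into the two sets.
    void accept(String line) {
        Matcher m = URL.matcher(line);
        if (m.find()) urls.add(m.group());
        m = EMAIL.matcher(line);
        if (m.find()) emails.add(m.group());
    }

    public static void main(String[] args) {
        ManualSplitDemo demo = new ManualSplitDemo();
        // With a real file this would be Files.lines(filePath).forEach(demo::accept).
        Arrays.asList("see http://example.com/x", "mail carol@example.net",
                "see http://example.com/x").forEach(demo::accept);
        System.out.println(demo.urls);
        System.out.println(demo.emails);
    }
}
```

Because accept mutates shared sets, this sketch is only safe with a sequential stream.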

Answer 4

The overall question should be: why would you want to stream only once?

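Extracting the URLs and extracting the emails are different operations and should therefore be handled in their own stream operations. Even if the underlying stream source contains hundreds of thousands of records, the time spent iterating is negligible compared to the mapping and filtering operations.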

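The only thing you should consider as a possible performance issue is the IO operation. The cleanest solution is therefore to read the file only once and then stream over the resulting collection twice: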

List<String> allLines = Files.readAllLines(filePath);
allLines.stream() ... // here do the URLs
allLines.stream() ... // here do the emails

Of course, this requires some memory, since the whole file is held in a list.
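
A filled-in version of this read-once, stream-twice idea could look like the sketch below. The class name ReadOnceDemo, the helper firstMatches, the temporary file, and both patterns are assumptions for illustration; the pipeline inside the helper is the one from the question, so it still yields only the first match per line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ReadOnceDemo {
    // Simplified patterns, assumed for illustration only.
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // Returns the distinct first matches of the pattern, one per matching line.
    static List<String> firstMatches(List<String> lines, Pattern p) {
        return lines.stream()
                .map(p::matcher)
                .filter(Matcher::find)
                .map(Matcher::group)
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // A throwaway file stands in for the asker's filePath.
        Path file = Files.createTempFile("links", ".txt");
        try {
            Files.write(file, Arrays.asList(
                    "http://a.example/1", "dave@example.com", "http://a.example/1"));
            List<String> allLines = Files.readAllLines(file); // the only IO pass
            System.out.println(firstMatches(allLines, URL));
            System.out.println(firstMatches(allLines, EMAIL));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```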

Answer 1: score 10, accepted, 2015-05-13 11:05:14

Answer 2: score 4, 2015-05-13 11:26:39

Answer 3: score 1, 2015-05-13 11:04:15

Answer 4: score 0, 2015-05-13 11:10:34