Return count and list of sentences where word appears using Java Streams

I'm stuck trying to work out which sentences each word appears in. The input would be a list of sentences:

Question, what kind of wine is best? 
White wine.
A question

and the output would be:

// format would be: word:{count: sentence1, sentence2,...}
a:{1:3} 
wine:{2:1,2} 
best:{1:1} 
is:{1:1} 
kind:{1:1} 
of:{1:1} 
question:{2:1,3} 
what:{1:1}
white:{1:2}

This is what I have so far:

static void getFrequency(List<String> inputLines) {
  List<String> list = inputLines.stream()
     .map(w -> w.split("[^a-zA-Z0-9]+"))
     .flatMap(Arrays::stream)
     .map(String::toLowerCase)
     .collect(Collectors.toList());

   Map<String, Integer> wordCounter = list.stream()
     .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));
}

With that I only get the number of times each word appears across all the sentences, but I also need the list of sentences in which the word appears. It looks like I could use IntStream.range to get the sentence ids, something like this:

 IntStream.range(1, inputLines.size())
          .mapToObj(i -> inputLines.get(i));

But I'm not sure whether that is the best way to do it. I'm new to Java.

You can use a grouping collector to build a map from each word to the list of sentence numbers in which it appears. Here's an example:

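import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

private static Map<String, List<Integer>> getFrequency(List<String> inputLines) {
    return IntStream.range(0, inputLines.size())
            // pair every lowercased word with its 1-based sentence number
            .mapToObj(line -> Arrays.stream(inputLines.get(line).split("[^a-zA-Z0-9]+"))
                    .map(word -> new SimpleEntry<>(word.toLowerCase(), line + 1)))
            .flatMap(Function.identity())
            .collect(Collectors.groupingBy(Entry::getKey,
                    Collectors.mapping(Entry::getValue, Collectors.toList())));
}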

With your test data, I get:

{a=[3], what=[1], white=[2], question=[1, 3], kind=[1], 
 of=[1], best=[1], is=[1], wine=[1, 2]}

The count is easy to infer from the size of each list, so there should be no need for an additional class.
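
If you want the output in the exact word:{count:sentences} format from the question, something along these lines might work. This is only a sketch under a few assumptions of mine: the class name WordIndex and the methods index and format are made up for illustration, a word repeated within one sentence is listed only once for that sentence (hence the distinct() call), and a TreeMap is used so the words come out in alphabetical order, which the question did not ask for.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper class, not part of the original answer.
class WordIndex {

    // Maps each lowercased word to the 1-based numbers of the sentences containing it.
    static Map<String, List<Integer>> index(List<String> sentences) {
        return IntStream.range(0, sentences.size())
                .boxed()
                .flatMap(i -> Arrays.stream(sentences.get(i).split("[^a-zA-Z0-9]+"))
                        .filter(w -> !w.isEmpty())   // split can yield a leading empty token
                        .map(String::toLowerCase)
                        .distinct()                  // assumption: list each sentence once per word
                        .map(w -> new SimpleEntry<>(w, i + 1)))
                .collect(Collectors.groupingBy(Map.Entry::getKey, TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Renders one line per word as word:{count:sentence1,sentence2,...}.
    static String format(Map<String, List<Integer>> index) {
        return index.entrySet().stream()
                .map(e -> e.getKey() + ":{" + e.getValue().size() + ":"
                        + e.getValue().stream()
                                .map(String::valueOf)
                                .collect(Collectors.joining(","))
                        + "}")
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Question, what kind of wine is best?",
                "White wine.",
                "A question");
        System.out.println(format(index(sentences)));
    }
}
```

With the three sentences from the question, the line for wine comes out as wine:{2:1,2}. If you would rather count every occurrence, including repeats inside one sentence, dropping the distinct() call should do that.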
