
Word frequency count in Java 8

How can I count the frequency of the words of a List in Java 8?

List<String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");

The result must be:

{ciao=2, hello=1, bye=2}

I want to share the solution I found, because at first I expected to use map-and-reduce methods, but it turned out to be a bit different.

Map<String,Long> collect = wordsList.stream()
    .collect( Collectors.groupingBy( Function.identity(), Collectors.counting() ));

Or, for Integer values:

Map<String,Integer> collect = wordsList.stream()
     .collect( Collectors.groupingBy( Function.identity(), Collectors.summingInt(e -> 1) ));

EDIT

I'm adding how to sort the map by value:

LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
            .stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .collect(Collectors.toMap(
                    Map.Entry::getKey,
                    Map.Entry::getValue,
                    (v1, v2) -> {
                        // The keys are already distinct, so this merge function is never called.
                        throw new IllegalStateException();
                    },
                    LinkedHashMap::new
            ));

(NOTE: See the edits below)

As an alternative to Mounas' answer, here is an approach that does the word count in parallel:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelWordCount
{
    public static void main(String[] args)
    {
        List<String> list = Arrays.asList(
            "hello", "bye", "ciao", "bye", "ciao");
        Map<String, Integer> counts = list.parallelStream().
            collect(Collectors.toConcurrentMap(
                w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }
}

EDIT

In response to the comment, I ran a small test with JMH, comparing the toConcurrentMap and the groupingByConcurrent approaches with different input list sizes and random words of different lengths. This test suggested that the toConcurrentMap approach was faster. Considering how different these approaches are "under the hood", it's hard to predict something like this.

As a further extension, based on further comments, I extended the test to cover all four combinations of toMap, groupingBy, serial, and parallel.

The results still show that the toMap approach is faster, but unexpectedly (at least for me) the "concurrent" versions in both cases are slower than the serial versions...:

             (method)  (count) (wordLength)  Mode  Cnt     Score    Error  Units
      toConcurrentMap     1000            2  avgt   50   146,636 ±  0,880  us/op
      toConcurrentMap     1000            5  avgt   50   272,762 ±  1,232  us/op
      toConcurrentMap     1000           10  avgt   50   271,121 ±  1,125  us/op
                toMap     1000            2  avgt   50    44,396 ±  0,541  us/op
                toMap     1000            5  avgt   50    46,938 ±  0,872  us/op
                toMap     1000           10  avgt   50    46,180 ±  0,557  us/op
           groupingBy     1000            2  avgt   50    46,797 ±  1,181  us/op
           groupingBy     1000            5  avgt   50    68,992 ±  1,537  us/op
           groupingBy     1000           10  avgt   50    68,636 ±  1,349  us/op
 groupingByConcurrent     1000            2  avgt   50   231,458 ±  0,658  us/op
 groupingByConcurrent     1000            5  avgt   50   438,975 ±  1,591  us/op
 groupingByConcurrent     1000           10  avgt   50   437,765 ±  1,139  us/op
      toConcurrentMap    10000            2  avgt   50   712,113 ±  6,340  us/op
      toConcurrentMap    10000            5  avgt   50  1809,356 ±  9,344  us/op
      toConcurrentMap    10000           10  avgt   50  1813,814 ± 16,190  us/op
                toMap    10000            2  avgt   50   341,004 ± 16,074  us/op
                toMap    10000            5  avgt   50   535,122 ± 24,674  us/op
                toMap    10000           10  avgt   50   511,186 ±  3,444  us/op
           groupingBy    10000            2  avgt   50   340,984 ±  6,235  us/op
           groupingBy    10000            5  avgt   50   708,553 ±  6,369  us/op
           groupingBy    10000           10  avgt   50   712,858 ± 10,248  us/op
 groupingByConcurrent    10000            2  avgt   50   901,842 ±  8,685  us/op
 groupingByConcurrent    10000            5  avgt   50  3762,478 ± 21,408  us/op
 groupingByConcurrent    10000           10  avgt   50  3795,530 ± 32,096  us/op

I'm not very experienced with JMH; maybe I did something wrong here - suggestions and corrections are welcome:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class ParallelWordCount
{

    @Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
    public String method;

    @Param({"2", "5", "10"})
    public int wordLength;

    @Param({"1000", "10000" })
    public int count;

    private List<String> list;

    @Setup
    public void initList()
    {
         list = createRandomStrings(count, wordLength, new Random(0));
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public void testMethod(Blackhole bh)
    {

        if (method.equals("toMap"))
        {
            Map<String, Integer> counts =
                list.stream().collect(
                    Collectors.toMap(
                        w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        }
        else if (method.equals("toConcurrentMap"))
        {
            Map<String, Integer> counts =
                list.parallelStream().collect(
                    Collectors.toConcurrentMap(
                        w -> w, w -> 1, Integer::sum));
            bh.consume(counts);
        }
        else if (method.equals("groupingBy"))
        {
            Map<String, Long> counts =
                list.stream().collect(
                    Collectors.groupingBy(
                        Function.identity(), Collectors.<String>counting()));
            bh.consume(counts);
        }
        else if (method.equals("groupingByConcurrent"))
        {
            Map<String, Long> counts =
                list.parallelStream().collect(
                    Collectors.groupingByConcurrent(
                        Function.identity(), Collectors.<String> counting()));
            bh.consume(counts);
        }
    }

    private static String createRandomString(int length, Random random)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++)
        {
            int c = random.nextInt(26);
            sb.append((char) (c + 'a'));
        }
        return sb.toString();
    }

    private static List<String> createRandomStrings(
        int count, int length, Random random)
    {
        List<String> list = new ArrayList<String>(count);
        for (int i = 0; i < count; i++)
        {
            list.add(createRandomString(length, random));
        }
        return list;
    }
}

The times are only similar for the serial case of a list with 10000 elements and 2-letter words.

It could be worthwhile to check whether, for even larger list sizes, the concurrent versions eventually outperform the serial ones, but I currently don't have the time for another detailed benchmark run with all these configurations.

Find the most frequent item in a collection, with generics:

private <V> V findMostFrequentItem(final Collection<V> items)
{
  return items.stream()
      .filter(Objects::nonNull)
      .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
      .entrySet()
      .stream()
      .max(Comparator.comparing(Entry::getValue))
      .map(Entry::getKey)
      .orElse(null);
}
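A minimal usage sketch of this helper (the wrapper class and the sample data are my own, for illustration; List.of requires Java 9+):

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Map.Entry;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MostFrequentDemo
{
    // Same generic helper as above, made static for the demo.
    static <V> V findMostFrequentItem(final Collection<V> items)
    {
        return items.stream()
            .filter(Objects::nonNull)
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .entrySet()
            .stream()
            .max(Comparator.comparing(Entry::getValue))
            .map(Entry::getKey)
            .orElse(null);
    }

    public static void main(String[] args)
    {
        // "bye" occurs three times, strictly more than any other word.
        List<String> words = List.of("hello", "bye", "ciao", "bye", "ciao", "bye");
        System.out.println(findMostFrequentItem(words)); // bye
    }
}
```

Note that when two items tie for the maximum count, which one wins depends on the iteration order of the intermediate map, so it is effectively unspecified.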

Compute item frequencies:

private <V> Map<V, Long> findFrequencies(final Collection<V> items)
{
  return items.stream()
      .filter(Objects::nonNull)
      .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}

If you use Eclipse Collections, you can just convert the List to a Bag.

Bag<String> words = 
    Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag();

Assert.assertEquals(2, words.occurrencesOf("ciao"));
Assert.assertEquals(1, words.occurrencesOf("hello"));
Assert.assertEquals(2, words.occurrencesOf("bye"));

You can also create a Bag directly using the Bags factory class.

Bag<String> words = 
    Bags.mutable.with("hello", "bye", "ciao", "bye", "ciao");

This code will work with Java 5+.

Note: I am a committer for Eclipse Collections.

I'll present the solution I made here (the one with grouping is much better :)).

static private Map<String, Integer> test0(List<String> input) {
    // Deduplicate first, then count each distinct word with Collections.frequency.
    Set<String> set = input.stream()
            .collect(Collectors.toSet());
    return set.stream()
            .collect(Collectors.toMap(Function.identity(),
                    str -> Collections.frequency(input, str)));
}

Just my $0.02.

Here's a way to create a frequency map using map functions.

List<String> words = Stream.of("hello", "bye", "ciao", "bye", "ciao").collect(toList());
Map<String, Integer> frequencyMap = new HashMap<>();

words.forEach(word ->
        frequencyMap.merge(word, 1, (v, newV) -> v + newV)
);

System.out.println(frequencyMap); // {ciao=2, hello=1, bye=2}

Or

words.forEach(word ->
       frequencyMap.compute(word, (k, v) -> v != null ? v + 1 : 1)
);

You can use Java 8 streams:

    Arrays.asList(s).stream()  // s is a String[] of words
          .collect(Collectors.groupingBy(Function.<String>identity(),
                  Collectors.<String>counting()));

Another 2 cents of mine, given an array:

import static java.util.stream.Collectors.*;

String[] str = {"hello", "bye", "ciao", "bye", "ciao"};    
Map<String, Integer> collected 
= Arrays.stream(str)
        .collect(groupingBy(Function.identity(), 
                    collectingAndThen(counting(), Long::intValue)));

public class Main {

    public static void main(String[] args) {


        String testString ="qqwweerrttyyaaaaaasdfasafsdfadsfadsewfywqtedywqtdfewyfdweytfdywfdyrewfdyewrefdyewdyfwhxvsahxvfwytfx"; 
        long java8Case2 = testString.codePoints().filter(ch -> ch =='a').count();
        System.out.println(java8Case2);

        ArrayList<Character> list = new ArrayList<Character>();
        for (char c : testString.toCharArray()) {
          list.add(c);
        }
        Map<Object, Integer> counts = list.parallelStream().
            collect(Collectors.toConcurrentMap(
                w -> w, w -> 1, Integer::sum));
        System.out.println(counts);
    }

}

I think there is a more readable way:

var words = List.of("my", "more", "more", "more", "simple", "way");
var count = words.stream().map(x -> Map.entry(x, 1))
                    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::sum));

Similar to the map-reduce approach, first map each word w to a pair (w, 1). Then aggregate (the reduce part) the counts (Map.Entry::getValue) of all pairs that share the same key, the word w (Map.Entry::getKey), and compute the sum (Integer::sum).

The final terminal operation will return a HashMap<String, Integer>:

{more=3, simple=1, my=1, way=1}
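Put together as a runnable sketch (the wrapper class is my own; Map.entry and var require Java 10+), the approach above can be checked directly:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class EntrySumCount
{
    public static void main(String[] args)
    {
        var words = List.of("my", "more", "more", "more", "simple", "way");
        // Map each word to (word, 1), then merge the values per key with Integer::sum.
        var count = words.stream()
            .map(x -> Map.entry(x, 1))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
        System.out.println(count.get("more")); // 3
    }
}
```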
public static void main(String[] args) {
    String str = "Hi Hello Hi";
    List<String> s = Arrays.asList(str.split(" "));
    Map<String, Long> hm =
            s.stream().collect(Collectors.groupingBy(Function.identity(),
                    Collectors.counting()));

    hm.entrySet().forEach(entry ->
            System.out.println(entry.getKey() + " " + entry.getValue()));
}
