[英]Word frequency count Java 8
如何統計Java 8中List的詞頻?
List <String> wordsList = Lists.newArrayList("hello", "bye", "ciao", "bye", "ciao");
結果必須是:
{ciao=2, hello=1, bye=2}
我想分享我找到的解決方案,因為起初我希望使用 map-and-reduce 方法,但它有點不同。
Map<String,Long> collect = wordsList.stream()
.collect( Collectors.groupingBy( Function.identity(), Collectors.counting() ));
或者對於整數值:
Map<String,Integer> collect = wordsList.stream()
.collect( Collectors.groupingBy( Function.identity(), Collectors.summingInt(e -> 1) ));
編輯
我添加了如何按值對地圖進行排序:
LinkedHashMap<String, Long> countByWordSorted = collect.entrySet()
.stream()
.sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
.collect(Collectors.toMap(
Map.Entry::getKey,
Map.Entry::getValue,
(v1, v2) -> {
throw new IllegalStateException();
},
LinkedHashMap::new
));
(注意:請參閱下面的編輯)
作為Mounas answer的替代方案,這是一種並行計算字數的方法:
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
public class ParallelWordCount
{
public static void main(String[] args)
{
List<String> list = Arrays.asList(
"hello", "bye", "ciao", "bye", "ciao");
Map<String, Integer> counts = list.parallelStream().
collect(Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
System.out.println(counts);
}
}
編輯作為對評論的回應,我用 JMH 進行了一個小測試,比較了
toConcurrentMap
和groupingByConcurrent
方法,不同的輸入列表大小和不同長度的隨機單詞。 該測試表明toConcurrentMap
方法更快。 當考慮這些方法“在幕后”有多么不同時,很難預測這樣的事情。作為進一步的擴展,基於進一步的評論,我擴展了測試以涵蓋
toMap
、groupingBy
、串行和並行的所有四種組合。結果仍然是
toMap
方法更快,但出乎意料(至少,對我來說)兩種情況下的“並發”版本都比串行版本慢......:
(method) (count) (wordLength) Mode Cnt Score Error Units
toConcurrentMap 1000 2 avgt 50 146,636 ± 0,880 us/op
toConcurrentMap 1000 5 avgt 50 272,762 ± 1,232 us/op
toConcurrentMap 1000 10 avgt 50 271,121 ± 1,125 us/op
toMap 1000 2 avgt 50 44,396 ± 0,541 us/op
toMap 1000 5 avgt 50 46,938 ± 0,872 us/op
toMap 1000 10 avgt 50 46,180 ± 0,557 us/op
groupingBy 1000 2 avgt 50 46,797 ± 1,181 us/op
groupingBy 1000 5 avgt 50 68,992 ± 1,537 us/op
groupingBy 1000 10 avgt 50 68,636 ± 1,349 us/op
groupingByConcurrent 1000 2 avgt 50 231,458 ± 0,658 us/op
groupingByConcurrent 1000 5 avgt 50 438,975 ± 1,591 us/op
groupingByConcurrent 1000 10 avgt 50 437,765 ± 1,139 us/op
toConcurrentMap 10000 2 avgt 50 712,113 ± 6,340 us/op
toConcurrentMap 10000 5 avgt 50 1809,356 ± 9,344 us/op
toConcurrentMap 10000 10 avgt 50 1813,814 ± 16,190 us/op
toMap 10000 2 avgt 50 341,004 ± 16,074 us/op
toMap 10000 5 avgt 50 535,122 ± 24,674 us/op
toMap 10000 10 avgt 50 511,186 ± 3,444 us/op
groupingBy 10000 2 avgt 50 340,984 ± 6,235 us/op
groupingBy 10000 5 avgt 50 708,553 ± 6,369 us/op
groupingBy 10000 10 avgt 50 712,858 ± 10,248 us/op
groupingByConcurrent 10000 2 avgt 50 901,842 ± 8,685 us/op
groupingByConcurrent 10000 5 avgt 50 3762,478 ± 21,408 us/op
groupingByConcurrent 10000 10 avgt 50 3795,530 ± 32,096 us/op
我對 JMH 不是很有經驗,也許我在這里做錯了 - 歡迎提出建議和更正:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
@State(Scope.Thread)
public class ParallelWordCount
{
@Param({"toConcurrentMap", "toMap", "groupingBy", "groupingByConcurrent"})
public String method;
@Param({"2", "5", "10"})
public int wordLength;
@Param({"1000", "10000" })
public int count;
private List<String> list;
@Setup
public void initList()
{
list = createRandomStrings(count, wordLength, new Random(0));
}
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void testMethod(Blackhole bh)
{
if (method.equals("toMap"))
{
Map<String, Integer> counts =
list.stream().collect(
Collectors.toMap(
w -> w, w -> 1, Integer::sum));
bh.consume(counts);
}
else if (method.equals("toConcurrentMap"))
{
Map<String, Integer> counts =
list.parallelStream().collect(
Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
bh.consume(counts);
}
else if (method.equals("groupingBy"))
{
Map<String, Long> counts =
list.stream().collect(
Collectors.groupingBy(
Function.identity(), Collectors.<String>counting()));
bh.consume(counts);
}
else if (method.equals("groupingByConcurrent"))
{
Map<String, Long> counts =
list.parallelStream().collect(
Collectors.groupingByConcurrent(
Function.identity(), Collectors.<String> counting()));
bh.consume(counts);
}
}
private static String createRandomString(int length, Random random)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++)
{
int c = random.nextInt(26);
sb.append((char) (c + 'a'));
}
return sb.toString();
}
private static List<String> createRandomStrings(
int count, int length, Random random)
{
List<String> list = new ArrayList<String>(count);
for (int i = 0; i < count; i++)
{
list.add(createRandomString(length, random));
}
return list;
}
}
僅對於具有 10000 個元素和 2 個字母單詞的列表的串行情況,時間相似。
值得檢查一下對於更大的列表大小,並發版本最終是否優於串行版本,但目前沒有時間使用所有這些配置運行另一個詳細的基准測試。
使用泛型查找集合中最常見的項目:
private <V> V findMostFrequentItem(final Collection<V> items)
{
return items.stream()
.filter(Objects::nonNull)
.collect(Collectors.groupingBy(Functions.identity(), Collectors.counting()))
.entrySet()
.stream()
.max(Comparator.comparing(Entry::getValue))
.map(Entry::getKey)
.orElse(null);
}
計算項目頻率:
private <V> Map<V, Long> findFrequencies(final Collection<V> items)
{
return items.stream()
.filter(Objects::nonNull)
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
}
如果您使用Eclipse Collections ,您只需將List
轉換為Bag
。
Bag<String> words =
Lists.mutable.with("hello", "bye", "ciao", "bye", "ciao").toBag();
Assert.assertEquals(2, words.occurrencesOf("ciao"));
Assert.assertEquals(1, words.occurrencesOf("hello"));
Assert.assertEquals(2, words.occurrencesOf("bye"));
您也可以直接使用Bags
工廠類創建Bag
。
Bag<String> words =
Bags.mutable.with("hello", "bye", "ciao", "bye", "ciao");
此代碼適用於 Java 5+。
注意:我是 Eclipse Collections 的提交者
我將在這里介紹我制作的解決方案(分組的解決方案要好得多:))。
static private void test0(List<String> input) {
Set<String> set = input.stream()
.collect(Collectors.toSet());
set.stream()
.collect(Collectors.toMap(Function.identity(),
str -> Collections.frequency(input, str)));
}
只是我的 0.02 美元
這是一種使用映射函數創建頻率圖的方法。
List<String> words = Stream.of("hello", "bye", "ciao", "bye", "ciao").collect(toList());
Map<String, Integer> frequencyMap = new HashMap<>();
words.forEach(word ->
frequencyMap.merge(word, 1, (v, newV) -> v + newV)
);
System.out.println(frequencyMap); // {ciao=2, hello=1, bye=2}
或者
words.forEach(word ->
frequencyMap.compute(word, (k, v) -> v != null ? v + 1 : 1)
);
你可以使用 Java 8 Streams
Arrays.asList(s).stream()
.collect(Collectors.groupingBy(Function.<String>identity(),
Collectors.<String>counting()));
給定一個數組,我的另外 2 美分:
import static java.util.stream.Collectors.*;
String[] str = {"hello", "bye", "ciao", "bye", "ciao"};
Map<String, Integer> collected
= Arrays.stream(str)
.collect(groupingBy(Function.identity(),
collectingAndThen(counting(), Long::intValue)));
public class Main {
public static void main(String[] args) {
String testString ="qqwweerrttyyaaaaaasdfasafsdfadsfadsewfywqtedywqtdfewyfdweytfdywfdyrewfdyewrefdyewdyfwhxvsahxvfwytfx";
long java8Case2 = testString.codePoints().filter(ch -> ch =='a').count();
System.out.println(java8Case2);
ArrayList<Character> list = new ArrayList<Character>();
for (char c : testString.toCharArray()) {
list.add(c);
}
Map<Object, Integer> counts = list.parallelStream().
collect(Collectors.toConcurrentMap(
w -> w, w -> 1, Integer::sum));
System.out.println(counts);
}
}
我認為有一種更具可讀性的方式:
var words = List.of("my", "more", "more", "more", "simple", "way");
var count = words.stream().map(x -> Map.entry(x, 1))
.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
類似於 map-reduce 方法,首先將 map 中的每個單詞w到 a ( w , 1)。 然后聚合(減少部分)所有對的計數( Map.Entry::getValue
),其中它們的鍵(word w )相似,( Map.Entry::getKey
)並計算總和( Integer::sum
)。
最后的終端操作將返回一個HashMap<String, Integer>
:
{more=3, simple=1, my=1, way=1}
public static void main(String[] args) {
String str = "Hi Hello Hi";
List<String> s = Arrays.asList(str.split(" "));
Map<String, Long> hm =
s.stream().collect(Collectors.groupingBy(Function.identity(),
Collectors.counting()));
hm.entrySet().forEach(entry -> {
System.out.println(entry.getKey() + " " + entry.getValue());
});
}
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.