
Count frequency of each word from list of Strings using Java8

I have two lists of Strings. I need to create a map that counts, for each string in one list, how many strings in the other list contain it. If a string appears more than once within a single string of the other list, it should still be counted as one occurrence.

For example:

String[] listA={"the", "you" , "how"}; 
String[] listB = {"the dog ate the food", "how is the weather" , "how are you"};

The Map<String, Integer> map will take its keys from the strings in listA, and each value is the number of occurrences. So the map will have the key-value pairs ("the",2) ("you",1) ("how",2).

Note: Though "the" is repeated twice in "the dog ate the food", it is counted as only one occurrence because both repetitions are in the same string.

How do I write this using Java 8? I tried this approach, but it does not work:

Set<String> sentenceSet = Stream.of(listB).collect(Collectors.toSet());

Map<String, Long> frequency1 =  Stream.of(listA)
    .filter(e -> sentenceSet.contains(e))
    .collect(Collectors.groupingBy(t -> t, Collectors.counting()));

You need to extract all the words from listB and keep only those that are also listed in listA. Then you simply collect the word -> count pairs into a Map<String, Long>:

String[] listA={"the", "you", "how"};
String[] listB = {"the dog ate the food", "how is the weather" , "how are you"};

Set<String> qualified = new HashSet<>(Arrays.asList(listA));   // make searching easier

Map<String, Long> map = Arrays.stream(listB)   // stream the sentences
    .map(sentence -> sentence.split("\\s+"))   // split by words to Stream<String[]>
    .flatMap(words -> Arrays.stream(words)     // flatmap to Stream<String>
                            .distinct())       // ... as distinct words by sentence
    .filter(qualified::contains)               // keep only the qualified words
    .collect(Collectors.groupingBy(            // collect to the Map
        Function.identity(),                   // ... the key is the word itself
        Collectors.counting()));               // ... the value is its frequency

Output:

{the=2, how=2, you=1}
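For anyone who wants to run this directly, here is a minimal self-contained sketch of the same pipeline with the imports it needs. The class name SentenceFrequency and the method name count are just illustrative names, not part of the snippet above. The added trim() is an assumption on my part, to guard against sentences that start with whitespace.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical wrapper class; the name is only for illustration.
class SentenceFrequency {

    // For each keyword, counts the number of sentences that contain it at least once.
    static Map<String, Long> count(String[] keywords, String[] sentences) {
        Set<String> qualified = new HashSet<>(Arrays.asList(keywords));
        return Arrays.stream(sentences)
            // split each sentence into words and drop repeats within that sentence
            .flatMap(sentence -> Arrays.stream(sentence.trim().split("\\s+")).distinct())
            // keep only words that are keywords
            .filter(qualified::contains)
            // group identical words and count how many sentences produced them
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        String[] listA = {"the", "you", "how"};
        String[] listB = {"the dog ate the food", "how is the weather", "how are you"};
        System.out.println(count(listA, listB));
    }
}
```

Note that a keyword which never appears in any sentence will be missing from the resulting map rather than mapped to 0, and that the values are Long, not Integer.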

I suggest creating a hash table of the items in the first list. Then loop through the items in the second list, checking whether each one is in the hash table. When adding the elements of the first list, test whether each is already there and decide whether you want to keep a count. You could, for instance, store which sentence a word is in as the value for the key.
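One possible reading of this suggestion, as a rough sketch rather than the only way to do it: put the words of listA into a map with a starting count of 0, then loop over the sentences and bump the count at most once per sentence. The class name LoopFrequency, the method name count, and the per-sentence "seen" set are my own choices for illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical class name; a plain-loop alternative to the stream version.
class LoopFrequency {

    static Map<String, Integer> count(String[] keywords, String[] sentences) {
        // LinkedHashMap keeps the keywords in the order they were given
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            counts.put(keyword, 0);
        }
        for (String sentence : sentences) {
            // words already counted for this sentence
            Set<String> seen = new HashSet<>();
            for (String word : sentence.trim().split("\\s+")) {
                // count a keyword only the first time it shows up in this sentence
                if (counts.containsKey(word) && seen.add(word)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] listA = {"the", "you", "how"};
        String[] listB = {"the dog ate the food", "how is the weather", "how are you"};
        System.out.println(count(listA, listB)); // prints {the=2, you=1, how=2}
    }
}
```

Unlike the stream version, this gives a Map<String, Integer> as asked for in the question, and keywords that never occur stay in the map with a count of 0.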
