
Counting the number of occurrences of words in a text (Java)

I'm building a TreeMap from scratch and trying to count the number of occurrences of every word in a text using Java. The text is read from a text file, and reading it is not the problem. What I don't know is how to count every word. Can someone help?

Imagine the text is something like:

Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.

Output: 
Over 1
time 1
computer 1
algorithms 5
...

If possible, I want to ignore case and count uppercase and lowercase occurrences of a word together.

EDIT: I don't want to use any sort of Map (HashMap, for example) or anything similar to do this.

Break down the problem as follows (this is one potential solution, not THE solution):

  1. Split the text into words (create a list or array of words).
  2. Remove punctuation marks.
  3. Create your map to collect results.
  4. Iterate over your list of words and add "1" to the value of each encountered key.
  5. Display results (iterate over the map's EntrySet).

Split the text into words

My preference is to split words using a space as the delimiter. The reason is that if you split on non-word characters, you may break up hyphenated words. I know that the use of hyphenation is declining, but there are still plenty of words that fall under this rule; for example, middle-aged. If a word such as this is encountered, it MIGHT have to be treated as one word and not two.
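To make the trade-off concrete, here is a small sketch comparing the two splitting strategies on a hyphenated word. The class name SplitDemo and the sample phrase are illustrative assumptions, not part of the original question.

```java
import java.util.Arrays;

class SplitDemo {
    // Split on single spaces: hyphenated words stay in one piece.
    static String[] splitOnSpaces(String text) {
        return text.split(" ");
    }

    // Split on runs of non-word characters: the hyphen acts as a delimiter.
    static String[] splitOnNonWord(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        String sample = "a middle-aged engineer";
        System.out.println(Arrays.toString(splitOnSpaces(sample)));  // [a, middle-aged, engineer]
        System.out.println(Arrays.toString(splitOnNonWord(sample))); // [a, middle, aged, engineer]
    }
}
```

Note that splitting on a single space leaves punctuation attached to the words, which is why the next step is needed.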

Remove punctuation marks

Because of the decision above, you will first need to remove punctuation characters that might be attached to your words. Keep in mind that if you use a regular expression to split the words, you might be able to accomplish this step at the same time as the step above. In fact, that would be preferred, so that you don't have to iterate over the text twice; do both in a single pass. While you are at it, call toLowerCase() on the input string to eliminate the distinction between capitalized and lowercase words.
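One way to fold splitting, punctuation removal, and lowercasing into a single expression is sketched below. The regex reflects an assumption about which characters belong to a word (letters, digits, apostrophes, and hyphens), and the class name Tokenizer is hypothetical; adjust both to your needs.

```java
import java.util.Arrays;

class Tokenizer {
    // Lowercase the text, then split on runs of characters that are NOT
    // letters, digits, apostrophes or hyphens. Punctuation and whitespace
    // are consumed by the delimiter, so no separate cleanup pass is needed.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{N}'-]+");
    }

    public static void main(String[] args) {
        String sample = "Over time, middle-aged engineers invent things.";
        System.out.println(Arrays.toString(tokenize(sample)));
        // [over, time, middle-aged, engineers, invent, things]
    }
}
```

One caveat: if the input begins with punctuation or whitespace, split() returns an empty string as the first token, so you may want to skip empty tokens while counting.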

Create your map to collect results

This is where you are going to collect your counts, using the TreeMap implementation of the Java Map interface. One thing to be aware of with this particular implementation is that the map is sorted according to the natural ordering of its keys. In this case, since the keys are the words from the input text, the entries will be arranged in alphabetical order, not by the magnitude of the count. If sorting the entries by count is important, there is a technique where you "reverse" the map, making the values the keys and the keys the values. However, since two or more words could have the same count, you will need to create a new map of <Integer, Set<String>> so that you can group together words with the same count.
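The "reversed" map could be built roughly as follows. This is a sketch under the assumption that the highest count should come first and that words sharing a count should be listed alphabetically; the class and method names (InvertCounts, invert) are made up for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertCounts {
    // Build count -> words, highest count first; words with equal counts
    // share one alphabetically sorted set.
    static Map<Integer, Set<String>> invert(Map<String, Integer> wordCount) {
        Map<Integer, Set<String>> byCount = new TreeMap<>(Collections.reverseOrder());
        for (Map.Entry<String, Integer> e : wordCount.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());
        }
        return byCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new TreeMap<>();
        wordCount.put("algorithms", 5);
        wordCount.put("for", 2);
        wordCount.put("other", 2);
        wordCount.put("time", 1);
        System.out.println(invert(wordCount)); // {5=[algorithms], 2=[for, other], 1=[time]}
    }
}
```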

Iterate over your list of words

At this point, you should have a list of words and a map structure to collect the counts. Using a lambda expression, you should be able to perform a count() of your words very easily. But if you are not familiar or comfortable with Lambda expressions, you can use a regular loop to iterate over your list: do a containsKey() check to see if the word was encountered before, get() the value if the map already contains the word, and add "1" to the previous value. Lastly, put() the new count in the map.
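For readers who are comfortable with lambdas, the counting step might look like one of the two variants below. This is only a sketch: the class name CountWithLambdas and the sample words are assumptions, and the stream variant yields Long counts rather than Integer.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class CountWithLambdas {
    // Map.merge inserts 1 for a new word, or adds 1 to the existing count.
    static Map<String, Integer> countWithMerge(String[] words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Stream version: group equal words and count each group into a TreeMap.
    static Map<String, Long> countWithStream(String[] words) {
        return Arrays.stream(words)
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        String[] words = {"other", "algorithms", "other", "algorithms", "algorithms"};
        System.out.println(countWithMerge(words));  // {algorithms=3, other=2}
        System.out.println(countWithStream(words)); // {algorithms=3, other=2}
    }
}
```

The merge() variant replaces the containsKey()/get()/put() sequence with a single call while keeping the explicit loop.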

Display results

Again, you can use a Lambda expression to print out the EntrySet key-value pairs, or simply iterate over the entry set to display the results.
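As a rough illustration of the lambda route, Map.forEach accepts a two-argument lambda that receives each key and value. The class DisplayCounts and its format helper are hypothetical names introduced here so the output can be checked as a string.

```java
import java.util.Map;
import java.util.TreeMap;

class DisplayCounts {
    // Format every entry as "word: count", one per line, in map order.
    static String format(Map<String, Integer> wordCount) {
        StringBuilder sb = new StringBuilder();
        wordCount.forEach((word, count) -> sb.append(word).append(": ").append(count).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new TreeMap<>();
        wordCount.put("algorithms", 5);
        wordCount.put("for", 2);
        // Print directly with a lambda...
        wordCount.forEach((word, count) -> System.out.println(word + ": " + count));
        // ...or build the text first, which is easier to test.
        System.out.print(format(wordCount));
    }
}
```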

Based on all of the above points, a potential solution could look like this (not using Lambdas, for the OP's sake):

public static void main(String[] args) {
    String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
    
    text = text.replaceAll("\\p{P}", ""); // replace all punctuations
    text = text.toLowerCase(); // turn all words into lowercase
    String[] wordArr = text.split(" "); // create list of words

    Map<String, Integer> wordCount = new TreeMap<>();
    
    // Collect the word count
    for (String word : wordArr) {
        if(!wordCount.containsKey(word)){
            wordCount.put(word, 1);
        } else {
            int count = wordCount.get(word);
            wordCount.put(word, count + 1);
        }
    }
    
    Iterator<Entry<String, Integer>> iter = wordCount.entrySet().iterator();
    
    System.out.println("Output: ");
    while(iter.hasNext()) {
        Entry<String, Integer> entry = iter.next();
        System.out.println(entry.getKey() + ": " + entry.getValue());
    }
}

This produces the following output:

Output: 
advantage: 1
algorithms: 5
and: 1
combine: 1
computer: 1
each: 1
engineers: 1
even: 1
for: 2
in: 1
invent: 1
more: 1
new: 1
of: 2
other: 2
others: 1
over: 1
producing: 1
results: 2
take: 1
the: 1
things: 1
time: 1
to: 1
turn: 1
utilize: 1
with: 1
work: 1

Why did I break down the problem like this for such a mundane task? Simple. I believe each of those discrete steps should be extracted into its own function to improve code reusability. Yes, it is cool to use a Lambda expression to do everything at once and make your code look much simpler. But what if you need to repeat some intermediate step over and over? Most of the time, code gets duplicated to accomplish this. In reality, a better solution is often to break these tasks into methods. Some of these tasks, like transforming the input text, can be done in a single method since those activities are closely related. (There is such a thing as a method doing "too little.")

public String[] createWordList(String text) {
    return text.replaceAll("\\p{P}", "").toLowerCase().split(" ");
}

public Map<String, Integer> createWordCountMap(String[] wordArr) {
    Map<String, Integer> wordCountMap = new TreeMap<>();

    for (String word : wordArr) {
        if(!wordCountMap.containsKey(word)){
            wordCountMap.put(word, 1);
        } else {
            int count = wordCountMap.get(word);
            wordCountMap.put(word, count + 1);
        }
    }

    return wordCountMap;
}

public void displayCount(Map<String, Integer> wordCountMap) {
    Iterator<Entry<String, Integer>> iter = wordCountMap.entrySet().iterator();
    
    while(iter.hasNext()) {
        Entry<String, Integer> entry = iter.next();
        System.out.println(entry.getKey() + ": " + entry.getValue());
    }
}

Now, after doing that, your main method is more readable and your code is more reusable.

public static void main(String[] args) {
    
    WordCount wc = new WordCount();
    String text = "...";
    
    String[] wordArr = wc.createWordList(text);
    Map<String, Integer> wordCountMap = wc.createWordCountMap(wordArr);
    wc.displayCount(wordCountMap);
}

UPDATE:

One small detail I forgot to mention: if a HashMap is used instead of a TreeMap, the output is no longer sorted alphabetically. Note that HashMap makes no guarantee about iteration order; the order depends on the hash codes of the keys (the words) and on the table's capacity, not on the values. Any resemblance to a descending sort by count is therefore coincidental, and you still need to "reverse" the map or sort the entries explicitly if you want the results ordered by count. After switching to HashMap, the output might look like this:

Output: 
algorithms: 5
other: 2
for: 2
turn: 1
computer: 1
producing: 1
...
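If output ordered by count is actually required, one option is to copy the entries into a list and sort them explicitly instead of relying on the map's iteration order. The sketch below assumes descending count with alphabetical tie-breaking; SortByCount and sortedByCount are illustrative names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SortByCount {
    // Return the entries ordered by count (descending), ties broken alphabetically.
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> wordCount) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(wordCount.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCount = new HashMap<>();
        wordCount.put("time", 1);
        wordCount.put("other", 2);
        wordCount.put("algorithms", 5);
        wordCount.put("for", 2);
        for (Map.Entry<String, Integer> e : sortedByCount(wordCount)) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
        // algorithms: 5
        // for: 2
        // other: 2
        // time: 1
    }
}
```

This works the same way for a HashMap or a TreeMap, because the ordering comes from the comparator rather than from the map.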

My suggestion is to use a regex split and a stream with grouping. I think that's what you mean, but I'm not sure whether using a List already goes too far for you.

@Test
public void testApp() {
    final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
    final String[] split = text.split("\\W+");
    final List<String> list = new ArrayList<>();
    System.out.println("Output: ");
    for (String s : split) {
        if(!list.contains(s.toUpperCase())){ // compare in the same case that is stored, so each word prints once
            list.add(s.toUpperCase());
            final long count = Arrays.stream(split).filter(s::equalsIgnoreCase).count();
            System.out.println(s+" "+count);
        }
    }

}
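A variation on the Map-free approach above avoids rescanning the whole array for every word: sort the words so that equal words become adjacent, then count the length of each run. This is a sketch only; SortedRunCounter is a hypothetical name, and the "\\W+" split is carried over from the test above, so "other's" is split into "other" and "s".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedRunCounter {
    // Lowercase and sort the words so equal words become adjacent,
    // then count the length of each run. No Map is involved.
    static List<String> count(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        Arrays.sort(words);
        List<String> lines = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int j = i;
            while (j < words.length && words[j].equals(words[i])) {
                j++;
            }
            if (!words[i].isEmpty()) { // skip the empty token a leading delimiter can produce
                lines.add(words[i] + " " + (j - i));
            }
            i = j;
        }
        return lines;
    }

    public static void main(String[] args) {
        count("Algorithms combine with other algorithms.").forEach(System.out::println);
        // algorithms 2
        // combine 1
        // other 1
        // with 1
    }
}
```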

Below is a test for your example, but using a Map:

@Test
public void test() {
    final String text = "Over time, computer engineers take advantage of each other's work and invent algorithms for new things. Algorithms combine with other algorithms to utilize the results of other algorithms, in turn producing results for even more algorithms.";
    Map<String, Long> result = Arrays.stream(text.split("\\W+")).collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
    assertEquals(Long.valueOf(5), result.get("algorithms")); // expected value first; new Long(...) is deprecated
}
