
Count occurrences of letters from a URL

I'm trying to count the occurrences of each letter from a URL.

I found this code, which seems to do the trick, but there are a few things I would like to have explained.

1) I'm using the Norwegian alphabet, so I need to add three more letters (æ, ø, å). I changed the array size to 29, but it did not work.

2) Could you please explain what %c%7d\n means?

import java.io.FileReader;
import java.io.IOException;

public class FrequencyAnalysis {
    public static void main(String[] args) throws IOException {
        FileReader reader = new FileReader("PlainTextDocument.txt");

        System.out.println("Letter Frequency");

        int nextChar;
        char ch;

        // One counter per letter, a to z
        int[] count = new int[26];

        // Loop through the file one character at a time
        while ((nextChar = reader.read()) != -1) {
            ch = Character.toLowerCase((char) nextChar);

            if (ch >= 'a' && ch <= 'z')
                count[ch - 'a']++;
        }

        // Print out
        for (int i = 0; i < 26; i++) {
            System.out.printf("%c%7d\n", i + 'A', count[i]);
        }

        reader.close();
    }
}

You haven't said how you checked for the three additional letters. It is not enough to increase the size of the count array. You also need to account for the Unicode code point values of the new characters. Those values do not follow directly after 'z' (æ is U+00E6, ø is U+00F8, å is U+00E5), so ch - 'a' no longer gives a usable array index. In that case, you can use a Map<Integer, Integer> to store the frequencies.
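To make the Map<Integer, Integer> suggestion concrete, here is a minimal sketch. The class name CodePointFrequency, the method countLetters, and the sample text are made up for illustration; the map is keyed by Unicode code point, so no assumption about the alphabet being contiguous is needed.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class CodePointFrequency {
    // Hypothetical helper: counts every letter in the text, ignoring case,
    // keyed by its Unicode code point.
    static Map<Integer, Integer> countLetters(String text) {
        Map<Integer, Integer> counts = new TreeMap<>(); // sorted by code point
        text.toLowerCase(Locale.ROOT)
            .codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(cp, 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Sample text with the three Norwegian letters written as escapes:
        // U+00E5 (a with ring), U+00E6 (ae), U+00F8 (o with stroke).
        Map<Integer, Integer> counts = countLetters("Bl\u00E5b\u00E6r og \u00F8l");
        counts.forEach((cp, n) -> System.out.printf("%c%7d%n", cp, n));
    }
}
```

A TreeMap is used here only so the output comes out ordered by code point; a HashMap would count just as well.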

%c is the format specifier for a Unicode character. %7d formats an integer right-aligned in a field at least 7 characters wide, padded on the left with spaces. \n is a newline character.

This is all documented in the java.util.Formatter Javadoc.
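A small sketch of what that format string produces may help. The class FormatDemo and the helper row are invented for this illustration; the format string itself is the one from the question.

```java
public class FormatDemo {
    // Hypothetical helper: formats one output row with the question's format string.
    static String row(int letter, int count) {
        return String.format("%c%7d\n", letter, count);
    }

    public static void main(String[] args) {
        System.out.print(row('A', 42));       // A     42
        System.out.print(row('B', 1234567));  // B1234567
        System.out.print(row(0 + 'A', 0));    // A      0
    }
}
```

The width 7 is a minimum: a number with more than seven digits is printed in full and simply pushes the column wider.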

An important point here is that when you increment the number of occurrences in your array, you are implicitly using the numeric value of the character (its Unicode value, which equals the ASCII code for the letters a to z) to compute the index:

// Here, ch is a char.
ch = Character.toLowerCase((char) nextChar);

// I hate if statements without curly brackets, but that is off-topic :)
if (ch >= 'a' && ch <= 'z')

    /*
     * Here, ch is implicitly promoted to an int.
     * The int value of a char is its Unicode (UTF-16) value, which
     * matches the ASCII code for plain Latin letters.
     * For example, the value of 'a' is 97.
     * So if ch is 'a', (ch - 'a') = (97 - 97) = 0.
     * That's why you are incrementing count[0] in this case.
     *
     * Now, what happens if ch = 'ø'? It is not an ASCII character at all.
     * After a quick test, 'ø' = 248, so ch - 'a' = 151, far beyond the
     * bounds of an array of size 26 + 3. In this code the if above
     * rejects it first, so 'ø' is simply never counted.
     *
     * Enlarging the array would only work if the Norwegian characters
     * had the values 123 to 125, directly after 'z' (122).
     */
    count[ch - 'a']++;
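The arithmetic can be checked directly. The sketch below also shows one possible way to keep a 29-element array: look the letter up in an alphabet string instead of subtracting 'a'. This lookup is a swapped-in alternative, not something from the original code, and the names CharIndexDemo, ALPHABET, naiveIndex and alphabetIndex are hypothetical.

```java
public class CharIndexDemo {
    // a to z followed by U+00E6 (ae), U+00F8 (o with stroke), U+00E5 (a with ring),
    // which is the order of the Norwegian alphabet.
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00E6\u00F8\u00E5";

    // The index computation used in the question's code.
    static int naiveIndex(char ch) {
        return ch - 'a';
    }

    // Alternative: position in the alphabet string, or -1 if ch is not in it.
    static int alphabetIndex(char ch) {
        return ALPHABET.indexOf(ch);
    }

    public static void main(String[] args) {
        System.out.println(naiveIndex('a'));          // 0
        System.out.println(naiveIndex('z'));          // 25
        System.out.println(naiveIndex('\u00F8'));     // 151
        System.out.println(alphabetIndex('\u00F8'));  // 27
    }
}
```

With alphabetIndex, the counting loop would increment count[index] whenever the index is not -1, and the array would have ALPHABET.length() elements.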

You should rewrite the algorithm using a HashMap<Character, Integer> instead.

// Maps each character to its number of occurrences.
HashMap<Character, Integer> map = new HashMap<Character, Integer>();

while ((nextChar = reader.read()) != -1) {
  // reader.read() returns an int; cast it so the key is a Character
  ch = Character.toLowerCase((char) nextChar);
  if (!map.containsKey(ch)) {
    map.put(ch, 0);
  }
  map.put(ch, map.get(ch) + 1);
}
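Put together as a complete program, the map-based approach might look like the sketch below. The class name LetterFrequency and the method count are invented for the example, and reading the file as UTF-8 is an assumption: FileReader in the original code uses the platform default encoding, which may or may not decode æ, ø and å correctly.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class LetterFrequency {
    // Hypothetical helper: reads until end of stream and counts each letter,
    // ignoring case.
    static Map<Character, Integer> count(Reader reader) {
        Map<Character, Integer> counts = new TreeMap<>(); // sorted by char value
        try {
            int next;
            while ((next = reader.read()) != -1) {
                char ch = Character.toLowerCase((char) next);
                if (Character.isLetter(ch)) {
                    counts.merge(ch, 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a file passed as the first argument is UTF-8 encoded.
        // Without an argument, a short sample string is counted instead.
        try (Reader reader = args.length > 0
                ? Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)
                : new StringReader("Bl\u00E5b\u00E6rsyltet\u00F8y")) {
            System.out.println("Letter Frequency");
            count(reader).forEach((ch, n) -> System.out.printf("%c%7d%n", ch, n));
        }
    }
}
```

Map.merge replaces the containsKey/put pair: it stores 1 for a new key and otherwise adds 1 to the existing count.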
