
Character counts do not add up in Java

I am writing a Java program that goes through a file and provides counts of characters. The problem I'm having is that my counts aren't adding up. When I add the isAlphabetic(char c) and isDigit(char c) counts together, the sum does not equal the isLetterOrDigit(char c) count (please forgive me if I'm using the wrong terminology).

What am I missing? Here is a copy of my code so far.

            for (String word : words) {
                char[] ch = word.toCharArray();
                for (int i = 0; i < word.length(); i++) {
                    if (Character.isBmpCodePoint(ch[i])) {
                        charCount++;
                        if (Character.isLetterOrDigit(ch[i])) {
                            alphnumCount++;
                        }
                        if (Character.isAlphabetic(ch[i])) {
                            alphabetCount++;
                        }
                        if (Character.isDigit(ch[i])) {
                            numericCount++;
                        }
                    }
                }
            }
            // Reading next line into currentLine
            currentLine = reader.readLine();
        }
        // Printing charCount, wordCount and lineCount
        System.out.println("Number Of Chars In..Lab.docx File : " + charCount);
        System.out.println("Number Of Alph+Numeric Chars In..Lab.docx File : " + alphnumCount);
        System.out.println("Number Of Alphabet Chars In..Lab.docx File : " + alphabetCount);
        System.out.println("Number Of Numeric Chars In..Lab.docx File : " + numericCount);
        System.out.println("Number Of Words In..Lab.docx File : " + wordCount);
        System.out.println("Number Of Lines In..Lab.docx File : " + lineCount);
        System.out.println(alphabetCount + numericCount - alphnumCount);

        reader.close(); // Closing the reader
    }
}

I think the problem here is that you are reading a *.docx file.

If I use a simple text file with the following content, your code works.

Test123
7asdf

The output is:

Number Of Chars In..CSCI_1136_Lab6.docx File : 12
Number Of Alph+Numeric Chars In..CSCI_1136_Lab6.docx File : 12
Number Of Alphabet Chars In..CSCI_1136_Lab6.docx File : 8
Number Of Numeric Chars In..CSCI_1136_Lab6.docx File : 4
Number Of Words In..CSCI_1136_Lab6.docx File : 2
Number Of Lines In..CSCI_1136_Lab6.docx File : 2
0

Counting the characters of a *.docx file this way is not possible, because you are interpreting the raw bytes of the file as a String, which they are not.

DOCX is written in an XML format, which consists of a ZIP archive file containing XML and binaries.

From forensicswiki.

So *.docx files are not stored as plain text, which is what your code expects.
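To make the ZIP point concrete, here is a minimal sketch of pulling the text out of a DOCX with only the JDK. It assumes the main body lives in the archive entry word/document.xml and that stripping tags with a regular expression is good enough for a rough count; the class name DocxTextSketch and the helper sampleDocx() are made up for this illustration, and sampleDocx() builds a toy archive, not a valid Word document. For real files a dedicated library such as Apache POI is probably the safer route.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxTextSketch {

    // Returns the text found in word/document.xml of a DOCX stream,
    // or an empty string if the archive has no such entry.
    static String extractText(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    // readAllBytes() (Java 9+) stops at the end of the current entry.
                    String xml = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                    // Crude approach: turn paragraph ends into newlines, then drop all tags.
                    // XML entities such as &amp; are NOT decoded here.
                    return xml.replace("</w:p>", "\n").replaceAll("<[^>]+>", "").trim();
                }
            }
            return "";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: builds a tiny in-memory ZIP that mimics the DOCX layout.
    static byte[] sampleDocx() {
        String xml = "<w:document><w:body>"
                + "<w:p><w:r><w:t>Test123</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>7asdf</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String text = extractText(new ByteArrayInputStream(sampleDocx()));
        System.out.println(text); // two lines: Test123 and 7asdf
    }
}
```

The extracted string can then be fed line by line into the counting loop from the question instead of the raw file contents.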

Another point is that you are using Character.isAlphabetic() instead of Character.isLetter():

From the docs for Character.isAlphabetic():

Determines if the specified character (Unicode code point) is an alphabet.

A character is considered to be alphabetic if its general category type, provided by getType(codePoint), is any of the following:

  • UPPERCASE_LETTER
  • LOWERCASE_LETTER
  • TITLECASE_LETTER
  • MODIFIER_LETTER
  • OTHER_LETTER
  • LETTER_NUMBER

or it has contributory property Other_Alphabetic as defined by the Unicode Standard.

From the docs for Character.isLetter():

Determines if the specified character (Unicode code point) is a letter.

A character is considered to be a letter if its general category type, provided by Character.getType(ch), is any of the following:

  • UPPERCASE_LETTER
  • LOWERCASE_LETTER
  • TITLECASE_LETTER
  • MODIFIER_LETTER
  • OTHER_LETTER

Not all letters have case. Many characters are letters but are neither uppercase nor lowercase nor titlecase.
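As a quick check of the two category lists quoted above, the following sketch prints how the Character predicates classify three code points. The class name LetterVsAlphabetic and the choice of U+2167 (ROMAN NUMERAL EIGHT, general category LETTER_NUMBER) as a probe are my own picks for illustration, not something from the question.

```java
class LetterVsAlphabetic {

    // One-line summary of how java.lang.Character classifies a code point.
    static String describe(int cp) {
        return String.format(
                "U+%04X letterNumber=%b alphabetic=%b letter=%b digit=%b letterOrDigit=%b",
                cp,
                Character.getType(cp) == Character.LETTER_NUMBER,
                Character.isAlphabetic(cp),
                Character.isLetter(cp),
                Character.isDigit(cp),
                Character.isLetterOrDigit(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe('a'));    // ordinary lowercase letter
        System.out.println(describe('7'));    // decimal digit
        System.out.println(describe(0x2167)); // LETTER_NUMBER: alphabetic, but not a letter
    }
}
```

For U+2167 only the letterNumber and alphabetic flags come out true, so such a character is counted by isAlphabetic() yet by neither isLetter() nor isDigit().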

So the two methods differ: isAlphabetic() also accepts LETTER_NUMBER characters and characters with the Other_Alphabetic property, which isLetter() rejects. Character.isLetterOrDigit() is built on isLetter() and isDigit():

Determines if the specified character (Unicode code point) is a letter or digit.

A character is considered to be a letter or digit if either isLetter(codePoint) or isDigit(codePoint) returns true for the character.

From the docs for Character.isLetterOrDigit().

So if you use Character.isLetter() instead of Character.isAlphabetic(), your result should be correct.

This is my result for a *.docx file using Character.isLetter():
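A small sketch of that fix, under the assumption that the text has already been read into a String: it counts letters, alphabetic characters, digits and letters-or-digits side by side so the two sums can be compared. The class name CharCountSketch and the sample string are invented for this example, and the loop walks code points rather than char values, which is a slight departure from the code in the question.

```java
class CharCountSketch {

    // Returns {letters, alphabetic, digits, lettersOrDigits} for the given text.
    static int[] count(String text) {
        int letters = 0, alphabetic = 0, digits = 0, lettersOrDigits = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) letters++;
            if (Character.isAlphabetic(cp)) alphabetic++;
            if (Character.isDigit(cp)) digits++;
            if (Character.isLetterOrDigit(cp)) lettersOrDigits++;
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return new int[] {letters, alphabetic, digits, lettersOrDigits};
    }

    public static void main(String[] args) {
        // \u2167 is ROMAN NUMERAL EIGHT: alphabetic, but neither letter nor digit.
        int[] c = count("Test123 7asdf \u2167");
        System.out.println("isLetter + isDigit - isLetterOrDigit     = " + (c[0] + c[2] - c[3])); // 0
        System.out.println("isAlphabetic + isDigit - isLetterOrDigit = " + (c[1] + c[2] - c[3])); // 1
    }
}
```

With isLetter() the difference is always zero, because isLetterOrDigit() is defined in terms of isLetter() and isDigit() and no character is both a letter and a digit.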

Number Of Chars In..CSCI_1136_Lab6.docx File : 5923
Number Of Alph+Numeric Chars In..CSCI_1136_Lab6.docx File : 1758
Number Of Alphabet Chars In..CSCI_1136_Lab6.docx File : 1550
Number Of Numeric Chars In..CSCI_1136_Lab6.docx File : 208
Number Of Words In..CSCI_1136_Lab6.docx File : 66
Number Of Lines In..CSCI_1136_Lab6.docx File : 48
0
