Character counts do not add up in Java
I am writing a Java program that goes through a file and provides counts of characters. The problem I'm having is that my counts aren't adding up. When I add the counts from isAlphabetic(char c) and isDigit(char c), they do not equal the count from isLetterOrDigit(char c) (please forgive me if I'm using the wrong terminology).
What am I missing? Here is a copy of my code so far.
            for (String word : words) {
                char[] ch = word.toCharArray();
                for (int i = 0; i < word.length(); i++) {
                    if (Character.isBmpCodePoint(ch[i])) {
                        charCount++;
                        if (Character.isLetterOrDigit(ch[i])) {
                            alphnumCount++;
                        }
                        if (Character.isAlphabetic(ch[i])) {
                            alphabetCount++;
                        }
                        if (Character.isDigit(ch[i])) {
                            numericCount++;
                        }
                    }
                }
            }
            // Reading next line into currentLine
            currentLine = reader.readLine();
        }
        // Printing charCount, wordCount and lineCount
        System.out.println("Number Of Chars In..Lab.docx File : " + charCount);
        System.out.println("Number Of Alph+Numeric Chars In..Lab.docx File : " + alphnumCount);
        System.out.println("Number Of Alphabet Chars In..Lab.docx File : " + alphabetCount);
        System.out.println("Number Of Numeric Chars In..Lab.docx File : " + numericCount);
        System.out.println("Number Of Words In..Lab.docx File : " + wordCount);
        System.out.println("Number Of Lines In..Lab.docx File : " + lineCount);
        System.out.println(alphabetCount + numericCount - alphnumCount);
        reader.close(); // Closing the reader
    }
}
I think the problem here is that you are reading a *.docx file.
If I use a simple text file with the following content, your code works.
Test123
7asdf
The output is:
Number Of Chars In..CSCI_1136_Lab6.docx File : 12
Number Of Alph+Numeric Chars In..CSCI_1136_Lab6.docx File : 12
Number Of Alphabet Chars In..CSCI_1136_Lab6.docx File : 8
Number Of Numeric Chars In..CSCI_1136_Lab6.docx File : 4
Number Of Words In..CSCI_1136_Lab6.docx File : 2
Number Of Lines In..CSCI_1136_Lab6.docx File : 2
0
If you want to count the characters in a *.docx file, that is not possible this way, because you are interpreting the bytes of that file as a String, which they are not.
DOCX is written in an XML format, which consists of a ZIP archive file containing XML and binaries.
From forensicswiki.
So *.docx files are not stored as plain text, which is what your code expects.
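Since a *.docx file is a ZIP archive, one way to get at its text is to open the archive and read the XML part that holds the document body. The following is only a rough sketch under stated assumptions: it assumes the body lives in the entry word/document.xml (the conventional location in files written by Word), it removes tags with a crude regular expression instead of a real XML parser, and it does not decode XML entities such as &amp;. The class name DocxText and its method names are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class DocxText {

    // Opens the .docx as a ZIP archive and returns the text of its main XML part.
    // Assumption: the body is stored in the entry "word/document.xml".
    static String extractText(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                String xml = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
                return stripTags(xml);
            }
        }
    }

    // Crude tag removal: each closing paragraph tag becomes a line break,
    // every other tag is dropped. XML entities are left undecoded.
    static String stripTags(String xml) {
        return xml.replace("</w:p>", "\n").replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxText path/to/file.docx
        System.out.print(extractText(args[0]));
    }
}
```

The string returned by extractText could then be fed to the counting loop in place of the lines read directly from the file. For anything beyond a quick count, a library that understands the format would be the safer choice.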
Another point is that you are using Character.isAlphabetic() instead of Character.isLetter():
From the docs for Character.isAlphabetic():
Determines if the specified character (Unicode code point) is an alphabet.
A character is considered to be alphabetic if its general category type, provided by getType(codePoint), is any of the following:
- UPPERCASE_LETTER
- LOWERCASE_LETTER
- TITLECASE_LETTER
- MODIFIER_LETTER
- OTHER_LETTER
- LETTER_NUMBER
or it has contributory property Other_Alphabetic as defined by the Unicode Standard.
From the docs for Character.isLetter():
Determines if the specified character (Unicode code point) is a letter.
A character is considered to be a letter if its general category type, provided by Character.getType(ch), is any of the following:
- UPPERCASE_LETTER
- LOWERCASE_LETTER
- TITLECASE_LETTER
- MODIFIER_LETTER
- OTHER_LETTER
Not all letters have case. Many characters are letters but are neither uppercase nor lowercase nor titlecase.
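To make the two definitions concrete, here is a small sketch that classifies three sample characters. The class name AlphabeticVsLetter and the choice of U+2167 (ROMAN NUMERAL EIGHT, general category LETTER_NUMBER) are mine, picked only for illustration; any character in LETTER_NUMBER or with the Other_Alphabetic property should behave similarly.

```java
class AlphabeticVsLetter {

    // Formats the result of the four Character checks for one code point.
    static String describe(int codePoint) {
        return String.format(
                "U+%04X alphabetic=%b letter=%b digit=%b letterOrDigit=%b",
                codePoint,
                Character.isAlphabetic(codePoint),
                Character.isLetter(codePoint),
                Character.isDigit(codePoint),
                Character.isLetterOrDigit(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe('a'));    // an ordinary lowercase letter
        System.out.println(describe('7'));    // a decimal digit
        System.out.println(describe(0x2167)); // ROMAN NUMERAL EIGHT (LETTER_NUMBER)
    }
}
```

For U+2167, isAlphabetic() returns true while isLetter(), isDigit() and isLetterOrDigit() all return false, so such a character is counted as alphabetic but not as letter-or-digit.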
So there is a difference between the two methods. The method Character.isLetterOrDigit() uses isLetter() and isDigit():
Determines if the specified character (Unicode code point) is a letter or digit.
A character is considered to be a letter or digit if either isLetter(codePoint) or isDigit(codePoint) returns true for the character.
From the docs for Character.isLetterOrDigit().
So if you use Character.isLetter() instead of Character.isAlphabetic(), your result should be correct.
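As a sketch of what the corrected counting could look like (not the original program), the following helper counts letters, digits, letter-or-digit characters and alphabetic characters in a string. The class name CharCounts and the array layout of the result are hypothetical. It also walks the text by code point rather than by char, which goes beyond the original loop and is an assumption on my part.

```java
class CharCounts {

    // Returns {letters, digits, lettersOrDigits, alphabetic} for the given text.
    static int[] count(String text) {
        int letters = 0;
        int digits = 0;
        int lettersOrDigits = 0;
        int alphabetic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) letters++;
            if (Character.isDigit(cp)) digits++;
            if (Character.isLetterOrDigit(cp)) lettersOrDigits++;
            if (Character.isAlphabetic(cp)) alphabetic++;
            i += Character.charCount(cp); // advance past surrogate pairs as one unit
        }
        return new int[] {letters, digits, lettersOrDigits, alphabetic};
    }

    public static void main(String[] args) {
        // Same content as the two-line test file, plus one LETTER_NUMBER character.
        int[] c = count("Test123\n7asdf \u2167");
        System.out.println("letters + digits - lettersOrDigits = " + (c[0] + c[1] - c[2]));
        System.out.println("alphabetic + digits - lettersOrDigits = " + (c[3] + c[1] - c[2]));
    }
}
```

With isLetter() the first difference is always 0, because isLetterOrDigit() is defined in terms of isLetter() and isDigit() and no character is both. With isAlphabetic() the second difference is 1 for this input, which is the kind of mismatch described in the question.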
This is my result for a *.docx file using Character.isLetter():
Number Of Chars In..CSCI_1136_Lab6.docx File : 5923
Number Of Alph+Numeric Chars In..CSCI_1136_Lab6.docx File : 1758
Number Of Alphabet Chars In..CSCI_1136_Lab6.docx File : 1550
Number Of Numeric Chars In..CSCI_1136_Lab6.docx File : 208
Number Of Words In..CSCI_1136_Lab6.docx File : 66
Number Of Lines In..CSCI_1136_Lab6.docx File : 48
0