Strange behaviour of String.length()
I have a class with main:
import java.io.IOException;
import java.util.List;

public class Main {
    // args[0] is the path to the file with the first and last words
    // args[1] is the path to the file with the dictionary
    public static void main(String[] args) {
        try {
            List<String> firstLastWords = FileParser.getWords(args[0]);
            System.out.println(firstLastWords);
            System.out.println(firstLastWords.get(0).length());
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
and I have FileParser:
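import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FileParser {
    public FileParser() {
    }

    final static Charset ENCODING = StandardCharsets.UTF_8;

    public static List<String> getWords(String filePath) throws IOException {
        List<String> list = new ArrayList<String>();
        Path path = Paths.get(filePath);
        // try-with-resources closes the reader, so no explicit close() is needed
        try (BufferedReader reader = Files.newBufferedReader(path, ENCODING)) {
            String line = null;
            while ((line = reader.readLine()) != null) {
                // remove all whitespace from the line
                String line1 = line.replaceAll("\\s+", "");
                if (!line1.isEmpty()) {
                    list.add(line1);
                }
            }
        }
        return list;
    }
}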
args[0] is the path to a txt file with just 2 words. So if the file contains:
тор
кит
the program returns:
[тор, кит]
4
If the file contains:
т
тор
кит
the program returns:
[т, тор, кит]
2
Even if the file contains:
//jump to next line
тор
кит
the program returns:
[, тор, кит]
1
where the digit is the length of the first string in the list.
So the question is: why does it count one extra character?
Thanks all.
This symbol, as @Bill said, is the BOM ( http://en.wikipedia.org/wiki/Byte_order_mark ), which sits at the beginning of a text file. I found it with this line:
System.out.println(((int)firstLastWords.get(0).charAt(0)));
It gave me 65279, which is U+FEFF.
Then I just changed this line:
String line1 = line.replaceAll("\\s+","");
to this:
String line1 = line.replaceAll("\uFEFF","");
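Note that the replacement above no longer strips whitespace. As a rough sketch of one way to keep both behaviours, the code below removes a BOM only from the start of the first line and then applies the original whitespace cleanup. The class name `BomStripSketch` and its helper methods are hypothetical names invented for illustration, not part of the original code. It also shows that Java's UTF-8 decoding keeps the BOM bytes EF BB BF as the character U+FEFF, which is where the extra length comes from.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class BomStripSketch {
    private static final char BOM = '\uFEFF';

    // Removes a single leading BOM if present; everything else is left untouched.
    static String stripLeadingBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    // Mirrors the loop in the question: drop whitespace, skip lines that end up empty.
    // The BOM check is applied to the first line only.
    static List<String> cleanLines(List<String> rawLines) {
        List<String> words = new ArrayList<>();
        boolean first = true;
        for (String line : rawLines) {
            if (first) {
                line = stripLeadingBom(line);
                first = false;
            }
            String word = line.replaceAll("\\s+", "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // UTF-8 BOM bytes followed by the Cyrillic word "тор" (written with escapes).
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] word = "\u0442\u043E\u0440".getBytes(StandardCharsets.UTF_8);
        byte[] raw = new byte[bom.length + word.length];
        System.arraycopy(bom, 0, raw, 0, bom.length);
        System.arraycopy(word, 0, raw, bom.length, word.length);

        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println(decoded.length());                  // BOM is counted as a char
        System.out.println(stripLeadingBom(decoded).length()); // BOM removed
    }
}
```

In `getWords`, the same idea would amount to calling something like `stripLeadingBom` on the first line returned by `readLine()` before the whitespace replacement.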
Cyrillic characters are difficult to capture using regex; e.g. \\p{Graph} does not work (in Java it covers only ASCII by default), although they are clearly visible characters. Anyway, that is beside the OP's question.
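As a small illustration of the point about \\p{Graph}, the sketch below compares the default POSIX-style class with two alternatives that `java.util.regex` offers: the `Pattern.UNICODE_CHARACTER_CLASS` flag and the script property `\p{IsCyrillic}`. The class name `CyrillicRegexSketch` is a made-up name for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class CyrillicRegexSketch {
    // Default behaviour: POSIX classes such as \p{Graph} are ASCII-only.
    static final Pattern ASCII_GRAPH = Pattern.compile("\\p{Graph}+");

    // With this flag, POSIX classes follow their Unicode definitions.
    static final Pattern UNICODE_GRAPH =
            Pattern.compile("\\p{Graph}+", Pattern.UNICODE_CHARACTER_CLASS);

    // Matches characters whose Unicode script is Cyrillic.
    static final Pattern CYRILLIC = Pattern.compile("\\p{IsCyrillic}+");

    public static void main(String[] args) {
        String word = "\u0442\u043E\u0440"; // the Cyrillic word "тор"
        System.out.println(ASCII_GRAPH.matcher(word).matches());
        System.out.println(UNICODE_GRAPH.matcher(word).matches());
        System.out.println(CYRILLIC.matcher(word).matches());
    }
}
```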
The actual problem is likely due to other non-visible characters, probably control characters. Try the following regex to remove more:
replaceAll("(\\s|\\p{Cntrl})+","")
You can play around with the regex to extend it to other cases.
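One caveat when extending the regex: in Java, \\p{Cntrl} covers only ASCII control characters by default, while U+FEFF belongs to the Unicode "format" category (Cf), so the pattern above would leave a BOM in place. The sketch below, using the hypothetical class name `InvisibleCharSketch`, adds \\p{Cf} to the character class. Bear in mind that \\p{Cf} also removes other format characters such as zero-width joiners, which may or may not be wanted.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class InvisibleCharSketch {
    // Whitespace, ASCII control characters, and Unicode format characters (Cf),
    // the category that contains U+FEFF.
    private static final Pattern INVISIBLE = Pattern.compile("[\\s\\p{Cntrl}\\p{Cf}]+");

    // Extended cleanup that also removes a BOM.
    static String clean(String s) {
        return INVISIBLE.matcher(s).replaceAll("");
    }

    // The pattern suggested in the answer, kept for comparison.
    static String cleanWithoutCf(String s) {
        return s.replaceAll("(\\s|\\p{Cntrl})+", "");
    }

    public static void main(String[] args) {
        String line = "\uFEFF \u0442\u043E\u0440\t"; // BOM, space, "тор", tab
        System.out.println(cleanWithoutCf(line).length()); // BOM still counted
        System.out.println(clean(line).length());          // BOM removed as well
    }
}
```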