
Strange behaviour of String.length()

I have a class with a main method:

import java.io.IOException;
import java.util.List;

public class Main {

    // args[0] is the path to the file with the first and last words
    // args[1] is the path to the dictionary file
    public static void main(String[] args) {
        try {
            List<String> firstLastWords = FileParser.getWords(args[0]);
            System.out.println(firstLastWords);
            // Prints the length of the first word read from the file
            System.out.println(firstLastWords.get(0).length());
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

and I have a FileParser class:

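import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FileParser {

    static final Charset ENCODING = StandardCharsets.UTF_8;

    public FileParser() {
    }

    // Reads the file line by line and returns every non-empty line
    // with all whitespace removed.
    public static List<String> getWords(String filePath) throws IOException {
        List<String> list = new ArrayList<String>();
        Path path = Paths.get(filePath);

        // try-with-resources closes the reader, so no explicit close() is needed
        try (BufferedReader reader = Files.newBufferedReader(path, ENCODING)) {
            String line = null;
            while ((line = reader.readLine()) != null) {
                String line1 = line.replaceAll("\\s+", "");
                if (!line1.isEmpty()) {
                    list.add(line1);
                }
            }
        }
        return list;
    }
}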

args[0] is the path to a text file containing just two words. If the file contains:

тор
кит

the program prints:

[тор, кит]
4

If the file contains:

т
тор
кит

the program prints:

[т, тор, кит]
2


Even if the file contains:
// (empty first line)
тор
кит

the program prints:

[, тор, кит]
1

where the number is the length of the first string in the list.

So the question is: why does it count one extra character?

Thanks all.

As @Bill said, this character is the BOM (byte order mark, http://en.wikipedia.org/wiki/Byte_order_mark), which sits at the beginning of the text file. I found it by printing the code of the first character:

System.out.println(((int)firstLastWords.get(0).charAt(0)));

It printed 65279 (0xFEFF).
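
To make the off-by-one visible without a file, here is a minimal sketch. The class name BomLengthDemo and its helper are made up for illustration, and it assumes the first line starts with U+FEFF, as the 65279 above suggests. It shows that the "\\s+" replacement from the question leaves that character in place, since the BOM is not whitespace.

```java
class BomLengthDemo {
    // U+FEFF is the byte order mark; 0xFEFF is 65279 in decimal.
    static final char BOM = '\uFEFF';

    // Same whitespace stripping as in the question's getWords().
    static String stripWhitespace(String line) {
        return line.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // The three-letter Cyrillic word from the question, written with
        // escapes so the source file stays pure ASCII.
        String word = "\u0442\u043E\u0440";
        // What readLine() hands back for the first line of a BOM-prefixed file.
        String firstLine = BOM + word;

        System.out.println((int) BOM);
        System.out.println(word.length());
        System.out.println(stripWhitespace(firstLine).length());
    }
}
```

Run as is, this should print 65279, 3 and 4: the invisible character survives the whitespace replacement and is counted by length().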

Then I just changed this line:

String line1 = line.replaceAll("\\s+","");

to this:

String line1 = line.replaceAll("\uFEFF","");
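
Replacing the whole pattern drops the whitespace stripping, so a variant that keeps both may be preferable. The following is only a sketch under assumptions: BomAwareParser and readWords are hypothetical names, and the method takes a Reader instead of a file path purely so it can be tried with an in-memory string. It removes a BOM only where one can legitimately appear, at the very start of the first line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class BomAwareParser {
    private static final char BOM = '\uFEFF';

    // Hypothetical variant of getWords(): strips a leading BOM from the
    // first line, then removes whitespace and skips empty lines as before.
    static List<String> readWords(Reader source) throws IOException {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            boolean firstLine = true;
            while ((line = reader.readLine()) != null) {
                if (firstLine) {
                    firstLine = false;
                    if (!line.isEmpty() && line.charAt(0) == BOM) {
                        line = line.substring(1); // drop the BOM only
                    }
                }
                String word = line.replaceAll("\\s+", "");
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // BOM followed by the two Cyrillic words from the question (ASCII escapes).
        String contents = "\uFEFF\u0442\u043E\u0440\n\u043A\u0438\u0442\n";
        List<String> words = readWords(new StringReader(contents));
        System.out.println(words.size());
        System.out.println(words.get(0).length());
    }
}
```

With the sample input this should print 2 and then 3. In the original getWords() the same check could be applied to the first line returned by the reader from Files.newBufferedReader.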

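Cyrillic characters can be awkward to match with regex in Java. For example, \\p{Graph} does not match them, although they are clearly visible characters, because Java's POSIX character classes are ASCII-only unless Pattern.UNICODE_CHARACTER_CLASS is set. Anyway, that is beside the OP's question.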
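
As a small illustration of the point about \\p{Graph}, the sketch below compares three patterns on a Cyrillic word. The class and field names are invented for this example. It relies on java.util.regex treating POSIX classes as US-ASCII by default, on the Pattern.UNICODE_CHARACTER_CLASS flag widening them, and on \\p{L} matching letters of any script.

```java
import java.util.regex.Pattern;

class CyrillicRegexDemo {
    // Default mode: \p{Graph} covers only ASCII alphanumerics and punctuation.
    static final Pattern ASCII_GRAPH = Pattern.compile("\\p{Graph}+");

    // With UNICODE_CHARACTER_CLASS, POSIX classes follow Unicode definitions.
    static final Pattern UNICODE_GRAPH =
            Pattern.compile("\\p{Graph}+", Pattern.UNICODE_CHARACTER_CLASS);

    // \p{L} is the Unicode "letter" category and needs no flag.
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    public static void main(String[] args) {
        String word = "\u0442\u043E\u0440"; // Cyrillic word from the question

        System.out.println(ASCII_GRAPH.matcher(word).matches());
        System.out.println(UNICODE_GRAPH.matcher(word).matches());
        System.out.println(LETTERS.matcher(word).matches());
    }
}
```

This should print false, true and true, so the Cyrillic text is matchable once the pattern is told to use Unicode semantics.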

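The actual problem is likely caused by other non-visible characters, such as control characters. Try the following regex to remove more of them: replaceAll("(\\s|\\p{Cntrl})+",""). Note that by default \\p{Cntrl} covers only ASCII control characters, so it will not catch a BOM (U+FEFF, Unicode category Cf); adding \\p{Cf} to the alternation covers that case. You can extend the regex further for other cases.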
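
A broader cleanup along these lines might look like the sketch below. InvisibleCharCleaner and clean are hypothetical names, and the choice of categories is an assumption about which invisible characters are worth removing: ASCII whitespace, ASCII control characters, and Unicode format characters (category Cf, which includes U+FEFF).

```java
import java.util.regex.Pattern;

class InvisibleCharCleaner {
    // \s      : ASCII whitespace (Java's default meaning)
    // \p{Cntrl}: ASCII control characters
    // \p{Cf}  : Unicode format characters, e.g. U+FEFF (BOM)
    private static final Pattern INVISIBLE =
            Pattern.compile("[\\s\\p{Cntrl}\\p{Cf}]+");

    static String clean(String line) {
        return INVISIBLE.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        // BOM, a space, the Cyrillic word from the question, a control char, a tab.
        String dirty = "\uFEFF \u0442\u043E\u0440\u0001\t";
        System.out.println(dirty.length());
        System.out.println(clean(dirty).length());
    }
}
```

For the sample string this should print 7 and then 3. Whether stripping every format character is appropriate depends on the data, since category Cf also contains characters such as the zero-width joiner that can be meaningful in some scripts.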
