
How to NOT count control characters in a text file

I am having trouble understanding how to NOT count control characters in a text file. My program does everything except skip the control characters \n and \r. After further tries I am closer. If I change:

        while (input.hasNext()) {
          String line = input.nextLine();
          lineCount++;
          wordCount += countWords(line);
          charcount += line.length();
        }

to

        while (input.hasNext()) {
          String line = input.next();
          lineCount++;
          wordCount += countWords(line);
          charcount += line.replace("\n", "").replace("\r", "").length();
        }

the characters are counted, but the line count is wrong. If I switch back to input.nextLine(), the character count is wrong again. Contents of the text file:
cat
sad dog
dog wag

import java.io.*;
import java.util.*;

public class Character_count {

    public static void main(String args[]) throws Exception {

        java.io.File file = new java.io.File("textFile.txt");

        // Create a Scanner for the file
        Scanner input = new Scanner(file);
        int charcount = 0;
        int wordCount = 0;
        int lineCount = 0;

        while (input.hasNext()) {
            String line = input.nextLine();
            lineCount++;
            wordCount += countWords(line);
            charcount += line.length();
        }

        System.out.println("The file " + file + " has ");
        System.out.println(charcount + " characters");
        System.out.println(wordCount + " words");
        System.out.println(lineCount + " lines");
    }

    // Count the whitespace-separated words in a single line
    private static int countWords(String s) {
        Scanner input = new Scanner(s);
        int count = 0;

        while (input.hasNext()) {
            input.next();
            count++;
        }
        return count;
    }
}

You can achieve that with your Scanner by using the useDelimiter method:

Scanner input = new Scanner(new File("textFile.txt"));
input.useDelimiter("\r\n");

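Then continue with your code as usual, calling input.next() to read each line. This should work as long as the file uses \r\n (Windows-style) line endings.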
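The delimiter approach can be sketched as follows. This is only an illustration: the class name `DelimiterDemo` and the helper `tokens` are made up for the example, the text is read from a String instead of a file, and the delimiter is widened to the regex `\r?\n` (an assumption, not what the answer wrote) so that Unix line endings are split as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {

    // Hypothetical helper: splits text into tokens at line breaks.
    // "\\r?\\n" is an assumed variation of the "\r\n" delimiter that
    // accepts both Windows (\r\n) and Unix (\n) line endings.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner input = new Scanner(text)) {
            input.useDelimiter("\\r?\\n");
            // hasNext() is paired with next(), so each token is one line
            while (input.hasNext()) {
                result.add(input.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String token : tokens("cat\r\nsad dog\r\ndog wag\r\n")) {
            System.out.println(token + " -> " + token.length());
        }
    }
}
```

With this delimiter each token is a whole line without its line terminator, so token.length() does not include \r or \n (it does still include the spaces inside the line).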

Also (and this is very important): if you check hasNext(), then use next(); if you check hasNextLine(), use nextLine(). Don't mix and match, as it will cause (or is already causing) issues down the line.
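A minimal sketch of the pairing rule, reading from a String so it runs on its own. The class name `PairedScannerCalls` and both helper methods are hypothetical names chosen for the example.

```java
import java.util.Scanner;

public class PairedScannerCalls {

    // Counts lines: hasNextLine() is paired with nextLine().
    static int countLines(String text) {
        int lines = 0;
        try (Scanner input = new Scanner(text)) {
            while (input.hasNextLine()) {
                input.nextLine();
                lines++;
            }
        }
        return lines;
    }

    // Counts whitespace-separated tokens: hasNext() is paired with next().
    static int countTokens(String text) {
        int tokens = 0;
        try (Scanner input = new Scanner(text)) {
            while (input.hasNext()) {
                input.next();
                tokens++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "cat\nsad dog\ndog wag\n";
        System.out.println(countLines(text) + " lines");
        System.out.println(countTokens(text) + " tokens");
    }
}
```

One place the difference shows up is a blank line: it is still a line for hasNextLine(), but it contains no token, so hasNext() skips over it.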

You could replace every line break (\n, optionally preceded by \r) with an empty String like this:

line = line.replaceAll("\\r?\\n", "");

Now you can do the counts, and they will not take any \n or \r\n into account.

Alternatively, without using a regex:

line = line.replace("\n", "").replace("\r", "");
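To compare the two replacements, here is a small sketch. The class name `StripLineBreaks` and the method names are invented for the example; the two expressions themselves are the ones given in this answer.

```java
public class StripLineBreaks {

    // Regex version: removes \n and \r\n, but leaves a lone \r in place.
    static String viaRegex(String s) {
        return s.replaceAll("\\r?\\n", "");
    }

    // Plain version: removes every \n and every \r independently.
    static String viaReplace(String s) {
        return s.replace("\n", "").replace("\r", "");
    }

    public static void main(String[] args) {
        String line = "sad dog\r\n";
        System.out.println(viaRegex(line).length());
        System.out.println(viaReplace(line).length());
    }
}
```

For ordinary \n or \r\n line endings the two give the same result. They differ only when a \r appears without a following \n, which the regex version does not remove.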

You should use \s in the regular expression; it represents whitespace characters.

"\s stands for "whitespace character". Again, which characters this actually includes, depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n\f]. That is: \s matches a space, a tab, a line break, or a form feed." ( http://www.regular-expressions.info/shorthand.html )

Here is how you can use it:

    // Needs: java.util.Scanner, java.util.regex.Matcher, java.util.regex.Pattern
    // `path` is a java.nio.file.Path pointing at the text file.
    Scanner scanner = new Scanner(path.toFile(), "UTF-8");
    // "\\A" matches the start of input, so next() returns the whole file
    String content = scanner.useDelimiter("\\A").next();
    System.out.println(content);

    // Count lines: one more than the number of line breaks
    // (this assumes the file does not end with a trailing line break)
    Pattern patternLine = Pattern.compile("\\r?\\n");
    Matcher matcherLine = patternLine.matcher(content);
    int numberLines = 1;
    while (matcherLine.find())
        numberLines++;

    // Remove all whitespace, including line breaks
    Pattern pattern = Pattern.compile("\\s");
    Matcher matcherEliminateWhiteSpace = pattern.matcher(content);
    String contentWithoutWhiteSpace = matcherEliminateWhiteSpace.replaceAll("");

    // Counts only ASCII word characters: a-z, A-Z, 0-9 and '_' (underscore)
    Pattern patternCharachter = Pattern.compile("\\w");
    Matcher matcherCharachterAscii = patternCharachter.matcher(contentWithoutWhiteSpace);
    int numberCharachtersAscii = 0;
    while (matcherCharachterAscii.find())
        numberCharachtersAscii++;

    // Counts every remaining character, whatever the script
    // (e.g. français, عربي), including punctuation
    Pattern patternUniversal = Pattern.compile(".");
    Matcher matcherUniversal = patternUniversal.matcher(contentWithoutWhiteSpace);
    int numberUniversalCharachter = 0;
    while (matcherUniversal.find())
        numberUniversalCharachter++;

    System.out.println("******************************************************");
    System.out.println(contentWithoutWhiteSpace);
    System.out.println(numberLines);
    System.out.println(numberCharachtersAscii);
    System.out.println(numberUniversalCharachter);
EDIT

Here is a simple modification of your loop that should make it work:

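        while (scanner.hasNextLine()) {
          String line = scanner.nextLine();
          lineCount++;
          wordCount += countWords(line);
          // Strip all whitespace from the line before counting its characters
          charcount += line.replaceAll("\\s", "").length();
          System.out.println(charcount); // running total, for debugging
        }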

\s (written "\\s" in a Java string literal) matches whitespace characters: tab, carriage return, line feed, space and form feed.
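Putting the pieces together, a self-contained sketch of the whole counter might look like this. It assumes that "characters" means everything except whitespace, and it reads from a String rather than textFile.txt so it can be run directly. The class name `TextStats` and the `count` method are hypothetical; `countWords` mirrors the helper from the question.

```java
import java.util.Scanner;

public class TextStats {

    // Counts whitespace-separated words in one line, like countWords in the question.
    static int countWords(String line) {
        int count = 0;
        try (Scanner words = new Scanner(line)) {
            while (words.hasNext()) {
                words.next();
                count++;
            }
        }
        return count;
    }

    // Returns {characters, words, lines}; the character count excludes all whitespace.
    static int[] count(String text) {
        int charCount = 0;
        int wordCount = 0;
        int lineCount = 0;
        try (Scanner input = new Scanner(text)) {
            while (input.hasNextLine()) {
                String line = input.nextLine(); // line terminator already removed
                lineCount++;
                wordCount += countWords(line);
                charCount += line.replaceAll("\\s", "").length();
            }
        }
        return new int[] {charCount, wordCount, lineCount};
    }

    public static void main(String[] args) {
        int[] stats = count("cat\nsad dog\ndog wag\n");
        System.out.println(stats[0] + " characters");
        System.out.println(stats[1] + " words");
        System.out.println(stats[2] + " lines");
    }
}
```

For the sample file from the question this reports 15 characters, 5 words and 3 lines. To read a real file, the String passed to the outer Scanner would be replaced by a File, as in the original program.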
