How to NOT count control characters in a text file

I am having trouble understanding how to NOT count control characters in a text file. My program does everything except skip the control characters \n and \r. After some further tries I am closer. If I change:

        while (input.hasNext()) {
            String line = input.nextLine();
            lineCount++;
            wordCount += countWords(line);
            charcount += line.length();
        }

to

        while (input.hasNext()) {
            String line = input.next();
            lineCount++;
            wordCount += countWords(line);
            charcount += line.replace("\n", "").replace("\r", "").length();
        }

the characters are counted, but the line count is wrong. If I put input.nextLine() back, the character count is wrong instead. Contents of the text file:
cat
sad dog
dog wag

import java.io.File;
import java.util.Scanner;

public class Character_count {

    public static void main(String[] args) throws Exception {

        File file = new File("textFile.txt");

        // Create a Scanner for the file
        Scanner input = new Scanner(file);
        int charcount = 0;
        int wordCount = 0;
        int lineCount = 0;

        while (input.hasNext()) {
            String line = input.nextLine();
            lineCount++;
            wordCount += countWords(line);
            charcount += line.length();
        }

        System.out.println("The file " + file + " has ");
        System.out.println(charcount + " characters");
        System.out.println(wordCount + " words");
        System.out.println(lineCount + " lines");
    }

    // Counts the whitespace-separated tokens in a single line
    private static int countWords(String s) {
        Scanner input = new Scanner(s);
        int count = 0;

        while (input.hasNext()) {
            input.next();
            count++;
        }
        return count;
    }
}

You can achieve that with your Scanner by using the useDelimiter method:

Scanner input = new Scanner(new File("textFile.txt"));
// The delimiter is a regex; "\\r?\\n" matches both Windows (\r\n) and Unix (\n) line endings
input.useDelimiter("\\r?\\n");

Then continue with your code as usual; it should work.

Also (and very important): if you check hasNext(), then use next(), and if you check hasNextLine(), use nextLine()! Don't mix and match, as it will cause (or is already causing) issues down the line.
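To make the pairing rule concrete, here is a minimal sketch that counts characters both ways: once with hasNextLine()/nextLine(), and once with a line-break delimiter plus hasNext()/next(). The class name ScannerPairing and both method names are made up for this example, and it reads from a String instead of a file so it can be run as-is. The delimiter "\\R" (any line break) assumes Java 8 or later.

```java
import java.util.Scanner;

class ScannerPairing {

    // hasNextLine() paired with nextLine(): each returned line
    // already excludes its line terminator.
    static int countWithNextLine(String text) {
        int chars = 0;
        try (Scanner in = new Scanner(text)) {
            while (in.hasNextLine()) {
                chars += in.nextLine().length();
            }
        }
        return chars;
    }

    // hasNext() paired with next(): with a line-break delimiter,
    // each token is one line without its terminator.
    static int countWithDelimiter(String text) {
        int chars = 0;
        try (Scanner in = new Scanner(text).useDelimiter("\\R")) {
            while (in.hasNext()) {
                chars += in.next().length();
            }
        }
        return chars;
    }

    public static void main(String[] args) {
        // Same contents as the sample file, with Windows line endings
        String text = "cat\r\nsad dog\r\ndog wag\r\n";
        System.out.println(countWithNextLine(text));  // 17
        System.out.println(countWithDelimiter(text)); // 17
    }
}
```

Both variants count the spaces inside a line but not the \r or \n at its end.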

You could replace all the \n and \r with an empty String like this:

// The character class also removes a lone \r, which "\\r?\\n" would leave behind
line = line.replaceAll("[\\r\\n]", "");

Now you can do the counts, and they will not take any \n or \r into account.

Alternatively, you could do this (without using regex):

line = line.replace("\n", "").replace("\r", "");
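As a quick check, the sketch below applies a regex-based and a literal replacement to a string with mixed line endings and compares the resulting lengths. The class StripLineBreaks and its method names are hypothetical, and the regex variant here uses the character class [\r\n] so that a lone \r is removed too.

```java
class StripLineBreaks {

    // Regex version: removes every carriage return and line feed.
    static String withRegex(String s) {
        return s.replaceAll("[\\r\\n]", "");
    }

    // Literal version: String.replace does no regex processing.
    static String withoutRegex(String s) {
        return s.replace("\n", "").replace("\r", "");
    }

    public static void main(String[] args) {
        // Mixed Windows (\r\n) and Unix (\n) line endings
        String content = "cat\r\nsad dog\ndog wag\r\n";
        System.out.println(content.length());               // 22
        System.out.println(withRegex(content).length());    // 17
        System.out.println(withoutRegex(content).length()); // 17
    }
}
```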

Hello, you should use '\s' in the regular expression; it represents whitespace.

\s stands for "whitespace character". Again, which characters this actually includes depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n\f]. That is: \s matches a space, a tab, a line break, or a form feed. (http://www.regular-expressions.info/shorthand.html)

So here is how you use it:

    // Assumes `path` is a java.nio.file.Path for the text file, and that
    // java.util.Scanner and java.util.regex.* are imported.
    Scanner scanner = new Scanner(path.toFile(), "UTF-8");
    // "\\A" matches only the start of input, so next() returns the whole file
    String content = scanner.useDelimiter("\\A").next();
    scanner.close();
    System.out.println(content);

    // Lines: one more than the number of line breaks
    // (so a trailing line break adds an extra line)
    Pattern patternLine = Pattern.compile("\\r?\\n");
    Matcher matcherLine = patternLine.matcher(content);
    int numberLines = 1;
    while (matcherLine.find())
        numberLines++;

    // Remove all whitespace: spaces, tabs, line breaks, form feeds
    Pattern pattern = Pattern.compile("\\s");
    Matcher matcherEliminateWhiteSpace = pattern.matcher(content);
    String contentWithoutWhiteSpace = matcherEliminateWhiteSpace.replaceAll("");

    // Counts only ASCII word characters: a-z, A-Z, 0-9 and '_' (underscore)
    Pattern patternCharacter = Pattern.compile("\\w");
    Matcher matcherCharacterAscii = patternCharacter.matcher(contentWithoutWhiteSpace);
    int numberCharactersAscii = 0;
    while (matcherCharacterAscii.find())
        numberCharactersAscii++;

    // Counts every remaining character, whatever the script
    // (e.g. français, عربي), including punctuation
    Pattern patternUniversal = Pattern.compile(".");
    Matcher matcherUniversal = patternUniversal.matcher(contentWithoutWhiteSpace);
    int numberUniversalCharacters = 0;
    while (matcherUniversal.find())
        numberUniversalCharacters++;

    System.out.println("******************************************************");
    System.out.println(contentWithoutWhiteSpace);
    System.out.println(numberLines);
    System.out.println(numberCharactersAscii);
    System.out.println(numberUniversalCharacters);
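The difference between the two counts printed above is easier to see on a small input. The following sketch (the class MatchCounter and its countMatches helper are illustrative names, not part of the answer) counts regex matches in a string with no whitespace. Without flags, \w matches only [a-zA-Z_0-9], while . matches any character except a line terminator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchCounter {

    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "français,ok!" written with a Unicode escape for the c-cedilla
        String sample = "fran\u00e7ais,ok!";
        System.out.println(countMatches("\\w", sample)); // 9: c-cedilla, ',' and '!' are skipped
        System.out.println(countMatches(".", sample));   // 12: every character
    }
}
```

If accented letters should count as word characters, compiling the pattern with Pattern.UNICODE_CHARACTER_CLASS is one option.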
  • EDIT

Here is a simple modification that will make it work:

        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            lineCount++;
            wordCount += countWords(line);
            // Strip all whitespace before counting characters
            charcount += line.replaceAll("\\s", "").length();
        }

\\s (the Java string literal for the regex \s) stands for whitespace: tab, carriage return, line feed, space, and form feed.
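Because \s also matches ordinary spaces and tabs, a count taken after removing \s excludes the spaces between words as well. The sketch below (hypothetical class WhitespaceVsLineBreaks) compares the two choices on the sample file's contents, so you can pick whichever definition of "character" you need.

```java
class WhitespaceVsLineBreaks {

    // Counts everything except \r and \n (spaces still count).
    static int countWithoutLineBreaks(String s) {
        return s.replaceAll("[\\r\\n]", "").length();
    }

    // Counts everything except whitespace (spaces and tabs are dropped too).
    static int countWithoutWhitespace(String s) {
        return s.replaceAll("\\s", "").length();
    }

    public static void main(String[] args) {
        String content = "cat\r\nsad dog\r\ndog wag\r\n";
        System.out.println(countWithoutLineBreaks(content)); // 17
        System.out.println(countWithoutWhitespace(content)); // 15
    }
}
```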
