
Counting words from a text file in Java

I'm writing a program that scans a text file and counts the number of words in it. The assignment defines a word as follows: 'A word is a non-empty string consisting only of letters (a, ..., z, A, ..., Z), surrounded by blanks, punctuation, hyphenation, line start, or line end.'

I'm a novice at Java programming. So far I've managed to write this instance method, which I would expect to work, but it doesn't.

public int wordCount() {
    int countWord = 0;
    String line = "";
    try {
        File file = new File("testtext01.txt");
        Scanner input = new Scanner(file);

        while (input.hasNext()) {
            line = line + input.next()+" ";
            input.next();
        }
        input.close();
        String[] tokens = line.split("[^a-zA-Z]+");
        for (int i=0; i<tokens.length; i++){
            countWord++;
        }
        return countWord;

    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return -1;
}

Quoting from "Counting words in text file?":

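    int wordCount = 0;

    // Read the file one line at a time.
    while (input.hasNextLine()) {

       String nextLine = input.nextLine();
       // A second Scanner splits the line into whitespace-separated tokens.
       Scanner word = new Scanner(nextLine);

       while (word.hasNext()) {
          wordCount++;
          word.next();
       }
       word.close();
    }
    input.close();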

The only usable word separators in your file are spaces and hyphens. You can use a regex with the split() method.

int num_words = line.split("[\\s\\-]").length; //stores number of words
System.out.print("Number of words in file is "+num_words);

REGEX (Regular Expression):

\\s splits the String at white spaces/line breaks and \\- at hyphens. So wherever there is a space, line break or hyphen, the sentence will be split. The words extracted are copied into and returned as an array whose length is the number of words in your file.
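One caveat with split(): it can return empty strings when two separators are adjacent or when the text starts with a separator, so the raw array length may overcount. Below is a minimal sketch that filters those out. The class name SplitCountSketch and the helper countBySplit are hypothetical names chosen for illustration, and the sketch does not check that each token consists only of letters.

```java
import java.util.Arrays;

public class SplitCountSketch {

    // Hypothetical helper: counts non-empty tokens separated by
    // runs of whitespace and/or hyphens.
    public static int countBySplit(String text) {
        return (int) Arrays.stream(text.split("[\\s\\-]+"))
                .filter(token -> !token.isEmpty()) // drop the empty leading token, if any
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countBySplit("well-known  fact")); // prints 3
        System.out.println(countBySplit(" leading space"));   // prints 2
    }
}
```

The + after the character class collapses consecutive separators into one, and the filter handles the empty first element that split() produces when the input begins with a separator.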

You can use a Java regular expression.
See http://docs.oracle.com/javase/tutorial/essential/regex/groups.html to learn about groups.



    public int wordCount() {
        // A word is a run of one or more ASCII letters.
        String patternToMatch = "([a-zA-Z]+)";
        int countWord = 0;
        try {
            Pattern pattern = Pattern.compile(patternToMatch);
            File file = new File("abc.txt");
            Scanner sc = new Scanner(file);
            while (sc.hasNextLine()) {
                Matcher matcher = pattern.matcher(sc.nextLine());
                // Each successful find() is one word.
                while (matcher.find()) {
                    countWord++;
                }
            }
            sc.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Returns -1 when no words were counted (including on error).
        return countWord > 0 ? countWord : -1;
    }
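
When writing the character class, the case of the last letter matters. In ASCII, the characters [, \, ], ^, _ and the backtick sit between Z and a, so a range written A-z also matches them. The following small sketch compares the two ranges. The class name RangeCheckSketch and the method countMatches are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RangeCheckSketch {

    // Counts how many times the regex matches in the text.
    public static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "snake_case name";
        // Letters only: snake, case, name
        System.out.println(countMatches("[a-zA-Z]+", text)); // prints 3
        // A-z also accepts the underscore: snake_case, name
        System.out.println(countMatches("[a-zA-z]+", text)); // prints 2
    }
}
```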

// Reads the file line by line and prints the total word count.
void run(String path) throws Exception
{
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8")))
    {
        int result = 0;

        while (true)
        {
            String line = reader.readLine();

            if (line == null)
            {
                break;
            }

            result += countWords(line);
        }

        System.out.println("Words in text: " + result);
    }
}

final Pattern pattern = Pattern.compile("[A-Za-z]+");

int countWords(String text)
{
    Matcher matcher = pattern.matcher(text);

    int result = 0;

    while (matcher.find())
    {
        ++result;

        System.out.println("Matcher found [" + matcher.group() + "]");
    }

    System.out.println("Words in line: " + result);

    return result;
}
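
To try the matcher-based approach end to end, here is a self-contained sketch that writes a small sample file and counts its words. The class name FileWordCountSketch, the sample text, and the temporary file are assumptions made for illustration. It treats any run of ASCII letters as a word, which is one reading of the definition in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileWordCountSketch {

    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // Counts letter-only runs in a UTF-8 text file, one line at a time.
    public static int countWords(Path path) throws IOException {
        int total = 0;
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher matcher = WORD.matcher(line);
                while (matcher.find()) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample input written to a temporary file.
        Path sample = Files.createTempFile("words", ".txt");
        Files.write(sample, Arrays.asList("It's a well-known fact.", "", "Line 3 ends here"));
        // It, s, a, well, known, fact + Line, ends, here
        System.out.println(countWords(sample)); // prints 9
        Files.delete(sample);
    }
}
```

Note that under this reading the apostrophe in "It's" splits it into two words and the digit 3 is skipped. Whether that matches the assignment's intent depends on how strictly "surrounded by blanks, punctuation, hyphenation" is interpreted.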
