More efficient or more modern? Reading in & Sorting A Text File With Java

I've been trying to upgrade my Java skills to use more of Java 5 and Java 6, and I've been working through some programming exercises. I was asked to read in a paragraph from a text file and output a sorted (descending) list of words along with the count of each word.

My code is below.

My questions are:

  1. Is my file input routine the most respectful of JVM resources?

  2. Is it possible to cut steps out of reading the file contents and getting them into a collection that can produce a sorted list of words?

  3. Am I using the Collection classes and interfaces in the most efficient way I can?

Thanks much for any opinions. I'm just trying to have some fun and improve my programming skills.

import java.io.*;
import java.util.*;

public class Sort
{
    public static void main(String[] args)
    {
        String   sUnsorted       = null;
        String[] saSplit         = null;

        int iCurrentWordCount    = 1;
        String currentword       = null;
        String pastword          = "";

        // Read the text file into a string
        sUnsorted = readIn("input1.txt");

        // Parse the String by white space into String array of single words
        saSplit   = sUnsorted.split("\\s+");

        // Sort the String array in descending order
        java.util.Arrays.sort(saSplit, Collections.reverseOrder());


        // Count the occurrences of each word in the String array
        for (int i = 0; i < saSplit.length; i++ )
        {

            currentword = saSplit[i];

            // If this word was seen before, increase the count & print the
            // word to stdout
            if ( currentword.equals(pastword) )
            {
                iCurrentWordCount ++;
                System.out.println(currentword);
            }
            // Output the count of the LAST word to stdout,
            // Reset our counter
            else if (!currentword.equals(pastword))
            {

                if ( !pastword.equals("") )
                {

                    System.out.println("Word Count for " + pastword + ": " + iCurrentWordCount);

                }


                System.out.println(currentword );
                iCurrentWordCount = 1;

            }

            pastword = currentword;  
        }// end for loop

       // Print out the count for the last word processed
       System.out.println("Word Count for " + currentword + ": " + iCurrentWordCount);



    }// end function main()


    // Read The Input File Into A String      
    public static String readIn(String infile)
    {
        String result = " ";

        try
        {
            FileInputStream file = new FileInputStream (infile);
            DataInputStream in   = new DataInputStream (file);
            byte[] b             = new byte[ in.available() ];

            in.readFully (b);
            in.close ();

            result = new String (b, 0, b.length, "US-ASCII");

        }
        catch ( Exception e )
        {
            e.printStackTrace();
        }

        return result;
    }// end function readIn()

}// end class Sort()

/////////////////////////////////////////////////
//  Updated Copy 1, Based On The Useful Comments
//////////////////////////////////////////////////

import java.io.*;
import java.util.*;

public class Sort2
{
    public static void main(String[] args) throws Exception
    {
        // Scanner will tokenize on white space, like we need
        Scanner scanner               = new Scanner(new FileInputStream("input1.txt"));
        ArrayList <String> wordlist   = new  ArrayList<String>();
        String currentword            = null;   
        String pastword               = null;
        int iCurrentWordCount         = 1;       

        while (scanner.hasNext())
            wordlist.add(scanner.next() );

        // Sort in descending natural order
        Collections.sort(wordlist);
        Collections.reverse(wordlist);

        for ( String temp : wordlist )
        {
            currentword = temp;

            // If this word was seen before, increase the count & print the
            // word to stdout
            if ( currentword.equals(pastword) )
            {
                iCurrentWordCount ++;
                System.out.println(currentword);
            }
            // Output the count of the LAST word to stdout,
            // Reset our counter
            else //if (!currentword.equals(pastword))
            {
                if ( pastword != null )
                    System.out.println("Count for " + pastword + ": " +
                                                            iCurrentWordCount);

                System.out.println(currentword );
                iCurrentWordCount = 1;    
            }

            pastword = currentword;  
        }// end for loop

        System.out.println("Count for " + currentword + ": " + iCurrentWordCount);

    }// end function main()


}// end class Sort2
  1. There are more idiomatic ways of reading all the words in a file in Java. BreakIterator is a better way of reading words from an input.

  2. Use List<String> instead of an array in almost all cases. Arrays aren't technically part of the Collections API, and you can't swap implementations as easily as you can with List, Set and Map.

  3. You should use a Map<String,AtomicInteger> to do your word counting instead of walking the array over and over. Unlike Integer, AtomicInteger is mutable, so you can just call incrementAndGet() in a single operation that also happens to be thread safe. A SortedMap implementation would give you the words in order with their counts as well.

  4. Make as many variables final as possible, even local ones, and declare them right before you use them, not at the top where their intended scope gets lost.

  5. You should almost always use a BufferedReader or BufferedInputStream, with a buffer size that is a multiple of your disk block size, when doing disk IO.

That said, don't concern yourself with micro-optimizations until you have "correct" behavior.
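A minimal sketch of points 3 and 5 above, assuming words are simply whitespace-separated tokens. The class name AtomicWordCount and its count method are made up for illustration, and the reverse-ordered TreeMap is just one SortedMap choice that happens to give the descending order the question asks for.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;
import java.util.Scanner;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper class, not part of the original question.
final class AtomicWordCount {

    // Counts whitespace-separated words; keys iterate in descending natural order.
    static SortedMap<String, AtomicInteger> count(final Reader source) {
        final SortedMap<String, AtomicInteger> counts =
                new TreeMap<String, AtomicInteger>(Collections.<String>reverseOrder());
        final Scanner scanner = new Scanner(new BufferedReader(source));
        try {
            while (scanner.hasNext()) {
                final String word = scanner.next();
                AtomicInteger counter = counts.get(word);
                if (counter == null) {
                    // First sighting of this word: start its counter at zero.
                    counter = new AtomicInteger();
                    counts.put(word, counter);
                }
                counter.incrementAndGet();
            }
        } finally {
            scanner.close();
        }
        return counts;
    }

    public static void main(final String[] args) {
        final SortedMap<String, AtomicInteger> counts =
                count(new StringReader("pear apple pear fig"));
        for (final Map.Entry<String, AtomicInteger> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

To read from the question's file instead of a string, you could pass in something like new FileReader("input1.txt"), assuming the platform default encoding is acceptable.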

  • the SortedMap type might be efficient enough memory-wise to use here in the form SortedMap<String,Integer> (especially if the word counts are likely to be under 128, the range for which autoboxing reuses cached Integer instances)
  • you can provide custom delimiters to the Scanner type for breaking up streams
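As a rough illustration of both bullets, here is a sketch that gives Scanner a custom delimiter and counts into a SortedMap<String,Integer>. The delimiter pattern (split on anything that is not a letter, digit, or apostrophe) and the class name DelimitedWordCount are my own assumptions; adjust the pattern to whatever "word" means for your input.

```java
import java.util.Scanner;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper class for illustration only.
final class DelimitedWordCount {

    // Assumed definition of a delimiter: any run of characters that are
    // not letters, decimal digits, or apostrophes.
    private static final String NON_WORD = "[^\\p{L}\\p{Nd}']+";

    static SortedMap<String, Integer> count(final String text) {
        final SortedMap<String, Integer> counts = new TreeMap<String, Integer>();
        final Scanner scanner = new Scanner(text).useDelimiter(NON_WORD);
        try {
            while (scanner.hasNext()) {
                final String word = scanner.next();
                final Integer previous = counts.get(word);
                counts.put(word, previous == null ? 1 : previous + 1);
            }
        } finally {
            scanner.close();
        }
        return counts;
    }

    public static void main(final String[] args) {
        System.out.println(count("fig, apple; fig. (pear)"));
    }
}
```

This TreeMap sorts ascending; passing Collections.reverseOrder() to its constructor would give the descending order from the question.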

Depending on how you want to treat the data, you might also want to strip punctuation or go for more advanced word isolation with a break iterator - see the java.text package or the ICU project.
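For the break iterator route, something along these lines should work with java.text.BreakIterator. This is only a sketch: the class name BreakIteratorWords is invented, Locale.ENGLISH is an assumption about the input, and keeping only segments that start with a letter or digit is one simple way to drop the punctuation and whitespace segments.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration only.
final class BreakIteratorWords {

    static List<String> words(final String text) {
        final List<String> words = new ArrayList<String>();
        final BreakIterator boundary = BreakIterator.getWordInstance(Locale.ENGLISH);
        boundary.setText(text);
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE;
                start = end, end = boundary.next()) {
            final String segment = text.substring(start, end);
            // Segments also include spaces and punctuation; keep only real words.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(final String[] args) {
        System.out.println(words("Hello, world! It works."));
    }
}
```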

Also, I recommend declaring variables where you first assign them, and dropping the unneeded null assignments.


To elaborate, you can count words in a map like this:

void increment(Map<String, Integer> wordCountMap, String word) {
  Integer count = wordCountMap.get(word);
  wordCountMap.put(word, count == null ? 1 : ++count);
}

Due to the immutability of Integer and the behaviour of autoboxing, this might result in excessive object instantiation for large data sets. An alternative would be (as others suggest) to use a mutable int wrapper (of which AtomicInteger is one form).

Can you use Guava for your homework assignment? Multiset handles the counting. Specifically, LinkedHashMultiset might be useful.

Some other things you might find interesting:

To read the file you could use a BufferedReader (if it's text only).

This:

for (int i = 0; i < saSplit.length; i++ ){
    currentword = saSplit[i];
    [...]
}

Could be done using an enhanced for loop (Java's for-each).

if ( currentword.equals(pastword) ){
    [...]
} else if (!currentword.equals(pastword)) {
    [...]
}

In your case, you can simply use a single else so the condition isn't checked again (because if the words aren't the same, they can only be different).

if ( !pastword.equals("") )

I think using length() is faster here:

if (pastword.length() != 0)

Input method:

Make it easier on yourself and deal directly with characters instead of bytes. For example, you could use a FileReader and possibly wrap it inside a BufferedReader. At the least, I'd suggest looking at InputStreamReader, as the conversion from bytes to characters is already done for you. My preference would be using Scanner.

I would prefer returning null or throwing an exception from your readIn() method. Exceptions should not be used for flow control, but here you're sending an important message back to the caller: the file that you provided was not valid. Which brings me to another point: consider whether you truly want to catch all exceptions, or just ones of certain types. You'll have to handle all checked exceptions, but you may want to handle them differently.
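To make that concrete, here is one way readIn() could look if it worked with characters and let the exception propagate. Treat it as a sketch: the TextFiles class and the readAll helper are names I made up, and the 8192-character buffer is an arbitrary choice. The US-ASCII charset and the input1.txt file name come from the question.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper class for illustration only.
final class TextFiles {

    // Reads all characters from the given reader and closes it afterwards.
    static String readAll(final Reader source) throws IOException {
        final BufferedReader reader = new BufferedReader(source);
        final StringBuilder text = new StringBuilder();
        final char[] buffer = new char[8192]; // arbitrary buffer size
        try {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    // Lets the caller decide what a missing or unreadable file means.
    static String readIn(final String infile) throws IOException {
        return readAll(new InputStreamReader(new FileInputStream(infile), "US-ASCII"));
    }

    public static void main(final String[] args) {
        try {
            System.out.println(readIn("input1.txt"));
        } catch (FileNotFoundException e) {
            // A missing file is handled differently from other IO failures.
            System.err.println("No such file: " + e.getMessage());
        } catch (IOException e) {
            System.err.println("Could not read the file: " + e.getMessage());
        }
    }
}
```

The caller now sees the failure instead of silently getting a one-space string back, and it can treat a missing file differently from a read error.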

Collections:

You're not really using the Collections classes; you're using an array. Your implementation seems fine, but...

There are certainly many ways of handling this problem. Your method, sorting and then comparing each word to the previous one, is O(n log n) on average. That's certainly not bad. Look at using a Map implementation (such as HashMap) to store the data you need while traversing the text only once, in O(n) (HashMap's get() and put(), and presumably containsKey(), are O(1) on average).
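Roughly what I have in mind, as a sketch (the HashMapWordCount class and its method names are invented for the example). Counting is a single pass over the words; only the distinct words are sorted afterwards, so the sort costs O(k log k) for k distinct words rather than O(n log n) for all n.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only.
final class HashMapWordCount {

    // One pass over the words; each get()/put() is O(1) on average.
    static Map<String, Integer> count(final String[] words) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (final String word : words) {
            final Integer previous = counts.get(word);
            counts.put(word, previous == null ? 1 : previous + 1);
        }
        return counts;
    }

    // Sorts only the distinct words (descending) and formats one line per word.
    static List<String> report(final Map<String, Integer> counts) {
        final List<String> words = new ArrayList<String>(counts.keySet());
        Collections.sort(words, Collections.reverseOrder());
        final List<String> lines = new ArrayList<String>();
        for (final String word : words) {
            lines.add(word + ": " + counts.get(word));
        }
        return lines;
    }

    public static void main(final String[] args) {
        for (final String line : report(count("b a b c".split("\\s+")))) {
            System.out.println(line);
        }
    }
}
```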
