
What data structure or design pattern can I use to resolve this issue

I have the following design issue that I hope you can help me resolve. Below is a simplified version of the code:

class DataProcessor{
    public List<Record> processData(DataFile file){ 
        List<Record> recordsList = new ArrayList<Record>();
        for(Line line : file.getLines()){
            String processedData = processData(line.toString());
            recordsList.add(new Record(processedData));
        }
        return recordsList;
    }

    private String processData(String rawLine){
        //code to process line
        return rawLine;
    }
}
class DatabaseManager{
    public DatabaseManager(String dbFilePath){
        //code to open the database file
    }

    public void saveRecords(List<Record> recordsList){
        //code to insert record objects in the database
    }
}
class Manager{
    public static void main(String[] args){

        DatabaseManager dbManager = new DatabaseManager("e:\\databasefile.db");
        DataFile dataFile = new DataFile("e:\\hugeRawFile.csv");
        DataProcessor dataProcessor = new DataProcessor();
        dbManager.saveRecords(dataProcessor.processData(dataFile));
    }
}

As you can see, the "processData" method of the "DataProcessor" class takes a DataFile object, processes the whole file, creates a Record object for each line, and then returns a list of "Record" objects.

My problem with the "processData" method: when the raw file is really huge, the list of Record objects takes a lot of memory and the program sometimes fails. I need to change the current design so that memory usage is minimized. "DataProcessor" should not have direct access to "DatabaseManager". I was thinking of passing a queue to the "processData" method: one thread runs "processData" and inserts Record objects into the queue, while another thread removes Record objects from the queue and inserts them into the database. But I'm not sure about the performance implications of this.
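For reference, the producer–consumer idea described above can be sketched with a bounded `BlockingQueue`; a minimal, hypothetical version follows, where a plain `String` stands in for `Record` and a poison-pill sentinel (an assumption, not part of the original code) signals end of input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSketch {
    // Sentinel marking end of the stream (hypothetical; compared by reference)
    private static final String POISON_PILL = new String("EOF");

    public static List<String> run(List<String> rawLines) throws InterruptedException {
        // A bounded queue caps memory: the producer blocks when the consumer falls behind
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        List<String> saved = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (String line : rawLines) {
                    queue.put("processed:" + line); // stands in for processData(line)
                }
                queue.put(POISON_PILL);             // tell the consumer we are done
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String record = queue.take();
                    if (record == POISON_PILL) break; // reference equality on the sentinel
                    saved.add(record);                // stands in for saving to the database
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return saved;
    }
}
```

Note that for a plain file on disk, the answer below argues this threading buys nothing; the sketch only shows that the queue idea itself is workable and memory-bounded.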

Put the responsibility of driving the process into the most constrained resource (in your case, the DataProcessor); this makes sure the constraints are obeyed rather than forced to the breaking point.

Note: don't even think of multithreading; it is not going to do you any good for processing files. Threads are a solution when your data comes over the wire, when you don't know when your next data chunk is going to arrive and perhaps you have better things to do with your CPU time than to wait "until the cows come home to roost" (grin). But with files? You know the job has a start and an end, so get on with it as fast as possible.

class DataProcessor{
    public List<Record> processData(DataFile file){ 
        List<Record> recordsList = new ArrayList<Record>();
        for(Line line : file.getLines()){
            String processedData = processData(line.toString());
            recordsList.add(new Record(processedData));
        }
        return recordsList;
    }

    private String processData(String rawLine){
        //code to process line
        return rawLine;
    }

    public void processAndSaveData(DataFile dataFile, DatabaseManager db) {
      int maxBuffSize=1024;
      ArrayList<Record> buff=new ArrayList<Record>(maxBuffSize);
      for(Line line : dataFile.getLines()){
        String processedData = processData(line.toString());
        buff.add(new Record(processedData));
        if(buff.size()==maxBuffSize) {
          db.saveRecords(buff);
          buff.clear();
        }
      }
      // some may still be unsaved here, fewer than maxBuffSize
      if(buff.size()>0) {
        db.saveRecords(buff);
        // help the GC, let it recycle the records
        // without needing to ask "is buff still reachable?"
        buff.clear();
      }
    }
}

class Manager{
    public static void main(String[] args){

      DatabaseManager dbManager = new DatabaseManager("e:\\databasefile.db");
      DataFile dataFile = new DataFile("e:\\hugeRawFile.csv");
      DataProcessor dataProcessor = new DataProcessor();

      // So... do we need another stupid manager to tell us what to do?
      // dbManager.saveRecords(dataProcessor.processData(dataFile));

      // Hell, no, the most constrained resource knows better
      // how to deal with the job!
      dataProcessor.processAndSaveData(dataFile, dbManager);
    }
}

[edit] Addressing the "but we already settled on what and how, and now you're telling us we need to write extra code?" objection:

Build an AbstractProcessor class and ask your mates just to derive from it.

abstract class AbstractProcessor {
  // sorry, needs to be protected so subclasses can override it
  abstract protected Record processData(String rawLine);

  abstract protected Class<? extends Record> getRecordClass();

  public void processAndSaveData(DataFile dataFile, DatabaseManager db) {
    Class<? extends Record> recordType=this.getRecordClass();
    if(recordType.equals(MyRecord1.class)) {
      // buffered read and save MyRecord1 types specifically
    }
    else if(recordType.equals(YourRecord.class)) {
      // buffered read and save YourRecord types specifically
    }
    // etc...
  }
}

Now, all they need to do is extend AbstractProcessor, make their processData(String) protected, and write a trivial method declaring its record type (which may as well be an enum). It's not a huge effort to ask of them, and it turns what would have been a costly (or even impossible, for a TB input file) operation into an as-fast-as-possible one.
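To make the derive-from-it workflow concrete, here is a minimal, self-contained sketch of what a teammate's subclass could look like. `Record`, `MyRecord1`, and the simplified driver loop are hypothetical stand-ins (the real driver would flush batches to the DatabaseManager as in the answer above):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for the types in the answer (assumptions)
class Record {
    final String payload;
    Record(String payload) { this.payload = payload; }
}

class MyRecord1 extends Record {
    MyRecord1(String payload) { super(payload); }
}

abstract class AbstractProcessor {
    abstract protected Record processData(String rawLine);

    abstract protected Class<? extends Record> getRecordClass();

    // Simplified driver: processes lines into records; a real version would
    // save each batch to the database and clear it, as shown earlier.
    public List<Record> processAll(List<String> rawLines) {
        List<Record> batch = new ArrayList<>();
        for (String line : rawLines) {
            batch.add(processData(line));
        }
        return batch;
    }
}

// What a teammate writes: just the two trivial overrides
class MyProcessor extends AbstractProcessor {
    @Override
    protected Record processData(String rawLine) {
        return new MyRecord1(rawLine.trim().toUpperCase());
    }

    @Override
    protected Class<? extends Record> getRecordClass() {
        return MyRecord1.class;
    }
}
```

The base class owns the memory-conscious driving loop; subclasses only supply per-line logic, so the batching discipline cannot be bypassed by a careless caller.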

You should be able to use streaming to do this in one thread, one record at a time in memory. The implementation depends on the technology your DatabaseManager is using.
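A minimal sketch of that one-record-at-a-time streaming, assuming the database layer exposes a per-record save (the `RecordSink` interface here is hypothetical; the real DatabaseManager API may differ). `Files.lines` reads the file lazily, so only the current line is held in memory:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class StreamingSketch {
    // Hypothetical per-record sink standing in for the DatabaseManager
    interface RecordSink {
        void saveRecord(String record);
    }

    public static void processAndSave(Path csvFile, RecordSink db) throws IOException {
        // Files.lines streams lazily: one line at a time, never the whole file
        try (Stream<String> lines = Files.lines(csvFile)) {
            lines.map(line -> "processed:" + line) // stands in for processData(line)
                 .forEach(db::saveRecord);
        }
    }
}
```

The try-with-resources block matters: `Files.lines` holds the file open, and closing the stream releases it even if processing throws.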
