
Strange behaviour in console program

I have to write a small console program for a developer internship interview, and something big and very hard to find is going wrong. I'm supposed to write a program that checks a directory full of binary .dat files for duplicate files.

What I did: I read a directory path from stdin in main.cpp, and if the directory exists I pass it on to my FileChecker, which generates MD5 hashes for the files in the given directory and stores them in a QHash with the file names as keys and the hashes as values. I then try to iterate over the QHash using a Java-style iterator. When I run the program it crashes completely and I have to choose between debugging and ending the program, which makes it impossible for me to figure out what's going wrong, as Qt's debugger doesn't output anything.

My guess is that something is going wrong in my getDuplicateList function in fileChecker.cpp, as I've never used Java-style iterators to iterate over a QHash before. I'm trying to take the first key-value pair and store it in two variables. Then I remove that pair from the QHash and try to iterate over the remainder of the QHash using an iterator inside the previous iterator. If anyone has any idea what I'm doing wrong, please let me know ASAP, as I need to have this done before Monday to get an interview. The code for main.cpp, fileChecker.h and fileChecker.cpp is below; please let me know if there's anything more I can add. Thanks

my code:

main.cpp:

#include "filechecker.h"
#include <QDir>
#include <QTextStream>
#include <QString>
#include <QStringList>

QTextStream in(stdin);
QTextStream out(stdout);

int main() {
    QDir* dir;
    FileChecker checker;
    QString dirPath;
    QStringList duplicateList;

    out << "Please enter directory path NOTE:  use / as directory separator regardless of operating system" << endl;
    dirPath = in.readLine();


    dir->setPath(dirPath);
    if(dir->exists()) {
        checker.processDirectory(dir);
        duplicateList = checker.getDuplicateList();
    }
    else if(!(dir->exists()))
        out << "Directory does not exist" << endl;

    foreach(QString str, duplicateList){
        out << str << endl;
    }

    return 0;
}

fileChecker.h:

#ifndef FILECHECKER_H
#define FILECHECKER_H
#include <QString>
#include <QByteArray>
#include <QHash>
#include <QCryptographicHash>
#include <QStringList>
#include <QDir>

class FileChecker
{
public:
    FileChecker();
    void processDirectory(QDir* dir);
    QByteArray generateChecksum(QFile* file);
    QStringList getDuplicateList();
private:
    QByteArray generateChecksum(QString fileName);
    QHash<QString, QByteArray> m_hash;
};

#endif // FILECHECKER_H



fileChecker.cpp:

#include "filechecker.h"

FileChecker::FileChecker() {
}

void FileChecker::processDirectory(QDir* dir) {

    dir->setFilter(QDir::Files);
    QStringList fileList = dir->entryList();


    for (int i = 0; i < fileList.size(); i++) {
        bool possibleDuplicatesFound = false;
        QString testName = fileList.at((i));
        QFile* testFile;
        testFile->setFileName(testName);


        foreach(QString s, fileList) {
            QFile* possibleDuplicate;

            possibleDuplicate->setFileName(s);
            if(testFile->size() == possibleDuplicate->size() && testFile->fileName() != possibleDuplicate->fileName()) {
                QByteArray md5HashPd = generateChecksum(possibleDuplicate);
                m_hash.insert(possibleDuplicate->fileName(), md5HashPd);
                possibleDuplicatesFound = true;
                fileList.replaceInStrings(possibleDuplicate->fileName(), "");
            }
            QByteArray md5Hasht = generateChecksum(testFile);
            fileList.replaceInStrings(testFile->fileName(), "");
            possibleDuplicatesFound = false;
        }

    }
}


QByteArray FileChecker::generateChecksum(QFile* file) {

    if(file->open(QIODevice::ReadOnly)) {
        QCryptographicHash cHash(QCryptographicHash::Md5);
        cHash.addData(file->readAll());
        QByteArray checksum = cHash.result();
        return checksum;
    }
}

QStringList FileChecker::getDuplicateList() {
    QStringList tempList;
    QString tempStr;
    QString currentKey;
    QByteArray currentValue;
    QMutableHashIterator<QString, QByteArray> i(m_hash);
    do {
    while (i.hasNext()){
        i.next();
        currentKey = i.key();
        currentValue = i.value();
        tempStr.append("%1 ").arg(currentKey);

        if (i.value() == currentValue) {
                tempStr.append("and %1").arg(i.key());
                i.remove();
            }
        tempList.append(tempStr);
        tempStr.clear();
    }
    } while (m_hash.size() > 0);

    return tempList;
}

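Aside from your sad Qt memory management problem (the QDir* in main.cpp and the QFile* pointers in processDirectory are declared but never initialized, and are then dereferenced), you really don't have to calculate MD5 sums of all files.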

Just for groups of files of equal size :)

Files with a unique size can be left out. I wouldn't even call this an optimization but simply not doing a potentially absurd amount of unnecessary extra work :)
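The size-first filtering described above could look roughly like the sketch below. It uses standard-library containers instead of Qt ones so that it stands alone; `FileEntry` and `sameSizeGroups` are hypothetical names, and the (name, size) pairs are assumed to have been collected from the directory beforehand.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical input type: (file name, size in bytes).
using FileEntry = std::pair<std::string, std::uintmax_t>;

// Returns only the groups of files that share their size with at least one
// other file. Files with a unique size cannot have a duplicate, so they are
// dropped before any hashing takes place.
std::vector<std::vector<std::string>> sameSizeGroups(const std::vector<FileEntry>& files) {
    // size -> names of all files with that size, in input order
    std::map<std::uintmax_t, std::vector<std::string>> bySize;
    for (const auto& entry : files) {
        bySize[entry.second].push_back(entry.first);
    }

    std::vector<std::vector<std::string>> groups;
    for (auto& sizeAndNames : bySize) {
        if (sizeAndNames.second.size() > 1) {
            groups.push_back(std::move(sizeAndNames.second));
        }
    }
    return groups;
}
```

Only the files inside the returned groups would then need a checksum, and two files only need comparing when they sit in the same group.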

All Qt Java-style iterators come in "regular" (const) and mutable versions (where it is safe to modify the object you are iterating). See QMutableHashIterator. You're modifying a const iterator; thus, it crashes.

While you're at it, look at the findNext function the iterator provides. Using this function eliminates the need for your second iterator.

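Just call next() on each iterator before reading key() or value(), and remove entries through the iterator rather than through m_hash while an iterator is active, as follows.

while (!m_hash.isEmpty()) {
    // Take any remaining entry as the reference file and remove it.
    QHash<QString, QByteArray>::iterator first = m_hash.begin();
    currentKey = first.key();
    currentValue = first.value();
    m_hash.erase(first);
    tempStr.append(currentKey);

    // A mutable iterator allows removing entries during iteration.
    QMutableHashIterator<QString, QByteArray> j(m_hash);
    while (j.hasNext()) {
        j.next(); // must be called before key() or value()
        if (j.value() == currentValue) {
            // arg() returns a new string, so format first, then append.
            tempStr.append(QString(" and %1").arg(j.key()));
            j.remove();
        }
    }
    tempList.append(tempStr);
    tempStr.clear();
}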
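An alternative that avoids removing entries during iteration altogether is to invert the name-to-checksum map into a checksum-to-names map and report every checksum that has more than one name. The sketch below is only an illustration: it uses standard-library containers rather than QHash, the helper name `duplicateLines` is made up, and, unlike the loop above, it skips files that have no duplicate.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: takes a name -> checksum map and returns one
// "a and b and c" line per set of files that share a checksum.
std::vector<std::string> duplicateLines(
        const std::unordered_map<std::string, std::string>& checksums) {
    // Invert the map: checksum -> names of all files with that checksum.
    std::map<std::string, std::vector<std::string>> byChecksum;
    for (const auto& entry : checksums) {
        byChecksum[entry.second].push_back(entry.first);
    }

    std::vector<std::string> lines;
    for (auto& group : byChecksum) {
        std::vector<std::string>& names = group.second;
        if (names.size() < 2) continue;        // unique checksum, no duplicate
        std::sort(names.begin(), names.end()); // deterministic output order
        std::string line = names.front();
        for (std::size_t k = 1; k < names.size(); ++k) {
            line += " and " + names[k];
        }
        lines.push_back(line);
    }
    return lines;
}
```

Because nothing is erased while iterating, there is no iterator invalidation to worry about, at the cost of building a second map.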

Some things that stand out:

  1. It's a bad idea to readAll the file: it will allocate a file-sized block on the heap, only to calculate its hash and discard it. That's very wasteful. Instead, leverage the QCryptographicHash::addData(QIODevice*) : it will stream the data from the file, only keeping a small chunk in memory at any given time.

  2. You're explicitly keeping an extra copy of the entry list of a folder. This is likely unnecessary. Internally, the QDirIterator will use platform-specific ways of iterating a directory, without obtaining a copy of the entry list. Only the OS has the full list, the iterator only iterates it. You still need to hold the size,path->hash map of course.

  3. You're using Java-style iterators. These are quite verbose. The C++ standard-style iterators are supported by many containers, so you could easily substitute other containers from e.g. the C++ standard library or Boost to tweak performance/memory use.

  4. You're not doing enough error checking.

  5. The code seems overly verbose for the little it's actually doing. The encapsulation of everything into a class is probably also a Java habit, and rather unnecessary here.

Let's see what might be the most to-the-point, reasonably performant way of doing it. I'm skipping the UI niceties: you can either call it with no arguments to check in the current directory, or with arguments, the first of which will be used as the path to check in.

The auto & dupe = entries[size][hash.result()]; is a powerful expression. It will construct the potentially missing entries in both the outer and the inner map.

// https://github.com/KubaO/stackoverflown/tree/master/questions/dupechecker-37557870
#include <QtCore>
#include <cstdio>
QTextStream out(stdout);
QTextStream err(stderr);

int check(const QString & path) {
   int unique = 0;
   //   size         hash        path
   QMap<qint64, QMap<QByteArray, QString>> entries;
   QDirIterator it(path, QDirIterator::Subdirectories | QDirIterator::FollowSymlinks);
   QCryptographicHash hash{QCryptographicHash::Sha256};
   while (it.hasNext()) {
      it.next();
      auto const info = it.fileInfo();
      if (info.isDir()) continue;
      auto const path = info.absoluteFilePath();
      auto const size = info.size();
      if (size == 0) continue; // all zero-sized files are "duplicates" but let's ignore them

      QFile file(path); // RAII class, no need to explicitly close
      if (!file.open(QIODevice::ReadOnly)) {
         err << "Can't open " << path << endl;
         continue;
      }
      hash.reset();
      hash.addData(&file);
      if (file.error() != QFile::NoError) {
         err << "Error reading " << path << endl;
         continue;
      }
      auto & dupe = entries[size][hash.result()];
      if (! dupe.isNull()) {
         // duplicate
         out << path << " is a duplicate of " << dupe << endl;
      } else {
         dupe = path;
         ++ unique;
      }
   }
   return unique;
}

int main(int argc, char ** argv) {
   QCoreApplication app{argc, argv};
   QDir dir;
   if (argc == 2)
      dir = app.arguments().at(1);
   auto unique = check(dir.absolutePath());
   out << "Finished. There were " << unique << " unique files." << endl;
}
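The `entries[size][hash.result()]` idiom used above works the same way with std::map, whose operator[] also default-constructs missing entries. The sketch below isolates just that step, without Qt; `Entries` and `recordOrFindDuplicate` are hypothetical names, and the digest is passed in as a plain string.

```cpp
#include <cstdint>
#include <map>
#include <string>

// size -> (digest -> path of the first file seen with that size and digest)
using Entries = std::map<std::uint64_t, std::map<std::string, std::string>>;

// Returns the path of an earlier file with the same size and digest, or an
// empty string if this file is the first one (in which case it is recorded).
std::string recordOrFindDuplicate(Entries& entries, std::uint64_t size,
                                  const std::string& digest, const std::string& path) {
    // operator[] creates the inner map and an empty string if they are missing.
    std::string& first = entries[size][digest];
    if (!first.empty()) {
        return first; // duplicate of an earlier file
    }
    first = path;     // first file with this size and digest
    return std::string();
}
```

One lookup both tests for an earlier file and reserves the slot for the current one, which is why the loop body in check() stays so short.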
