Reading and comparing two large files

Question

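I want to read and compare all the lines of two files. To explain: for every password hash in my test.txt file, I want to find the same hash in the password.txt file. The problem is that it has to be fast enough (I'd say at most 45 minutes for 10M in password.txt and 1M in test.txt).

I currently have this code: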

private static void bufferedReaderFilePasswordFirst() {
    Path path = Paths.get("C:\\Users\\basil\\OneDrive - Haute Ecole Bruxelles Brabant (HE2B)\\Documents\\NetBeansProjects\\sha256\\passwords.txt");
    Path pathUser = Paths.get("C:\\Users\\basil\\OneDrive - Haute Ecole Bruxelles Brabant (HE2B)\\Documents\\NetBeansProjects\\sha256\\test.txt");
    int nbOfLine = 0;
    StringBuffer oui = new StringBuffer();

    try (BufferedReader readerPasswordGenerate = Files.newBufferedReader(path, Charset.forName("UTF-8"));) {

        String currentLineUser = null;
        String currentLinePassword = null;

        long start = System.nanoTime();

        while (((currentLinePassword = readerPasswordGenerate.readLine()) != null)) {
            BufferedReader readerPasswordUser = Files.newBufferedReader(pathUser, Charset.forName("UTF-8"));
            while ((currentLineUser = readerPasswordUser.readLine()) != null) {
                String firstWord = currentLinePassword.substring(0, currentLinePassword.indexOf(":"));
                if ((firstWord.charAt(0) == currentLineUser.charAt(0)) 
                    && (firstWord.charAt(14) == currentLineUser.charAt(14)) 
                    && (firstWord.charAt(31) == currentLineUser.charAt(31)) 
                    && (firstWord.charAt(63) == currentLineUser.charAt(63))
                ) {
                    if (firstWord.equals(currentLineUser)) {
                        String secondWord = currentLinePassword.substring(currentLinePassword.lastIndexOf(":") + 1);

                        oui.append(secondWord).append(System.lineSeparator());
                    }
                }
            }
            if (nbOfLine % 300 == 0) {
                System.out.println("We are at the " + nbOfLine);
                final long consumed = System.nanoTime() - start;
                final long totConsumed = TimeUnit.NANOSECONDS.toMillis(consumed);
                final double tot = (double) totConsumed;
                System.out.printf("Not done. Took %s seconds", (tot / 1000));
                System.out.println(oui + " oui");
            }
            nbOfLine++;
        }
        System.out.println(oui);
        final long consumed = System.nanoTime() - start;
        final long totConsumed = TimeUnit.NANOSECONDS.toMillis(consumed);
        final double tot = (double) totConsumed;
        System.out.printf("Done. Took %s seconds", (tot / 1000));
    } catch (IOException ex) {
        ex.printStackTrace(); //handle an exception here
    }
}

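In this code, I simply compare each element in test.txt to check whether the corresponding element in the password hashes is the same.

Every entry in password.txt has the form hash:password, and test.txt contains only: hash

Thanks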

Answer 1

"In this code, I simply compare each element in test.txt to check whether the corresponding element in the password hashes is the same."

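If you're familiar with Big-O notation, you may recognize that this means your algorithm runs in O(n^2) time (more precisely O(n × m) for files of n and m lines). In your specific case, for each of the 1,000,000 lines in test.txt you will make 10,000,000 comparisons, for a total of 10,000,000,000,000 comparisons. To meet your goal of running it in 45 minutes (2,700 seconds), you'd need to make about 3.7 billion comparisons per second. For comparison, the i7 in my laptop tops out at 3.9 GHz (3.9 billion cycles per second), and performing one of these comparisons takes more than a single CPU cycle.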
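The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class name `ComparisonBudget` and the helper `requiredRate` are made up for illustration, and the line counts are the ones given in the question. It also prints the figure for the "one operation per line" approach, for contrast.

```java
public class ComparisonBudget {

    // Operations per second needed to finish `operations` within `minutes`.
    static double requiredRate(long operations, long minutes) {
        return operations / (minutes * 60.0);
    }

    public static void main(String[] args) {
        long passwordLines = 10_000_000L; // lines in password.txt (from the question)
        long testLines = 1_000_000L;      // lines in test.txt (from the question)
        long budgetMinutes = 45;

        // Nested loops: every test hash is compared against every password line.
        long nestedLoop = passwordLines * testLines;
        // Hash-based lookup: roughly one operation per line of either file.
        long hashLookup = passwordLines + testLines;

        System.out.println("Nested loops: " + nestedLoop + " comparisons, "
                + requiredRate(nestedLoop, budgetMinutes) + " per second needed");
        System.out.println("Hash lookup:  " + hashLookup + " operations, "
                + requiredRate(hashLookup, budgetMinutes) + " per second needed");
    }
}
```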

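You can reduce the time complexity to O(n) (more precisely O(n + m)) by first reading password.txt into a HashMap (10,000,000 operations). From there, any single check from test.txt takes just one operation (1,000,000 in total), for a grand total of 11,000,000 operations. That means you only need to perform about 4,000 operations per second (a 99.99989% reduction) to finish within 45 minutes, which is far more feasible.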

Here is a rough sketch to illustrate what that looks like:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class PasswordLookup {
    public static void main(String[] args) throws FileNotFoundException {
        // Load all hash/password pairings from password.txt into a HashMap for quick lookups.
        // I like Scanner over BufferedReader for reading files. Use whatever you like.
        Map<String, List<String>> passwords = new HashMap<>();
        try (Scanner readPassword = new Scanner(new File("password.txt"))) {
            while (readPassword.hasNextLine()) {
                // Split on the first ':' only, so a password that contains ':' stays intact
                String[] lineParts = readPassword.nextLine().split(":", 2);
                if (lineParts.length < 2) {
                    continue; // skip blank or malformed lines
                }
                String hash = lineParts[0];
                String password = lineParts[1];

                // If we haven't seen the hash before, create a new list for it,
                // then add the password to the list of all passwords that have this hash
                passwords.computeIfAbsent(hash, k -> new ArrayList<>()).add(password);
            }
        }

        // Perform all the lookups from test.txt
        try (Scanner readTest = new Scanner(new File("test.txt"))) {
            while (readTest.hasNextLine()) {
                String testHash = readTest.nextLine();
                // null when no line of password.txt has this hash
                List<String> matchingPasswords = passwords.get(testHash);
                // Now do whatever you want with the list of associated passwords...
            }
        }
    }
}

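Side notes:

Looking at your code, you seem to have some extra requirements that I haven't accounted for in this snippet (timing, for example). I trust you can figure out how to integrate those extra requirements.
Some of the more academic folks here may take issue with parts of my Big-O description/analysis. I'm sure that if you're interested, their comments on this post will cover the topic in more detail.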
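For the timing requirement mentioned above, one possible way to fold it in is sketched below. Note that this is a variant rather than the exact approach from the answer: it loads the smaller file (test.txt) into a `HashSet` and streams password.txt past it, which keeps about 1M strings in memory instead of 10M. The class name `HashLookup` and the method `findPasswords` are hypothetical, and the code assumes the hash itself never contains ':' so that everything after the first ':' is the password.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeUnit;

public class HashLookup {

    // Returns the password of every line in passwordFile whose hash appears in testFile.
    static List<String> findPasswords(Path passwordFile, Path testFile) throws IOException {
        // Load the smaller file (one hash per line) into a set for constant-time lookups.
        Set<String> wanted = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(testFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    wanted.add(line);
                }
            }
        }

        // Stream the larger file once; each line costs a single set lookup.
        List<String> found = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(passwordFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int sep = line.indexOf(':');
                if (sep < 0) {
                    continue; // skip malformed lines without a separator
                }
                if (wanted.contains(line.substring(0, sep))) {
                    found.add(line.substring(sep + 1));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        List<String> found = findPasswords(Paths.get("password.txt"), Paths.get("test.txt"));
        long millis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        found.forEach(System.out::println);
        System.out.printf("Done. Found %d passwords in %.3f seconds%n", found.size(), millis / 1000.0);
    }
}
```

Unlike the map-based version, this returns passwords in the order they appear in password.txt rather than grouped per test hash, so it fits best when only the set of recovered passwords matters.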

Solution 1 (score 4, accepted, 2021-04-06 19:44:35)
