How can I improve the speed of this code?

Question

I'm trying to import all of the googlebooks-1gram files into a PostgreSQL database. I wrote the following Java code for that:

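import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class ToPostgres {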

    public static void main(String[] args) throws Exception {
        String filePath = "./";
        List<String> files = new ArrayList<String>();
        for (int i =0; i < 10; i++) {
            files.add(filePath+"googlebooks-eng-all-1gram-20090715-"+i+".csv");
        }
        Connection c = null;
        try {
            c = DriverManager.getConnection("jdbc:postgresql://localhost/googlebooks",
                    "postgres", "xxxxxx");
        } catch (SQLException e) {
            e.printStackTrace();
        }

        if (c != null) {
            try {
                PreparedStatement wordInsert = c.prepareStatement(
                    "INSERT INTO words (word) VALUES (?)", Statement.RETURN_GENERATED_KEYS
                );
                PreparedStatement countInsert = c.prepareStatement(
                    "INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) " +
                    "VALUES (?,?,?,?,?)"
                );
                String lastWord = "";
                Long lastId = -1L;
                for (String filename: files) {
                    BufferedReader input =  new BufferedReader(new FileReader(new File(filename)));
                    String line = "";
                    while ((line = input.readLine()) != null) {
                        String[] data = line.split("\t");
                        Long id = -1L;
                        if (lastWord.equals(data[0])) {
                            id = lastId;
                        } else {
                            wordInsert.setString(1, data[0]);
                            wordInsert.executeUpdate();
                            ResultSet resultSet = wordInsert.getGeneratedKeys();
                            if (resultSet != null && resultSet.next()) 
                            {
                                id = resultSet.getLong(1);
                            }
                        }
                        countInsert.setLong(1, id);
                        countInsert.setInt(2, Integer.parseInt(data[1]));
                        countInsert.setInt(3, Integer.parseInt(data[2]));
                        countInsert.setInt(4, Integer.parseInt(data[3]));
                        countInsert.setInt(5, Integer.parseInt(data[4]));
                        countInsert.executeUpdate();
                        lastWord = data[0];
                        lastId = id;
                    }
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }

}

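However, after running it for about 3 hours, it had only put 1,000,000 entries in the wordcounts table. When I checked the number of lines in the entire 1gram dataset, it came to 500,000,000. At that rate, importing everything would take about 62.5 days. I could accept an import of about a week, but 2 months? I think I'm doing something seriously wrong here. (I do have a server that runs 24/7, so I could actually let it run that long, but faster would be nicer XD)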

EDIT: This code is how I solved it:

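import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ToPostgres {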

    public static void main(String[] args) throws Exception {
        String filePath = "./";
        List<String> files = new ArrayList<String>();
        for (int i =0; i < 10; i++) {
            files.add(filePath+"googlebooks-eng-all-1gram-20090715-"+i+".csv");
        }
        Connection c = null;
        try {
            c = DriverManager.getConnection("jdbc:postgresql://localhost/googlebooks",
                    "postgres", "xxxxxx");
        } catch (SQLException e) {
            e.printStackTrace();
        }

        if (c != null) {
            c.setAutoCommit(false);
            try {
                PreparedStatement wordInsert = c.prepareStatement(
                    "INSERT INTO words (id, word) VALUES (?,?)"
                );
                PreparedStatement countInsert = c.prepareStatement(
                    "INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) " +
                    "VALUES (?,?,?,?,?)"
                );
                String lastWord = "";
                Long id = 0L;
                for (String filename: files) {
                    BufferedReader input =  new BufferedReader(new FileReader(new File(filename)));
                    String line = "";
                    int i = 0;
                    while ((line = input.readLine()) != null) {
                        String[] data = line.split("\t");
                        if (!lastWord.equals(data[0])) {
                            id++;
                            wordInsert.setLong(1, id);
                            wordInsert.setString(2, data[0]);
                            wordInsert.executeUpdate();
                        }
                        countInsert.setLong(1, id);
                        countInsert.setInt(2, Integer.parseInt(data[1]));
                        countInsert.setInt(3, Integer.parseInt(data[2]));
                        countInsert.setInt(4, Integer.parseInt(data[3]));
                        countInsert.setInt(5, Integer.parseInt(data[4]));
                        countInsert.executeUpdate();
                        lastWord = data[0];
                        if (i % 10000 == 0) {
                            c.commit();
                        }
                        if (i % 100000 == 0) {
                            System.out.println(i+" mark file "+filename);
                        }
                        i++;
                    }
                    c.commit();
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }

}

I now reach 1.5 million rows in about 15 minutes. That's fast enough for me, thanks all!

Answer 1

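JDBC connections have auto-commit enabled by default, which carries a per-statement overhead. Try disabling it:

c.setAutoCommit(false);

Then commit in batches, something along the lines of: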

long ops = 0;

for(String filename : files) {
    // ...
    while ((line = input.readLine()) != null) {
        // insert some stuff...

        ops ++;

        if(ops % 1000 == 0) {
            c.commit();
        }
    }
}

c.commit();

Answer 2

If your tables have indexes, it might be faster to drop them, insert the data, and recreate the indexes afterwards.
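A minimal JDBC sketch of that drop-and-recreate idea follows. The index name `wordcounts_word_id_idx` and the indexed column are assumptions made for illustration, since the question does not show which indexes exist on the tables.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class IndexToggleSketch {
    // Hypothetical index name; the question does not list the real indexes.
    static final String INDEX_NAME = "wordcounts_word_id_idx";

    static String dropIndexSql() {
        return "DROP INDEX IF EXISTS " + INDEX_NAME;
    }

    static String createIndexSql() {
        // Assumes the index covers wordcounts.word_id.
        return "CREATE INDEX " + INDEX_NAME + " ON wordcounts (word_id)";
    }

    /** Call before the bulk load starts. */
    static void dropIndex(Connection c) throws SQLException {
        try (Statement st = c.createStatement()) {
            st.execute(dropIndexSql());
        }
    }

    /** Call once after the last row has been inserted. */
    static void createIndex(Connection c) throws SQLException {
        try (Statement st = c.createStatement()) {
            st.execute(createIndexSql());
        }
    }
}
```

With auto-commit disabled, the caller still has to call `c.commit()` after each of these statements for the change to take effect.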

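Setting autocommit off and doing a manual commit every 10,000 records or so (look into the documentation for a reasonable value; there is some limit) could speed things up as well.

Generating the index/foreign key yourself and keeping track of it should be faster than calling wordInsert.getGeneratedKeys(), but I'm not sure whether that is possible with your data.

There is an approach called "bulk insert". I don't remember the details, but it's a starting point for a search.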

Answer 3

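Write it to use threads, running 4 threads at the same time, or split the work into sections (read from a config file), distribute it to X machines, and have them load the data together.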
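One way to sketch the four-thread variant is a fixed-size thread pool that processes one file per task. The class name `ParallelLoadSketch` and the `loadFile` callback are hypothetical; the callback stands in for whatever code reads a file and inserts its rows.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class ParallelLoadSketch {
    /**
     * Runs loadFile once per file on a fixed pool of worker threads and
     * blocks until every file has been handed to the callback and finished.
     */
    static void loadAll(List<String> files, int threads, Consumer<String> loadFile) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String file : files) {
            pool.execute(() -> loadFile.accept(file));
        }
        pool.shutdown(); // no new tasks; queued ones still run
        try {
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for loaders", e);
        }
    }
}
```

A call such as `ParallelLoadSketch.loadAll(files, 4, file -> importFile(file))` would match the suggestion, where `importFile` is a placeholder for the import logic. Each task should open its own `Connection` and statements, because JDBC statements are not meant to be shared between threads.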

Answer 4

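Use batch statements to execute multiple inserts at the same time, rather than one INSERT at a time.

In addition, I would remove the part of your algorithm that updates the word count after each insert into the words table; instead, just calculate all of the word counts once inserting the words is complete.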
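A rough sketch of JDBC batching with `addBatch()` and `executeBatch()`, reusing the `wordcounts` INSERT from the question. The class and method names are made up for illustration, and the sketch assumes every line passed in belongs to the same word, so a single `wordId` applies to all of them.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class BatchInsertSketch {
    /** Parses year, total_count, total_pages, total_books from a tab-separated line. */
    static int[] parseCounts(String line) {
        String[] data = line.split("\t");
        return new int[] {
            Integer.parseInt(data[1]), Integer.parseInt(data[2]),
            Integer.parseInt(data[3]), Integer.parseInt(data[4])
        };
    }

    /** Queues rows with addBatch() and sends them in groups of batchSize. */
    static void insertCounts(Connection c, long wordId, Iterable<String> lines, int batchSize)
            throws SQLException {
        String sql = "INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) "
                   + "VALUES (?,?,?,?,?)";
        try (PreparedStatement ps = c.prepareStatement(sql)) {
            int pending = 0;
            for (String line : lines) {
                int[] v = parseCounts(line);
                ps.setLong(1, wordId);
                for (int i = 0; i < v.length; i++) {
                    ps.setInt(i + 2, v[i]); // parameters 2..5
                }
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch(); // one send for the whole group
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // flush the remainder
            }
        }
    }
}
```

Batching combines well with disabling auto-commit: the caller can commit after every few batches instead of after every row.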

Answer 5

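Another approach would be to do bulk inserts rather than single inserts. See the question "What's the fastest way to do a bulk insert into Postgres?" for more information.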
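In PostgreSQL, bulk loading is usually done with the `COPY` command. The sketch below issues a server-side `COPY` through plain JDBC. It assumes a hypothetical staging table `ngram_staging` with one column per field of the input file, which is not part of the schema in the question; the rows would still have to be split into `words` and `wordcounts` afterwards.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class CopyLoadSketch {
    // Hypothetical staging table; not part of the schema shown in the question.
    static final String STAGING_TABLE = "ngram_staging";

    /** Builds a COPY statement that reads a tab-separated file on the server. */
    static String copySql(String absolutePath) {
        String quoted = absolutePath.replace("'", "''"); // escape single quotes
        return "COPY " + STAGING_TABLE
             + " (word, \"year\", total_count, total_pages, total_books)"
             + " FROM '" + quoted + "'";
    }

    /** Runs the COPY; the default text format uses tab as the delimiter. */
    static void copyFile(Connection c, String absolutePath) throws SQLException {
        try (Statement st = c.createStatement()) {
            st.execute(copySql(absolutePath));
        }
    }
}
```

Note that `COPY ... FROM 'file'` reads the file on the database server and needs the corresponding privileges, and that the text format treats backslashes in the data as escape characters.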

Answer 6

Create threads, one per file:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class ToPostgres {

    public static void main(String[] args) throws Exception {
        String filePath = "./";
        List<String> files = new ArrayList<String>();
        for (int i = 0; i < 10; i++) {
            files.add(filePath + "googlebooks-eng-all-1gram-20090715-" + i + ".csv");
        }
        // Start one worker thread per input file.
        for (String filename : files) {
            new MyThread(filename).start();
        }
    }
}

class MyThread extends Thread {
    private final String file;

    public MyThread(String file) {
        this.file = file;
    }

    @Override
    public void run() {
        // Each thread opens its own connection and statements. Sharing them,
        // or the lastWord/lastId state, between threads would be a data race.
        try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/googlebooks", "postgres", "xxxxxx");
             PreparedStatement wordInsert = c.prepareStatement(
                     "INSERT INTO words (word) VALUES (?)", Statement.RETURN_GENERATED_KEYS);
             PreparedStatement countInsert = c.prepareStatement(
                     "INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) "
                     + "VALUES (?,?,?,?,?)");
             BufferedReader input = new BufferedReader(new FileReader(file))) {
            String lastWord = "";
            long lastId = -1L;
            String line;
            while ((line = input.readLine()) != null) {
                String[] data = line.split("\t");
                long id = -1L;
                if (lastWord.equals(data[0])) {
                    // Same word as the previous line: reuse its id.
                    id = lastId;
                } else {
                    wordInsert.setString(1, data[0]);
                    wordInsert.executeUpdate();
                    try (ResultSet resultSet = wordInsert.getGeneratedKeys()) {
                        if (resultSet.next()) {
                            id = resultSet.getLong(1);
                        }
                    }
                }
                countInsert.setLong(1, id);
                countInsert.setInt(2, Integer.parseInt(data[1]));
                countInsert.setInt(3, Integer.parseInt(data[2]));
                countInsert.setInt(4, Integer.parseInt(data[3]));
                countInsert.setInt(5, Integer.parseInt(data[4]));
                countInsert.executeUpdate();
                lastWord = data[0];
                lastId = id;
            }
        } catch (NumberFormatException | IOException | SQLException e) {
            e.printStackTrace();
        }
    }
}

6 answers

Answer 1: score 4, accepted, 2011-04-09 15:33:06
Answer 2: score 3, 2011-04-09 15:37:01
Answer 3: score 2, 2011-04-09 15:22:22
Answer 4: score 0, 2011-04-09 15:34:08
Answer 5: score 0, 2011-04-09 15:36:56
Answer 6: score 0, 2011-04-09 15:43:36
