How to optimize an inverted file for full-text indexing?

Question

I am writing a simple program that uses a sample of PDF files to build a full-text index in my database. The idea is to read each PDF file, extract the words, and store them in a HashSet.

Then I loop over the words and add each one to a MySQL table along with its file path. Each word is stored this way, one at a time, until the loop finishes. It works fine. However, for large PDF files that contain thousands of words, building the index table takes a long time. In other words, extracting the words is fast, but saving each word to the database is slow.

Code:

import java.io.File;
import java.io.IOException;
import java.util.HashSet;

// PDFBox 2.x package names are assumed here
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class IndexTest {

    public static void main(String[] args) throws Exception {
        File folder = new File("D:\\PDF1");
        File[] listOfFiles = folder.listFiles();

        for (File file : listOfFiles) {
            if (file.isFile()) {
                // Unique words found in the current file
                HashSet<String> uniqueWords = new HashSet<>();
                String path = "D:\\PDF1\\" + file.getName();
                try (PDDocument document = PDDocument.load(new File(path))) {
                    if (!document.isEncrypted()) {
                        PDFTextStripper tStripper = new PDFTextStripper();
                        String pdfFileInText = tStripper.getText(document);
                        String[] lines = pdfFileInText.split("\\r?\\n");
                        for (String line : lines) {
                            String[] words = line.split(" ");
                            for (String word : words) {
                                uniqueWords.add(word);
                            }
                        }
                    }
                } catch (IOException e) {
                    System.err.println("Exception while trying to read pdf document - " + e);
                }
                Object[] words = uniqueWords.toArray();

                // Store every word of this file (start at index 0 so no word is skipped)
                for (int i = 0; i < words.length; i++) {
                    MysqlAccessIndex connection = new MysqlAccessIndex();
                    connection.readDataBase(path, words[i].toString());
                }

                System.out.println("Completed");
            }
        }
    }
}

SQL connection code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class MysqlAccessIndex {

    private Connection connect = null;
    private PreparedStatement preparedStatement = null;

    public MysqlAccessIndex() throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.178/fulltext_ltat?"
                        + "user=root&password=root123");
        System.out.print("Connected");
    }

    public void readDataBase(String path, String word) throws Exception {
        try {
            preparedStatement = connect
                    .prepareStatement("insert IGNORE into  fulltext_ltat.test_text values (?, ?) ");

            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
        } finally {
            close();
        }
    }

    // close() was not shown in the original post; this body is an assumption
    // that releases the statement and the connection.
    private void close() {
        try {
            if (preparedStatement != null) {
                preparedStatement.close();
            }
            if (connect != null) {
                connect.close();
            }
        } catch (Exception e) {
            // Ignore failures while closing
        }
    }
}

Are there any suggestions for improving the performance?

Answer 1

The issue lies in the following code:

// This will load the MySQL driver, each DB has its own driver
Class.forName("com.mysql.jdbc.Driver");
// Setup the connection with the DB
connect = DriverManager.getConnection(
        "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");

You're recreating the connection for every word you insert into your database. A better way would be something like this:

public MysqlAccess() throws SQLException {
    connect = DriverManager
            .getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
                    + "user=root&password=root");
}

This way the connection is created only once, when an instance of the class is constructed. In your main method, create the MysqlAccess instance outside the for loop so that it is only created once.
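As a rough sketch of that loop placement: the names `SingleConnectionAccess`, `IndexLoopSketch`, `insertWord`, and `indexAll` below are hypothetical, and the two-column insert is copied from the question's table layout, so treat it as an illustration under those assumptions rather than the exact code to use.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for MysqlAccess: it keeps one connection for its whole lifetime.
class SingleConnectionAccess implements AutoCloseable {
    private final Connection connect;

    SingleConnectionAccess(Connection connect) {
        this.connect = connect;
    }

    // Inserts one (path, word) pair; only the statement is closed afterwards.
    void insertWord(String path, String word) throws SQLException {
        try (PreparedStatement ps = connect.prepareStatement(
                "insert IGNORE into fulltext_ltat.test_text values (?, ?)")) {
            ps.setString(1, path);
            ps.setString(2, word);
            ps.executeUpdate();
        }
    }

    // Closes the shared connection once all inserts are done.
    @Override
    public void close() throws SQLException {
        connect.close();
    }
}

class IndexLoopSketch {
    // Stores every word of every file and returns the number of inserts attempted.
    static int indexAll(Connection connect, Map<String, List<String>> wordsByPath)
            throws SQLException {
        int inserted = 0;
        // The access object is created once, before the loops, not once per word.
        try (SingleConnectionAccess access = new SingleConnectionAccess(connect)) {
            for (Map.Entry<String, List<String>> entry : wordsByPath.entrySet()) {
                for (String word : entry.getValue()) {
                    access.insertWord(entry.getKey(), word);
                    inserted++;
                }
            }
        }
        return inserted;
    }
}
```

A caller would open the connection once with `DriverManager.getConnection(...)`, using the URL shown above, and pass it to `indexAll` together with the words collected per file.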

MysqlAccess will look something like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class MysqlAccess {

    private Connection connect = null;

    public MysqlAccess() throws SQLException {
        // Set up the connection with the DB once, when the instance is created
        connect = DriverManager.getConnection(
                "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");
        System.out.print("Connected");
    }

    public void readDataBase(String path, String word) throws SQLException {
        // Only the statement is closed after each insert; the connection stays open for reuse
        try (PreparedStatement preparedStatement = connect.prepareStatement(
                "insert IGNORE into  fulltext_ltat.test_text values (default,?, ?) ")) {
            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
        }
    }

    // Call this once, after all words have been inserted
    public void close() throws SQLException {
        if (connect != null) {
            connect.close();
        }
    }

    private void writeResultSet(ResultSet resultSet) throws SQLException {
        // ResultSet is initially positioned before the first row
        while (resultSet.next()) {
            // Columns can be read by name or by column number,
            // which starts at 1, e.g. resultSet.getString(2);
            String path = resultSet.getString("path");
            String word = resultSet.getString("word");

            System.out.println();
            System.out.println("path: " + path);
            System.out.println("word: " + word);
        }
    }
}
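Beyond reusing the connection, JDBC batching may cut the number of round trips further. This is not part of the answer above. The sketch below assumes the two-column `test_text` table from the question, and `BatchInsertSketch`, `insertWords`, and `batchSize` are hypothetical names introduced for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collection;

class BatchInsertSketch {
    // Queues one insert per word and sends them to the server in groups of batchSize.
    // Returns the number of executeBatch() calls that were made.
    static int insertWords(Connection connect, String path,
                           Collection<String> words, int batchSize) throws SQLException {
        int flushes = 0;
        try (PreparedStatement ps = connect.prepareStatement(
                "insert IGNORE into fulltext_ltat.test_text values (?, ?)")) {
            int pending = 0;
            for (String word : words) {
                ps.setString(1, path);
                ps.setString(2, word);
                ps.addBatch();               // queue the row instead of executing it now
                if (++pending == batchSize) {
                    ps.executeBatch();       // send the queued rows together
                    flushes++;
                    pending = 0;
                }
            }
            if (pending > 0) {               // send whatever is left over
                ps.executeBatch();
                flushes++;
            }
        }
        return flushes;
    }
}
```

With MySQL Connector/J, the `rewriteBatchedStatements=true` connection property is often suggested so that a batch is sent as multi-row inserts; whether it helps here depends on the driver version and table, so it is worth measuring.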

Answer 1 posted 2018-10-22 08:53:48