How to optimize inverted file for full text indexing?
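I am building a simple program that uses a sample of PDF files to build a full-text index in my database. My idea is to read each PDF file, extract the words, and store them in a hash set.
Then, in a loop, each word is added together with its file path to a table in MySQL. The words are inserted one at a time until all of them have been stored. It works fine. However, for large PDF files that contain thousands of words, building the index table can take a long time. In other words, although extracting the words is fast, saving each word to the database takes a long time.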
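The extraction step described above can be tried in isolation, without PDFBox or a database. The following is only a sketch under assumptions: the names WordExtractor and uniqueWords are made up for illustration, and, unlike a plain split on spaces, it lower-cases the words and splits on anything that is not a letter or a digit, so that punctuation and empty strings do not end up in the set.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class WordExtractor {
    /** Returns the distinct lower-cased words of the text, in first-seen order. */
    static Set<String> uniqueWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        // Split on any run of characters that is neither a letter nor a digit.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // a leading separator produces one empty token
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for the output of PDFTextStripper.
        String sample = "Full text indexing: full-text INDEXING of PDF files.\nPDF files, again.";
        System.out.println(uniqueWords(sample));
    }
}
```

Normalizing the words this way also tends to shrink the set, which means fewer rows to insert per file.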
Code:
import java.io.File;
import java.io.IOException;
import java.util.HashSet;

// PDFBox 2.x package names are assumed here; the post does not show its imports.
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class IndexTest {
    public static void main(String[] args) throws Exception {
        File folder = new File("D:\\PDF1");
        File[] listOfFiles = folder.listFiles();
        for (File file : listOfFiles) {
            if (file.isFile()) {
                // Collect the distinct words of this file
                HashSet<String> uniqueWords = new HashSet<>();
                String path = "D:\\PDF1\\" + file.getName();
                try (PDDocument document = PDDocument.load(new File(path))) {
                    if (!document.isEncrypted()) {
                        PDFTextStripper tStripper = new PDFTextStripper();
                        String pdfFileInText = tStripper.getText(document);
                        String lines[] = pdfFileInText.split("\\r?\\n");
                        for (String line : lines) {
                            String[] words = line.split(" ");
                            for (String word : words) {
                                uniqueWords.add(word);
                            }
                        }
                    }
                } catch (IOException e) {
                    System.err.println("Exception while trying to read pdf document - " + e);
                }
                Object[] words = uniqueWords.toArray();
                // Start at index 0 so that no word is skipped
                for (int i = 0; i < words.length; i++) {
                    MysqlAccessIndex connection = new MysqlAccessIndex();
                    connection.readDataBase(path, words[i].toString());
                }
                System.out.println("Completed");
            }
        }
    }
}
SQL connection code:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MysqlAccessIndex {
    private Connection connect = null;
    private PreparedStatement preparedStatement = null;

    public MysqlAccessIndex() throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.178/fulltext_ltat?"
                        + "user=root&password=root123");
        System.out.print("Connected");
    }

    public void readDataBase(String path, String word) throws Exception {
        try {
            preparedStatement = connect
                    .prepareStatement("insert IGNORE into fulltext_ltat.test_text values (?, ?) ");
            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
        } catch (Exception e) {
            throw e;
        } finally {
            close();
        }
    }

    // close() is not shown in the post; this version assumes it releases both
    // the statement and the connection, since each instance serves one insert.
    private void close() {
        try {
            if (preparedStatement != null) preparedStatement.close();
            if (connect != null) connect.close();
        } catch (SQLException e) {
            // Ignored: nothing useful can be done if closing fails
        }
    }
}
Are there any suggestions for improving or optimizing the performance?
The problem lies in the following code:
// This will load the MySQL driver, each DB has its own driver
Class.forName("com.mysql.jdbc.Driver");
// Setup the connection with the DB
connect = DriverManager.getConnection(
"jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");
You are re-creating the connection for every word you insert into the database. A better approach is this:
// DriverManager.getConnection throws the checked SQLException, so declare it
public MysqlAccess() throws SQLException {
    connect = DriverManager
            .getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
                    + "user=root&password=root");
}
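This way, connect is created only once, the first time an instance of the class is created. Inside your main method, you must create the MysqlAccess instance outside the for loop, so that the instance is created only once.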
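The same idea can be sketched independently of the MysqlAccess class: a helper receives a Connection that was opened once, outside the loop, and reuses it, together with a single PreparedStatement, for every word of a file. The names SingleConnectionIndexer and indexWords are hypothetical, and the column list of the insert is an assumption about the table, which the post does not describe.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collection;

final class SingleConnectionIndexer {
    // Assumed table layout: an auto-generated id followed by path and word.
    static final String INSERT_SQL =
            "insert IGNORE into fulltext_ltat.test_text values (default, ?, ?)";

    /**
     * Inserts every word of one file through the connection passed in.
     * The connection is neither opened nor closed here, so the caller can
     * keep using it for the next file. Returns the number of inserts sent.
     */
    static int indexWords(Connection connect, String path, Collection<String> words)
            throws SQLException {
        int sent = 0;
        // The statement is prepared once per file and reused for each word.
        try (PreparedStatement ps = connect.prepareStatement(INSERT_SQL)) {
            for (String word : words) {
                ps.setString(1, path);
                ps.setString(2, word);
                ps.executeUpdate();
                sent++;
            }
        }
        return sent;
    }
}
```

In main, the connection would be opened once before the loop over the files, for example with DriverManager.getConnection and the JDBC URL from the post, passed to indexWords for each file, and closed after the loop.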
MysqlAccess would look like this:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class MysqlAccess {
    private Connection connect = null;
    private PreparedStatement preparedStatement = null;

    public MysqlAccess() throws SQLException {
        // Set up the connection with the DB once, when the instance is created
        connect = DriverManager.getConnection(
                "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");
    }

    public void readDataBase(String path, String word) throws Exception {
        try {
            // The prepared statement binds the path and the word as parameters
            preparedStatement = connect.prepareStatement(
                    "insert IGNORE into fulltext_ltat.test_text values (default,?, ?) ");
            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
        } finally {
            close();
        }
    }

    // close() is not shown in the post; it is assumed here to release only the
    // statement, so that the shared connection stays open for the next word.
    private void close() throws SQLException {
        if (preparedStatement != null) {
            preparedStatement.close();
            preparedStatement = null;
        }
    }

    private void writeResultSet(ResultSet resultSet) throws SQLException {
        // ResultSet is initially before the first data set
        while (resultSet.next()) {
            // Columns can be read by name or by column number, which starts at 1,
            // e.g. resultSet.getString(2);
            String path = resultSet.getString("path");
            String word = resultSet.getString("word");
            System.out.println();
            System.out.println("path: " + path);
            System.out.println("word: " + word);
        }
    }
}
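Once the connection is reused, a further option, which is not part of the answer above, is to send the words in JDBC batches inside one transaction instead of one executeUpdate per word. The sketch below uses only standard java.sql calls (addBatch, executeBatch, setAutoCommit, commit, rollback); the names BatchedWordInserter and insertBatched and the batch size are illustrative assumptions, and how much time it saves would have to be measured.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collection;

final class BatchedWordInserter {
    // Same statement shape as in MysqlAccess; the table layout is assumed.
    static final String INSERT_SQL =
            "insert IGNORE into fulltext_ltat.test_text values (default, ?, ?)";

    /**
     * Queues the words of one file and sends them in groups of batchSize,
     * committing once at the end. Returns the number of batches sent.
     */
    static int insertBatched(Connection connect, String path,
                             Collection<String> words, int batchSize) throws SQLException {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        boolean previousAutoCommit = connect.getAutoCommit();
        connect.setAutoCommit(false); // one commit per file instead of one per row
        int batches = 0;
        try (PreparedStatement ps = connect.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (String word : words) {
                ps.setString(1, path);
                ps.setString(2, word);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    batches++;
                    pending = 0;
                }
            }
            if (pending > 0) { // send the last, partially filled batch
                ps.executeBatch();
                batches++;
            }
            connect.commit();
        } catch (SQLException e) {
            connect.rollback(); // discard the partially written file
            throw e;
        } finally {
            connect.setAutoCommit(previousAutoCommit);
        }
        return batches;
    }
}
```

A call such as insertBatched(connect, path, uniqueWords, 500) would replace the inner loop over words. With MySQL Connector/J, the rewriteBatchedStatements=true connection property is commonly suggested so that a batch is combined into multi-row inserts; treat that as something to verify against the driver version in use.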