PHP search engine for text files with indexing

I have some text files inside a directory (and its subdirectories). There will be 50,000+ text files, and the directory is outside 'public_html':

text_root_dir
    |-- `001
           |-- text0003.txt
           |-- text0004.txt
           |-- text0005.txt
           |-- `001_a
                   |-- text0006.txt
                   |-- text0007.txt
                   |-- text0008.txt
    |-- text0001.txt
    |-- text0002.txt

The text file details are saved in a MySQL table (with the 'art_textfile' column storing the text file name and the 'art_path' column storing the file path):

CREATE TABLE `stxt_articles` (
  `art_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT ,
  `art_title` VARCHAR(127) NOT NULL,
  `art_author`  VARCHAR(255) NOT NULL,
  `art_textfile`  VARCHAR(255) NOT NULL, /* TEXT FILE NAME */
  `art_path` VARCHAR(255) NOT NULL, /* TEXT FILE PATH */
    PRIMARY KEY(`art_id`)
  ) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

I am using PHP/MySQL (LAMP) and want to do a string search on the text files (with regular expressions if possible). The methods that would logically work are:

  1. Store the contents in the MySQL database and search with a MySQL query (LIKE 's%').
  2. Scan the directory with PHP and search within each text file for the search expression.

But with a large dataset of 50,000+ files (which tends to grow over time), the above options are not practical. They would be too slow to use.

What I am looking for is a PHP/MySQL approach that builds an index over the text files and searches that index, much as Lucene does in Java. One could call it a Lucene alternative in PHP with MySQL.
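For illustration only, the kind of index meant here can be sketched as a toy in-memory inverted index: a map from each word to the files that contain it. This is not how Lucene is implemented internally. The function names (tokenize, buildIndex, search) are hypothetical, and case folding with strtolower is an ASCII-only simplification.

```php
<?php
declare(strict_types=1);

/** Split text into lowercase word tokens (letters only). */
function tokenize(string $text): array
{
    preg_match_all('/\pL+/u', strtolower($text), $matches);
    return $matches[0];
}

/** Build a map "word => list of document ids" from an array "id => text". */
function buildIndex(array $docs): array
{
    $index = [];
    foreach ($docs as $id => $text) {
        // array_unique: list each document once per word.
        foreach (array_unique(tokenize($text)) as $word) {
            $index[$word][] = $id;
        }
    }
    return $index;
}

/** Return the ids of documents that contain every word of the query. */
function search(array $index, string $query): array
{
    $result = null;
    foreach (array_unique(tokenize($query)) as $word) {
        $ids = $index[$word] ?? [];
        $result = $result === null
            ? $ids
            : array_values(array_intersect($result, $ids));
    }
    return $result ?? [];
}

$index = buildIndex([
    'text0001.txt' => 'John met Doe',
    'text0002.txt' => 'John alone',
]);
print_r(search($index, 'john doe'));
```

A lookup touches only the posting lists of the query words, so no file is opened at search time; the cost moves to building and updating the index.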

Thanks for reading this far, and thanks also for your thoughts.

Using something like AJAX seems to be pretty fast; I'm sorry if I misunderstood your post. (The code will need tweaks to do exactly what you want, but it should be a good starting point.)

index.html

<html>
<head>
<script>
function showResult(str) {
  if (str.length==0) {
    document.getElementById("search").innerHTML="";
    document.getElementById("search").style.border="0px";
    return;
  }
  var xmlhttp=new XMLHttpRequest();
  xmlhttp.onreadystatechange=function() {
    if (this.readyState==4 && this.status==200) {
      document.getElementById("search").innerHTML=this.responseText;
      document.getElementById("search").style.border="1px solid #A5ACB2";
    }
  }
  xmlhttp.open("GET","search.php?q="+encodeURIComponent(str),true);
  xmlhttp.send();
}
</script>
</head>
<body>

<form>
<input type="text" size="30" onkeyup="showResult(this.value)">
<div id="search"></div>
</form>

</body>
</html>

search.php

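<?php
// Get the q parameter from the URL (empty string if it is absent).
$q = $_GET["q"] ?? "";

// Look up matching file names only when q is non-empty.
// Note: this matches file names, not file contents.
if (strlen($q) > 0) {
  $directory = 'Directory'; // placeholder: the folder to scan
  $results_array = array();

  // Collect the directory entries, skipping "." and "..".
  if (is_dir($directory)) {
    if ($handle = opendir($directory)) {
      while (($file = readdir($handle)) !== FALSE) {
        if ($file !== '.' && $file !== '..') {
          $results_array[] = $file;
        }
      }
      closedir($handle);
    }
  }

  // Print each name that starts with the query (str_starts_with needs PHP 8+).
  foreach ($results_array as $value) {
    if (str_starts_with($value, $q)) {
      echo htmlspecialchars($value), "<br>";
    }
  }
}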

Opening 50,000 files would, by itself, take a long time, and that does not include the time to search the text in each one.

Load the data into a MySQL table with ENGINE=InnoDB (not the legacy MyISAM). Then you can run very fast "word-oriented" queries; being word-oriented is FULLTEXT's limitation.

You can also do LIKEs (slow) or REGEXPs (even slower).
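As a rough sketch of the loading step, the following walks the text directory recursively and inserts each file's contents into an InnoDB table with a FULLTEXT index. The table name stxt_contents, its layout, and the helper names iterTextFiles and loadTextFiles are all assumptions made for illustration; only the column name txt comes from this answer.

```php
<?php
declare(strict_types=1);

// Hypothetical table; only the column name `txt` is taken from the answer.
const CREATE_SQL = <<<'SQL'
CREATE TABLE IF NOT EXISTS stxt_contents (
  id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  art_path VARCHAR(255) NOT NULL,
  txt      MEDIUMTEXT NOT NULL,
  PRIMARY KEY (id),
  FULLTEXT KEY ft_txt (txt)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
SQL;

/** Yield "relative path => contents" for every .txt file under $root. */
function iterTextFiles(string $root): Generator
{
    $root = rtrim($root, '/');
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if ($file->isFile() && strtolower($file->getExtension()) === 'txt') {
            $relative = substr($file->getPathname(), strlen($root) + 1);
            yield $relative => (string) file_get_contents($file->getPathname());
        }
    }
}

/** Create the table if needed and insert one row per file; returns the row count. */
function loadTextFiles(PDO $pdo, string $root): int
{
    $pdo->exec(CREATE_SQL);
    $insert = $pdo->prepare(
        'INSERT INTO stxt_contents (art_path, txt) VALUES (?, ?)'
    );
    $count = 0;
    foreach (iterTextFiles($root) as $path => $text) {
        $insert->execute([$path, $text]);
        $count++;
    }
    return $count;
}
```

Given a PDO connection to the MySQL database, loadTextFiles($pdo, $root) would be called once with the path of text_root_dir. Running it a second time inserts duplicate rows, so a real loader would need to track which files are new or changed.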

What I do in such situations is allow the user to use either LIKE syntax, REGEXP syntax, or simple word(s). Add FULLTEXT(txt) (assuming txt contains all the text you need to search). Then my code goes something like:

If it looks like "word(s)" of at least 3 letters, I stick a '+' in front of each word and build MATCH(txt) AGAINST ("+John +Doe" IN BOOLEAN MODE). In most situations it will be very fast.

Else, if I see %, I build a LIKE expression and assume the user knows LIKE syntax.

Else, I assume it is a regexp and go down that path.
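The three-way decision above could be sketched in PHP roughly as follows. The function name buildSearchClause and the regular expression used to decide what "looks like words" are assumptions, not the exact code described here. The function returns a WHERE fragment with a ? placeholder plus the value to bind, so the user's input is never concatenated into the SQL.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: choose a WHERE clause for the user's input.
 * Returns [sqlFragment, valueToBind]; `txt` is the column from the answer.
 */
function buildSearchClause(string $input): array
{
    $input = trim($input);
    if ($input === '') {
        throw new InvalidArgumentException('Empty search string');
    }

    // Case 1: only words of at least 3 letters -> boolean-mode FULLTEXT.
    if (preg_match('/^\pL{3,}(?:\s+\pL{3,})*$/u', $input) === 1) {
        $words = preg_split('/\s+/', $input);
        return [
            'MATCH(txt) AGAINST (? IN BOOLEAN MODE)',
            '+' . implode(' +', $words),
        ];
    }

    // Case 2: a % wildcard -> treat the input as a LIKE pattern.
    if (strpos($input, '%') !== false) {
        return ['txt LIKE ?', $input];
    }

    // Case 3: anything else is assumed to be a regular expression.
    return ['txt REGEXP ?', $input];
}

[$clause, $value] = buildSearchClause('John Doe');
echo $clause, ' / ', $value, PHP_EOL;
// prints: MATCH(txt) AGAINST (? IN BOOLEAN MODE) / +John +Doe
```

The fragment would then go after WHERE in a prepared statement, with the second element bound to the placeholder.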

It's imperfect, but it covers a lot of bases.

If the users understand that "words" are faster, they will gravitate that way.
