
Search Text In Files Using PHP

How can I search for text in files such as PDF, doc, docx or txt using PHP? I want something similar to Full Text Search in MySQL, but this time I am searching directly through files, not a database.

The search will run over many files located in a folder. Any suggestions, tips or solutions for this problem?

I also noticed that Google searches through files as well.

To search PDFs you'll need a program like pdftotext, which converts the content of a PDF to text. For Word documents a similar tool may be available (it is needed because of all the styling and encryption in Word files).

An example of searching through PDFs, copied from one of my scripts, where I extract keywords and store matches in a PDF results array. It's a snippet, not the entire code, but it should give you some understanding:

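// $i is the index of the current PDF; it is set by code outside this snippet.
$file = ABSOLUTE_PATH_SITE."_uploaded/files/Transcripties/".$pdfFiles[$i];

// Convert the PDF to text once; escapeshellarg() quotes the path safely for the shell.
$content = strtolower((string) shell_exec('/usr/bin/pdftotext '.escapeshellarg($file).' -'));

foreach($keywords as $keyword)
{
    // Count case-insensitive occurrences of the keyword in the extracted text.
    $result = substr_count($content, strtolower($keyword));

    if($result > 0)
    {
        // Record each PDF only once: compare against the stored "pdfFile" values.
        if(!in_array($pdfFiles[$i], array_column($matchesOnPDF, "pdfFile")))
        {
            array_push($matchesOnPDF, array(
                    "matches"   => $result,
                    "type"      => "PDF",
                    "pdfFile"   => $pdfFiles[$i]));
        }
    }
}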

Depending on the file type, you should convert the file to text and then search through it using, e.g., file_get_contents() and strpos(). To convert files to text you have, among others, the following tools available:

  • catdoc for Word files
  • xlhtml for Excel files
  • ppthtml for PowerPoint files
  • unrtf for RTF files
  • pdftotext for PDF files
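
The convert-then-search idea can be sketched roughly as follows. This is only a minimal illustration: the function names fileToText() and searchFolder() and the CONVERTERS table are hypothetical, and the exact command lines for each tool are assumptions that should be checked against the versions installed on your server. Only three of the tools from the list are wired up here; the others would be added the same way.

```php
<?php
// Hypothetical table: file extension => shell command that prints plain text.
// The command lines are assumptions; verify them against your installed tools.
const CONVERTERS = [
    'pdf' => '/usr/bin/pdftotext %s -',
    'doc' => 'catdoc %s',
    'rtf' => 'unrtf --text %s',
];

// Return the text content of a file, or an empty string if it cannot be read.
function fileToText(string $path): string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($ext === 'txt') {
        return (string) file_get_contents($path);
    }
    if (!isset(CONVERTERS[$ext])) {
        return ''; // unsupported type: nothing to search
    }
    // escapeshellarg() quotes the path so odd file names cannot break the command.
    $cmd = sprintf(CONVERTERS[$ext], escapeshellarg($path));
    return (string) shell_exec($cmd);
}

// Return the names of all files directly inside $dir whose text contains $needle
// (case-insensitive). Subfolders are skipped in this sketch.
function searchFolder(string $dir, string $needle): array
{
    $matches = [];
    foreach (scandir($dir) as $name) {
        $path = $dir . DIRECTORY_SEPARATOR . $name;
        if (!is_file($path)) {
            continue;
        }
        if (stripos(fileToText($path), $needle) !== false) {
            $matches[] = $name;
        }
    }
    return $matches;
}
```

Calling searchFolder('/path/to/files', 'keyword') would then return the list of matching file names. Converting every file on every search is slow for large folders, so caching the extracted text is probably worth considering.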

If you are on a Linux server you may use

grep -R "text to be searched for" ./   # searches everything under the current directory

called from PHP using exec, resulting in

$cmd = 'grep -R "text to be searched for" ./';
exec($cmd, $result);   // $result receives every output line; exec() itself returns only the last one
print_r($result);
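
If the search text comes from user input, it should not be pasted into the command string directly. A minimal sketch of a safer wrapper is shown below; the function name grepFolder() is hypothetical, and the flags used (-i ignore case, -l list matching files only, -F treat the text as a fixed string) are a design choice for this example, not part of the answer above. Note that grep looks at raw file bytes, so it is mainly useful for plain text files.

```php
<?php
// Hypothetical wrapper around grep; returns the paths of files containing $needle.
function grepFolder(string $needle, string $dir): array
{
    // escapeshellarg() quotes both arguments; "--" stops grep from reading
    // a needle that starts with "-" as an option.
    $cmd = 'grep -R -i -l -F -- ' . escapeshellarg($needle) . ' ' . escapeshellarg($dir);

    $lines = [];
    exec($cmd, $lines, $status);

    // grep exits with 0 on a match, 1 on no match, and 2 or more on an error.
    if ($status > 1) {
        throw new RuntimeException("grep failed with status $status");
    }

    sort($lines);
    return $lines;
}
```

For example, grepFolder('text to be searched for', './') would list the files under the current directory that contain the phrase.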

2021: I came across this question and found something, so I figured I would link to it...

Note: docx, PDF and similar files are not regular text files. Reading or editing each type requires more scripting and/or a different library, unless you can find an all-in-one library. This means you would have to script out each file type you want to search through, including normal text files. If you don't want to script it completely yourself, you have to install a library for each file type you want to read. Even then, you still need to write code for each type that calls the library's functions.

I found the basic answer here on the stack.
