
How can I scrape MS Word document text on a Linux server?

I have been asked about creating a site where some users can upload Microsoft Word documents and others can then search the uploaded documents for certain keywords. The site would sit on a Linux server running PHP and MySQL. I'm currently trying to find out whether and how I can scrape the text from these documents. If anyone can suggest a good way of doing this, it would be much appreciated.

Scraping text from the new docx format is trivial. The file itself is just a zip file, and if you look inside one, you will find a bunch of XML files. The text is contained in word/document.xml within this zip file, and all the actual user-entered text appears in <w:t> tags. If you extract all the text that appears in <w:t> tags, you will have scraped the document.
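The docx approach described above might look roughly like this in PHP. This is only a sketch: the function names docx_xml_to_text and docx_file_to_text are made up for illustration, it assumes the zip and dom extensions are available, and it assumes the usual (transitional) WordprocessingML namespace. Text is joined per paragraph (<w:p>) so that a word split across several <w:t> runs stays intact; text boxes nested inside a paragraph may show up twice.

```php
<?php
// Illustrative sketch; the function names are hypothetical, not from the answer.

/**
 * Collect the text of every <w:t> element in a word/document.xml string.
 * Runs inside one paragraph (<w:p>) are concatenated; paragraphs are
 * separated by newlines. Returns '' if the XML cannot be parsed.
 */
function docx_xml_to_text(string $xml): string
{
    // Assumption: the standard transitional WordprocessingML namespace.
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $dom = new DOMDocument();
    if ($xml === '' || !@$dom->loadXML($xml)) {
        return '';
    }

    $lines = [];
    foreach ($dom->getElementsByTagNameNS($ns, 'p') as $paragraph) {
        $text = '';
        foreach ($paragraph->getElementsByTagNameNS($ns, 't') as $run) {
            $text .= $run->textContent;
        }
        $lines[] = $text;
    }

    return trim(implode("\n", $lines));
}

/**
 * Open a .docx file as a zip archive and extract the text of
 * word/document.xml. Returns false if the file cannot be read.
 */
function docx_file_to_text(string $path)
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false;
    }

    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? false : docx_xml_to_text($xml);
}
```

Calling docx_file_to_text() on an uploaded file's path returns the plain text (or false), which could then be stored in MySQL alongside the upload for keyword searches.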

Here's a good example using catdoc:

function catdoc_string($str)
{
    // requires catdoc

    // write to temp file
    $tmpfname = tempnam ('/tmp','doc');
    $handle = fopen($tmpfname,'w');
    fwrite($handle,$str);
    fclose($handle);

    // run catdoc
    $ret = shell_exec('catdoc -ab '.escapeshellarg($tmpfname) .' 2>&1');

    // remove temp file
    unlink($tmpfname);

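    // catdoc is not installed: the shell's "command not found" message
    // comes back via 2>&1 (bash: "sh: line 1: catdoc", dash: "sh: 1: catdoc")
    if (preg_match('/^sh: (?:line )?\d+: catdoc/i',(string) $ret)) {
        return false;
    }

    return trim((string) $ret);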
}

function catdoc_file($fname)
{
    // requires catdoc

    // run catdoc
    $ret = shell_exec('catdoc -ab '.escapeshellarg($fname) .' 2>&1');

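    // catdoc is not installed: the shell's "command not found" message
    // comes back via 2>&1 (bash: "sh: line 1: catdoc", dash: "sh: 1: catdoc")
    if (preg_match('/^sh: (?:line )?\d+: catdoc/i',(string) $ret)) {
        return false;
    }

    return trim((string) $ret);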
}

Source
