How to scrape MS Word document text on a Linux server?

Question

I have been asked about creating a site where some users can upload Microsoft Word documents and others can then search the uploaded documents for certain keywords. The site would sit on a Linux server running PHP and MySQL. I'm currently trying to find out whether, and how, I can scrape the text from the documents. If anyone can suggest a good way of doing this, it would be much appreciated.

Answer 1

Scraping text from the new docx format is trivial. The file itself is just a zip file, and if you look inside one, you will find a bunch of XML files. The body text is contained in word/document.xml within this zip file, and all the actual user-entered text appears in <w:t> tags. If you extract all the text that appears in <w:t> tags, you will have scraped the document.
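
As a minimal sketch of the approach described above (not a definitive implementation): the helper names docx_to_text() and document_xml_to_text() are made up for illustration, the code assumes PHP's zip extension (ZipArchive) is installed, and it assumes the XML uses the conventional w: prefix. A real XML parser would be more robust than the regular expression used here.

```php
<?php
// Hypothetical helper: open a .docx (a zip archive) and return its body text,
// or false if the file cannot be opened or has no word/document.xml entry.
function docx_to_text(string $path)
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? false : document_xml_to_text($xml);
}

// Hypothetical helper: collect the contents of every <w:t> element,
// emitting one line of output per paragraph (<w:p>).
function document_xml_to_text(string $xml): string
{
    $paragraphs = [];
    foreach (explode('</w:p>', $xml) as $chunk) {
        // Matches <w:t> and <w:t xml:space="preserve">, but not <w:tab/> or <w:tbl>
        preg_match_all('/<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/', $chunk, $matches);
        $text = html_entity_decode(implode('', $matches[1]), ENT_QUOTES | ENT_XML1, 'UTF-8');
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return implode("\n", $paragraphs);
}
```

Calling docx_to_text() on an uploaded file's path would give a plain-text string that can then be stored for keyword search.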

Answer 2

Here's a good example using catdoc:

function catdoc_string($str)
{
    // requires catdoc

    // write the document contents to a temp file so catdoc can read them
    $tmpfname = tempnam ('/tmp','doc');
    $handle = fopen($tmpfname,'w');
    fwrite($handle,$str);
    fclose($handle);

    // run catdoc
    $ret = shell_exec('catdoc -ab '.escapeshellarg($tmpfname) .' 2>&1');

    // remove temp file
    unlink($tmpfname);

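    // a shell error mentioning catdoc means it is not installed;
    // the exact message format depends on the shell
    if (preg_match('/^sh: line 1: catdoc/i',$ret)) {
        return false;
    }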

    return trim($ret);
}

function catdoc_file($fname)
{
    // requires catdoc

    // run catdoc
    $ret = shell_exec('catdoc -ab '.escapeshellarg($fname) .' 2>&1');

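    // a shell error mentioning catdoc means it is not installed;
    // the exact message format depends on the shell
    if (preg_match('/^sh: line 1: catdoc/i',$ret)) {
        return false;
    }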

    return trim($ret);
}
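
The functions above detect a missing catdoc by matching the shell's error text, which varies between shells. A possible alternative, sketched here under stated assumptions, is to check the command's exit status instead. run_extractor() and catdoc_file_checked() are hypothetical names introduced for illustration, and the -ab flags are simply copied from the example above.

```php
<?php
// Hypothetical helper: run an external text extractor on a file and return
// its output, or false if the command is missing or exits with an error.
function run_extractor(string $command, array $flags, string $fname)
{
    // Quote every part of the command line before handing it to the shell
    $parts = array_map('escapeshellarg', array_merge([$command], $flags, [$fname]));
    $cmd = implode(' ', $parts) . ' 2>/dev/null';

    $output = [];
    $exitCode = 0;
    exec($cmd, $output, $exitCode);

    // A POSIX shell reports 127 for "command not found";
    // any non-zero status is treated as a failure here
    if ($exitCode !== 0) {
        return false;
    }
    return trim(implode("\n", $output));
}

// Same call as catdoc_file() above, but relying on the exit status
function catdoc_file_checked(string $fname)
{
    return run_extractor('catdoc', ['-ab'], $fname);
}
```

Because stderr is discarded rather than merged into the output, error messages never end up in the extracted text.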

Question description

2 solutions

Solution 1
4 2010-11-24 11:01:20

Solution 2
2 Accepted 2010-11-24 10:53:17
