How can I scrape MS Word document text on a Linux server?
I have been asked about creating a site where some users can upload Microsoft Word documents and others can then search the uploaded documents for certain keywords. The site would sit on a Linux server running PHP and MySQL. I'm currently trying to find out whether, and how, I can scrape the text from the documents. If anyone can suggest a good way of going about this, it would be much appreciated.
Scraping text from the newer docx format is straightforward. The file itself is just a zip archive, and if you look inside one you will find a bunch of XML files. The main body text is in word/document.xml within this archive, and the actual user-entered text appears in <w:t> tags. If you extract all the text that appears in <w:t> tags, you will have scraped the document.
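As a rough illustration of the docx approach described above, here is a minimal PHP sketch. It assumes the zip and dom extensions are available; the function names docx_xml_to_text and docx_file_to_text are made up for this example and are not part of any library. It only reads word/document.xml, so text stored in other parts of the archive (headers, footers, footnotes) is not picked up, and text boxes nested inside a paragraph may be counted twice.

```php
<?php
// Namespace URI used by the w: prefix in WordprocessingML documents.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: turn the contents of word/document.xml into plain text.
// Text from all <w:t> runs in a paragraph (<w:p>) is joined, one paragraph per line.
function docx_xml_to_text(string $xml): string
{
    $dom = new DOMDocument();
    if ($xml === '' || !@$dom->loadXML($xml)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return trim(implode("\n", $paragraphs));
}

// Hypothetical helper: open a .docx file and extract its body text.
// Returns false if the file is not a readable zip or lacks word/document.xml.
function docx_file_to_text(string $path)
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return false;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? false : docx_xml_to_text($xml);
}
```

Calling docx_file_to_text() on an uploaded file's path should give a string that can be stored in MySQL for keyword searching. Joining runs per paragraph matters because Word often splits a single word across several <w:t> elements.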
Here's a good example using catdoc, a command-line tool that extracts text from the older binary .doc format:
    function catdoc_string($str)
    {
        // requires catdoc
        // write the raw document bytes to a temp file
        $tmpfname = tempnam('/tmp', 'doc');
        $handle = fopen($tmpfname, 'w');
        fwrite($handle, $str);
        fclose($handle);
        // run catdoc, capturing stderr along with stdout
        $ret = shell_exec('catdoc -ab ' . escapeshellarg($tmpfname) . ' 2>&1');
        // remove temp file
        unlink($tmpfname);
        // the shell prints an error starting like this when catdoc is not installed
        if (preg_match('/^sh: line 1: catdoc/i', (string) $ret)) {
            return false;
        }
        return trim((string) $ret);
    }
    function catdoc_file($fname)
    {
        // requires catdoc
        // run catdoc
        $ret = shell_exec('catdoc -ab ' . escapeshellarg($fname) . ' 2>&1');
        if (preg_match('/^sh: line 1: catdoc/i', $ret)) {
            return false;
        }
        return trim($ret);
    }