
Read a Word XML file using PHP

Does anyone have any recommendations for how to read a Word (2007-2013) file using PHP? I'm using the built-in styles to mark up a Word document, and would ideally like to read it with PHP in order to analyze the contents. I've tried searching Google and this site, but no luck. If anyone has any experience with this or ideas on where to get started, it would be appreciated.

If you are just interested in the content of the Word document, for example to turn it into an HTML page, I would not recommend PHPWord, as its internal structure is quite complex. The following code uses only native PHP functionality to read all paragraphs of a docx document.

    /* A DOCX file is actually a ZIP archive containing other files.
       word/document.xml holds the text of your document; the style
       definitions live in other files of the archive (such as
       word/styles.xml), so you need to drill further to extract them. */

    $result = file_get_contents('zip://word.docx#word/document.xml');

    // Load the document XML into PHP's SimpleXML, selecting the "w" prefix
    $xml = simplexml_load_string($result, null, 0, 'w', true);
    $body = $xml->body;
    foreach ($body[0] as $key => $value) {
        // Only w:p elements are paragraphs; skip tables, section properties, etc.
        if ($key == "p") {
            echo "<p>";
            foreach ($value->r as $kkey => $vvalue) {
                // Each run (w:r) keeps its text in w:t; escape it for HTML output
                echo htmlspecialchars((string)$vvalue->t);
            }
            echo "</p>";
        }
    }
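Since the question is about the built-in styles, here is a minimal sketch of how the style of each paragraph might be read alongside its text. It assumes the usual WordprocessingML layout, where a paragraph's style ID sits in `w:pPr/w:pStyle/@w:val`. The function name `paragraphsWithStyles` and the sample XML are made up for illustration; in practice the XML string would come from `word/document.xml` as in the code above.

```php
<?php
// WordprocessingML main namespace, bound to the "w" prefix in document.xml
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

/**
 * Hypothetical helper: returns one ['style' => ..., 'text' => ...] entry
 * for each top-level paragraph in the given document.xml string.
 */
function paragraphsWithStyles(string $documentXml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($documentXml);

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    $paragraphs = [];
    // Direct children of w:body only; paragraphs inside tables are not visited
    foreach ($xpath->query('//w:body/w:p') as $p) {
        // string() yields '' when the paragraph has no explicit style
        $style = $xpath->evaluate('string(w:pPr/w:pStyle/@w:val)', $p);

        // Concatenate the text of every w:t below this paragraph
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = ['style' => $style, 'text' => $text];
    }
    return $paragraphs;
}

// Made-up sample standing in for the contents of word/document.xml
$sample = '<w:document xmlns:w="' . W_NS . '"><w:body>'
    . '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    . '<w:r><w:t>Intro</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>Plain </w:t></w:r><w:r><w:t>text</w:t></w:r></w:p>'
    . '</w:body></w:document>';

print_r(paragraphsWithStyles($sample));
```

Paragraphs without an explicit style come back with an empty string, so the caller can decide what default to apply to them.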

You can use PHPWord! I believe it has a function to read docs.

I know it's not quite what you were looking for, but could you get them to re-save the Word documents in .odt?

This article could help if you get to that stage: reading odt files in php

Here you go :)

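$zip = new ZipArchive;
// open() returns true on success and an error code otherwise
if ($zip->open("MyFile.docx") === true) {
    if (($index = $zip->locateName("word/document.xml")) !== false) {
        $text = $zip->getFromIndex($index);
        // loadXML() is an instance method, so create the DOMDocument first
        $xml = new DOMDocument();
        $xml->loadXML($text, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
        echo $xml->saveXML();
    }
    $zip->close();
}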

If you need to install the zip extension, you can find it here: http://php.net/manual/en/zip.installation.php

Hope it helps you along!
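The same archive also contains `word/styles.xml`, which is where style IDs are mapped to the names shown in Word. The following is a rough sketch of building that map, assuming each style is a `w:style` element with a `w:styleId` attribute and a `w:name` child. The function name `styleNames` and the sample XML are hypothetical. With a real file, the input string could be obtained with `$zip->getFromName('word/styles.xml')`.

```php
<?php
/**
 * Hypothetical helper: maps each style ID found in a styles.xml string
 * to its display name, e.g. ['Heading1' => 'heading 1'].
 */
function styleNames(string $stylesXml): array
{
    // WordprocessingML main namespace used by styles.xml
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

    $dom = new DOMDocument();
    $dom->loadXML($stylesXml);

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', $ns);

    $names = [];
    foreach ($xpath->query('//w:style') as $style) {
        $id = $style->getAttributeNS($ns, 'styleId');
        // string() yields '' if the style has no w:name child
        $names[$id] = $xpath->evaluate('string(w:name/@w:val)', $style);
    }
    return $names;
}

// Made-up sample standing in for the contents of word/styles.xml
$stylesSample = '<w:styles xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    . '<w:style w:type="paragraph" w:styleId="Heading1"><w:name w:val="heading 1"/></w:style>'
    . '<w:style w:type="paragraph" w:styleId="Quote"><w:name w:val="Quote"/></w:style>'
    . '</w:styles>';

print_r(styleNames($stylesSample));
```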

I don't have a direct answer, but my preference is to break a complex problem like this into simpler pieces.

The approach I would use is to open it in Word (or in OpenOffice or LibreOffice) and save as HTML. Then I would prepend an XML declaration and read it with one of the many XML classes/extensions available in PHP.

[I found this question because I was Googling for a framework that would let me go through the HTML that Word generates and clean it up: turn it into legal XHTML 1.0 and remove the useless style information that Word creates, while preserving my user-generated styles, etc. That second part will require some experimentation to determine what I want to keep and what I want to discard, but I think that is well within my hobbyist capabilities.]
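As a rough illustration of the save-as-HTML route, the sketch below reads exported HTML with PHP's DOM extension and lists each paragraph with its CSS class. It uses the lenient `loadHTML()` parser rather than a strict XML parser, which is a substitution on my part, since exported HTML is often not well-formed XML. The function name `paragraphsFromWordHtml`, the sample markup, and the class names in it are assumptions for illustration only.

```php
<?php
/**
 * Hypothetical helper: returns ['class' => ..., 'text' => ...] for every
 * <p> element in an HTML string exported from a word processor.
 */
function paragraphsFromWordHtml(string $html): array
{
    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        $result[] = [
            'class' => $p->getAttribute('class'), // '' when no class is set
            'text'  => trim($p->textContent),     // text with inline tags stripped
        ];
    }
    return $result;
}

// Made-up sample standing in for a file saved as HTML
$htmlSample = '<html><body><h1>Title</h1>'
    . '<p class="MsoNormal">First paragraph.</p>'
    . '<p class="MyStyle" style="margin:0cm">Second <b>paragraph</b>.</p>'
    . '</body></html>';

print_r(paragraphsFromWordHtml($htmlSample));
```

Which classes are worth keeping and which inline `style` attributes can be dropped still needs the experimentation mentioned above.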


 