
How can I convert a docx document to html using php?

I want to be able to upload an MS Word document and export it to a page on my site.

Is there any way to do this?

//FUNCTION :: read a docx file and return the string
function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
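            // Load XML from a string (loadXML must be called on an instance)
            // Skip errors and warnings; LIBXML_NOENT is left out so that
            // entities in an uploaded file are not expanded
            $xml = new DOMDocument();
            $xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING);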
            // Return data without XML formatting tags

            // Double quotes are needed for "\n" to mean a newline
            $contents = explode("\n", strip_tags($xml->saveXML()));
            return implode('', $contents);
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}

ZipArchive and DOMDocument both ship with PHP (the zip and dom extensions), so you don't need to install/include/require any other libraries.
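The function above returns plain text only. As a rough sketch of how the same two classes could produce minimal HTML, the following wraps each top-level paragraph in a `<p>` element. The helper names `htmlFromDocumentXml` and `docxToHtml` are made up for illustration, the namespace URI is the one used by ordinary (transitional) .docx files, and paragraphs inside tables are deliberately ignored.

```php
<?php
// Sketch only: one <p> per top-level Word paragraph, no styling.
const WORD_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: converts the contents of word/document.xml to HTML.
function htmlFromDocumentXml(string $documentXml): string
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($documentXml, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', WORD_NS);

    $html = '';
    foreach ($xpath->query('//w:body/w:p') as $paragraph) {
        $text = '';
        // w:t elements hold the visible text of each run
        foreach ($xpath->query('.//w:t', $paragraph) as $textNode) {
            $text .= $textNode->textContent;
        }
        // Escape the text so document content cannot inject markup
        $html .= '<p>' . htmlspecialchars($text) . "</p>\n";
    }
    return $html;
}

// Hypothetical helper: reads word/document.xml out of the .docx archive.
function docxToHtml(string $filePath): string
{
    $zip = new ZipArchive();
    if ($zip->open($filePath) !== true) {
        return '';
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : htmlFromDocumentXml($xml);
}
```

Calling `echo docxToHtml($uploadedPath);` would then print the paragraphs; bold, lists, tables and images would each need extra handling.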

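You can use PHPDocX.

It supports almost all HTML CSS styles. In addition, you can use templates to add extra formatting to the HTML via replaceTemplateVariableByHTML.

PHPDocX's HTML methods also allow Word styles to be used directly. You can use something like this: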

$docx->embedHTML($myHTML, array('tableStyle' => 'MediumGrid3-accent5PHPDOCX'));

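if you want all tables to use the MediumGrid3-accent5 Word style. The embedHTML method and its template version (replaceTemplateVariableByHTML) preserve inheritance, which means you can use a predefined Word style and override any of its properties with CSS.

You can also extract selected parts of the HTML using "jQuery-style" selectors.

This may help you: How to convert Docx to HTML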

You can use the Print2flash library to convert a Word docx document to html. Here is a PHP excerpt from my client's site that converts a document to html:

include("const.php");
$p2fServ = new COM("Print2Flash4.Server2");
$p2fServ->DefaultProfile->DocumentType=HTML5;
$p2fServ->ConvertFile($wordfile,$htmlFile);

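It converts the document at the path given in the $wordfile variable into the html page file given by the $htmlFile variable. All formatting, hyperlinks and charts are preserved. You can get the required const.php file, along with a more complete example, from the Print2flash SDK.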

This is a workaround based on David Lin's answer above. It removes the "w:" from the docx's XML tags, leaving HTML-like tags.

function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            $xml = new DOMDocument("1.0", "utf-8");
            // LIBXML_NOENT is left out so entities in an uploaded file are not expanded
            $xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_PARSEHUGE);
            $xml->encoding = "utf-8";
            // Return data without XML formatting tags
            $output =  $xml->saveXML();
            $output = str_replace("w:","",$output);

            return $output;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}
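One caveat with `str_replace("w:", "", $output)` is that it also removes "w:" wherever it occurs in the document text. A possible variation, sketched here with a hypothetical helper name `stripWordPrefix`, restricts the replacement to element names. Attribute names such as `w:val` are left untouched in this sketch.

```php
<?php
// Sketch: drop the "w:" prefix from element names only.
// stripWordPrefix() is a made-up name, not part of the answer above.
function stripWordPrefix(string $xml): string
{
    // Matches "<w:" and "</w:", so text such as "draw:ing" is left alone.
    // Falls back to the input if preg_replace reports an error (null).
    return preg_replace('#<(/?)w:#', '<$1', $xml) ?? $xml;
}
```

For the input `<w:p><w:r><w:t>draw:ing</w:t></w:r></w:p>` the plain `str_replace` turns the text into "draing", while the pattern above keeps "draw:ing" intact.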

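If you're not averse to a REST API, you can use:

  • Apache Tika, a recognized OSS leader in text extraction
  • RawText, if you don't want to bother with configuration and want a ready-made solution, but it is not free.

Sample code for RawText:

$result = $rawText->parse($your_file);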
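For the Apache Tika option, a minimal sketch might look like the following. It assumes a Tika server is already running and reachable at `http://localhost:9998` (its default port), that `allow_url_fopen` is enabled, and that the server's `/tika` endpoint is asked for XHTML through the `Accept` header. The function name `tikaDocxToHtml` is invented for this example.

```php
<?php
// Sketch: send a .docx to a running Tika server and get XHTML back.
// Assumption: the server listens at $baseUrl; tikaDocxToHtml() is a made-up name.
function tikaDocxToHtml(string $filePath, string $baseUrl = 'http://localhost:9998'): string
{
    if (!is_readable($filePath)) {
        throw new InvalidArgumentException("Cannot read file: $filePath");
    }
    // PUT the raw file bytes and ask for HTML instead of plain text
    $context = stream_context_create([
        'http' => [
            'method'  => 'PUT',
            'header'  => "Accept: text/html\r\n",
            'content' => file_get_contents($filePath),
            'timeout' => 30,
        ],
    ]);
    $html = @file_get_contents($baseUrl . '/tika', false, $context);
    if ($html === false) {
        throw new RuntimeException('Tika request failed');
    }
    return $html;
}
```

The returned markup is whatever Tika produces, so it is usually worth sanitizing before embedding it in a page.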

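OK, I'm very late to this, but I thought I'd post it to save everyone some time. Here is some PHP code I put together that reads not only the text but also the images from a docx. It doesn't support floating images/text yet, but what I have so far is a big step forward from what has already been posted here. Note that you need to change https://sharinggodslove.uk to your own domain.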

<?php

class Docx_ws_imglnk {
    public $originalpath = '';
    public $extractedpath = '';
}

class Docx_ws_rel {
    public $Id = '';
    public $Target = '';
}

class Docx_ws_def {
    public $styleId = '';
    public $type = '';
    public $color = '000000';
}

class Docx_p_def {
    public $data = array();
    public $text = "";
}

class Docx_p_item {
    public $name = "";
    public $value = "";
    public $innerstyle = "";
    public $type = "text";
}

class Docx_reader {

    private $fileData = false;
    private $errors = array();
    public $rels = array();
    public $imglnks = array();
    public $styles = array();
    public $document = null;
    public $paragraphs = array();
    public $path = '';
    private $saveimgpath = 'docimages';

    public function __construct() {
    
    }

    private function load($file) {
        if (file_exists($file)) {
            $zip = new ZipArchive();
            $openedZip = $zip->open($file);
            if ($openedZip === true) {
            
                $this->path = $file;
            
                //read and save images
                for ( $i = 0; $i < $zip->numFiles; $i ++ ) {
                    $zip_element = $zip->statIndex( $i );
                    if ( preg_match( "([^\s]+(\.(?i)(jpg|jpeg|png|gif|bmp))$)", $zip_element['name'] ) ) {
                        $imglnk = new Docx_ws_imglnk;
                        $imglnk->originalpath = $zip_element['name'];
                        $imagename                   = explode( '/',   $zip_element['name'] );
                        $imagename                   = end( $imagename );
                        // use the declared $saveimgpath property (there is no $savepath)
                        $imglnk->extractedpath = dirname( __FILE__ ) . '/' . $this->saveimgpath . '/' . $imagename;
                
                        $putres = file_put_contents( $imglnk->extractedpath, $zip->getFromIndex( $i ));
                        $imglnk->extractedpath = str_replace('var/www/', 'https://sharinggodslove.uk/', $imglnk->extractedpath);
                        $imglnk->extractedpath = substr($imglnk->extractedpath, 1);
                    
                        array_push($this->imglnks, $imglnk);
                    }
                }
            
                //read relationships
                if (($styleIndex = $zip->locateName('word/_rels/document.xml.rels')) !== false) {
                    $stylesRels = $zip->getFromIndex($styleIndex);
                    $xml = simplexml_load_string($stylesRels);
                    $XMLTEXT = $xml->saveXML();
                    $doc = new DOMDocument();
                    $doc->loadXML($XMLTEXT);
                    foreach($doc->documentElement->childNodes as $childnode)
                    {
                        $nodename = $childnode->nodeName;
                   
                        if($childnode->hasAttributes())
                        {
                            $rel = new Docx_ws_rel;
                            for ($a = 0; $a < $childnode->attributes->count(); $a++)
                            {
                                $attrNode = $childnode->attributes->item($a);
                            
                                if (strcmp( $attrNode->nodeName, 'Id') == 0)
                                {
                                    $rel->Id = $attrNode->nodeValue;
                                }
                                if (strcmp( $attrNode->nodeName, 'Target') == 0)
                                {
                                    $rel->Target = $attrNode->nodeValue;
                                }
                            }
                            array_push($this->rels, $rel);
                        }
                    }
                }
            
                //attempt to load styles:
                if (($styleIndex = $zip->locateName('word/styles.xml')) !== false) {
                    $stylesXml = $zip->getFromIndex($styleIndex);
                    $xml = simplexml_load_string($stylesXml);
                    $XMLTEXT = $xml->saveXML();
                    $doc = new DOMDocument();
                    $doc->loadXML($XMLTEXT);
               
                    foreach($doc->documentElement->childNodes as $childnode)
                    {
                        $nodename = $childnode->nodeName;
                    
                        //get style
                        if (strcmp($nodename, "w:style") == 0)
                        {
                            $ws_def = new Docx_ws_def;
                            for ($a=0; $a < $childnode->attributes->count(); $a++ )
                            {
                                $item = $childnode->attributes->item($a);
                                //style id
                                if (strcmp($item->nodeName, "w:styleId") == 0)
                                {
                                    $ws_def->styleId = $item->nodeValue;
                                }
                            
                                //style type
                                if (strcmp($item->nodeName, "w:type") == 0)
                                {
                                    $ws_def->type = $item->nodeValue;
                                }
                            }
                            //push style to the array of styles (only for w:style nodes,
                            //so $ws_def is always defined here)
                            if (strcmp($ws_def->styleId, "") != 0 && strcmp($ws_def->type, "") != 0)
                            {
                                array_push($this->styles, $ws_def);
                            }
                        }
                    }
                }

                if (($index = $zip->locateName('word/document.xml')) !== false) {
                    $stylesDoc = $zip->getFromIndex($index);
                    $xml = simplexml_load_string($stylesDoc);
                    $XMLTEXT = $xml->saveXML();
                    $this->document = new DOMDocument();
                    $this->document->loadXML($XMLTEXT);
                }
                $zip->close();
            } else {
                switch($openedZip) {
                    case ZipArchive::ER_EXISTS:
                        $this->errors[] = 'File exists.';
                        break;
                    case ZipArchive::ER_INCONS:
                        $this->errors[] = 'Inconsistent zip file.';
                        break;
                    case ZipArchive::ER_MEMORY:
                        $this->errors[] = 'Malloc failure.';
                        break;
                    case ZipArchive::ER_NOENT:
                        $this->errors[] = 'No such file.';
                        break;
                    case ZipArchive::ER_NOZIP:
                        $this->errors[] = 'File is not a zip archive.';
                        break;
                    case ZipArchive::ER_OPEN:
                        $this->errors[] = 'Could not open file.';
                        break;
                    case ZipArchive::ER_READ:
                        $this->errors[] = 'Read error.';
                        break;
                    case ZipArchive::ER_SEEK:
                        $this->errors[] = 'Seek error.';
                        break;
                }
            }
        } else {
            $this->errors[] = 'File does not exist.';
        }
    }

    public function setFile($path) {
        $this->fileData = $this->load($path);
    }

    public function to_plain_text() {
        // load() returns nothing, so read the text from the parsed document instead
        if ($this->document) {
            return strip_tags($this->document->saveXML());
        } else {
            return false;
        }
    }

    public function processDocument() {
        $html = '';    
    
        foreach($this->document->documentElement->childNodes as $childnode)
        {
            $nodename = $childnode->nodeName;
        
            //get the body of the document
            if (strcmp($nodename, "w:body") == 0)
            {
                foreach($childnode->childNodes as $subchildnode)
                {
                    $pnodename = $subchildnode->nodeName;
                
                    //process every paragraph
                    if (strcmp($pnodename, "w:p") == 0)
                    {
                        $pdef = new Docx_p_def;
                    
                        foreach($subchildnode->childNodes as $pchildnode)
                        {
                            //process any inner children
                            if (strcmp($pchildnode->nodeName, "w:pPr") == 0)
                            {
                                foreach($pchildnode->childNodes as $prchildnode)
                                {
                                    //process paragraph style
                                    if (strcmp($prchildnode->nodeName, "w:pStyle") == 0)
                                    {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'styleId';
                                        $pitem->value = $prchildnode->attributes->getNamedItem('val')->nodeValue;
                                        array_push($pdef->data, $pitem);
                                    }
                                
                                    //process text alignment
                                    if (strcmp($prchildnode->nodeName, "w:jc") == 0)
                                    {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'align';
                                        $pitem->value = $prchildnode->attributes->getNamedItem('val')->nodeValue;
                                    
                                        if (strcmp($pitem->value, "left") == 0)
                                        {
                                            $pitem->innerstyle .= "text-align:" . $pitem->value . ";";
                                        }
                                    
                                        if (strcmp($pitem->value, "center") == 0)
                                        {
                                            $pitem->innerstyle .= "text-align:" . $pitem->value . ";";
                                        }
                                    
                                        if (strcmp($pitem->value, "right") == 0)
                                        {
                                            $pitem->innerstyle .= "text-align:" . $pitem->value . ";";
                                        }
                                    
                                        //"both" is Word's value for justified text
                                        if (strcmp($pitem->value, "both") == 0)
                                        {
                                            $pitem->innerstyle .= "text-align:justify;";
                                        }
                                    
                                        array_push($pdef->data, $pitem);
                                    }
                                
                                    //process drawing
                                    if (strcmp($prchildnode->nodeName, "w:drawing") == 0)
                                    {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'drawing';
                                        $pitem->value = '';
                                        $pitem->type = 'graphic';
                                    
                                        $extents = $prchildnode->getElementsByTagName('extent')[0];
                                        $cx = $extents->attributes->getNamedItem('cx')->nodeValue;
                                        $cy = $extents->attributes->getNamedItem('cy')->nodeValue;
                                        $pcx = (int)$cx / 9525;
                                        $pcy = (int)$cy / 9525;
                                    
                                        $pitem->innerstyle .= "width:" . $pcx . "px;";
                                        $pitem->innerstyle .= "height:" . $pcy . "px;";
                                    
                                        $blip = $prchildnode->getElementsByTagName('blip')[0];
                                        $pitem->value = $blip->attributes->getNamedItem('embed')->nodeValue;
                                 
                                        array_push($pdef->data, $pitem);
                                    }
                                
                                    //process spacing
                                    if (strcmp($prchildnode->nodeName, "w:spacing") == 0)
                                    {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'paragraphSpacing';
                                        $bval = $prchildnode->attributes->getNamedItem('before')->nodeValue;
                                        if (strcmp($bval, '') == 0)
                                            $bval = 0;
                                        $pitem->innerstyle .= "padding-top:" . $bval . "px;";
                                        $aval = $prchildnode->attributes->getNamedItem('after')->nodeValue;
                                        if (strcmp($aval, '') == 0)
                                            $aval = 0;
                                        $pitem->innerstyle .= "padding-bottom:" . $aval . "px;";
                                    
                                        array_push($pdef->data, $pitem);
                                    }
                                }
                            }
                        
                        
                            if (strcmp($pchildnode->nodeName, "w:r") == 0)
                            {
                                foreach($pchildnode->childNodes as $rchildnode)
                                {
                                    //process text
                                    if (strcmp($rchildnode->nodeName, "w:t") == 0)
                                    {
                                        $pdef->text .= $rchildnode->nodeValue;
                                        if (count($pdef->data) == 0)
                                        {
                                            $pitem = new Docx_p_item;
                                            $pitem->name = 'styleId';
                                            $pitem->value = '';
                                            array_push($pdef->data, $pitem);
                                        }
                                    }
                                
                                    if (strcmp($rchildnode->nodeName, "w:rPr") == 0)
                                    {
                                        foreach($rchildnode->childNodes as $rPrchildnode)
                                        {
                                            if (strcmp($rPrchildnode->nodeName, "w:b") == 0 )
                                            {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textBold';
                                                $pitem->value = '';
                                                $pitem->innerstyle .= "font-weight: bold;";
                                                array_push($pdef->data, $pitem);
                                            }
                                            if (strcmp($rPrchildnode->nodeName, "w:i") == 0 )
                                            {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textItalic';
                                                $pitem->value = '';
                                                $pitem->innerstyle .= "font-style: italic;";
                                                array_push($pdef->data, $pitem);
                                            }
                                            if (strcmp($rPrchildnode->nodeName, "w:u") == 0 )
                                            {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textUnderline';
                                                $pitem->value = '';
                                                $pitem->innerstyle .= "text-decoration: underline;";
                                                array_push($pdef->data, $pitem);
                                            }
                                            if (strcmp($rPrchildnode->nodeName, "w:sz") == 0 )
                                            {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textSize';
                                            
                                                $sz = $rPrchildnode->attributes->getNamedItem('val')->nodeValue;
                                                if ($sz == '')
                                                {
                                                    $sz=0;
                                                }
                                                $pitem->value = $sz;
                                                array_push($pdef->data, $pitem);
                                            }
                                        }
                                    }
                                }
                            }
                        }
                  
                       array_push($this->paragraphs, $pdef);
                    }
                }
            }
        } 
    
    }

    public function to_html()
    {
        $html = '';
    
        foreach($this->paragraphs as $para)
        {
            $styleselect = null;
            $type = 'text';
            $content = $para->text;
            $sz = 0;
            $extent = '';
            $embedid = '';
        
            $pinnerstylesid = '';
            $pinnerstylesunderline = '';
            $pinnerstylessz = '';         
           
        
            if (count($para->data) > 0)
            {
                foreach($para->data as $node)
                {
                    if (strcmp($node->name, "styleId") == 0)
                    {
                        $type = $node->type;
                        $pinnerstylesid = $node->innerstyle;
                       
                        foreach($this->styles as $style)
                        {
                            if (strcmp ($node->value, $style->styleId) == 0)
                            {
                                $styleselect = $style;
                            }
                        }
                    }
                
                    if (strcmp($node->name, "align") == 0)
                    {
                        $pinnerstylesid .= $node->innerstyle. ";";
                    }
                
                    if (strcmp($node->name, "drawing") == 0)
                    {
                        $type = $node->type;
                        $extent = $node->innerstyle;
                        $embedid = $node->value;
                    }
                
                    if (strcmp($node->name, "textSize") == 0)
                    {
                        $sz = $node->value;
                    }
                
                    if (strcmp($node->name, "textUnderline") == 0)
                    {
                       $pinnerstylesunderline = $node->innerstyle;
                    }
                }
            }
     
           if (strcmp($type, 'text') == 0)
           {
                //echo "has valid para";
                //echo "<br>";
                if ($styleselect != null)
                {
                    //echo "has valid style";
                    //echo "<br>";
                
                    if (strcmp($styleselect->color, '') != 0)
                    {
                       $pinnerstylesid .= "color:#" . $styleselect->color. ";";
                    }
                }
            
                if ($sz != 0)
                {
                    //w:sz is stored in half-points
                    $pinnerstylesid .= 'font-size:' . ((int)$sz / 2) . 'pt;';
                    //echo "sz<br>";
                }
            
                $span =  "<p style='". $pinnerstylesid . $pinnerstylesunderline ."'>";
                //escape the text so document content cannot inject markup
                $span .= htmlspecialchars($content);
                $span .= "</p>";
                //echo $span;
                $html .= $span;
            }
        
            if (strcmp($type, 'graphic') == 0)
            {
                $imglnk = '';
            
                foreach($this->rels as $rel)
                {
                    if(strcmp($embedid, '') != 0 && strcmp($rel->Id, $embedid) == 0)
                    {
                        foreach($this->imglnks as $imgpathdef)
                        {
                            //strpos returns false when not found, and false >= 0 is true
                            if (strpos($imgpathdef->extractedpath, $rel->Target) !== false)
                            {
                                $imglnk = $imgpathdef->extractedpath;
                                //echo "has img link<br>";
                                //echo $imglnk . "<br>";
                            }
                        }
                    }
                }
            
                if ($styleselect != null)
                {
                    //echo "has valid style";
                    //echo "<br>";
                
                    if (strcmp($styleselect->color, '') != 0)
                    {
                        $pinnerstylesid .= "color:#" . $styleselect->color. ";";
                    }
                }
            
                if ($sz != 0)
                {
                    //w:sz is stored in half-points
                    $pinnerstylesid .= 'font-size:' . ((int)$sz / 2) . 'pt;';
                    //echo "sz<br>";
                }
            
                $span =  "<p style='". $pinnerstylesid . $pinnerstylesunderline ."'>";
                $span .= "<img style='". $extent ."' alt='image coming soon' src ='". $imglnk ."'/>";
                $span .= "</p>";
                //echo $span;
                $html .= $span;
            }
           
        }
        return $html;
    }

    public function get_errors() {
        return $this->errors;
    }

    private function getStyles() {
    
    }

}

function getDocX($path)
{
    //echo $path;
    $doc = new Docx_reader();
    $doc->setFile($path);

    if(!$doc->get_errors()) {
        $doc->processDocument();
        $html = $doc->to_html();
        echo $html;
    }
    return "";
}
?>
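Two unit conversions matter when mapping Word markup to CSS, and the class above relies on the first when it divides the `cx` and `cy` extents by 9525. A small sketch of both follows; the helper names are hypothetical and 96 DPI is assumed for the pixel conversion.

```php
<?php
// Sketch of unit conversions for docx-to-HTML output.
// Both helper names are made up for illustration.

// DrawingML sizes are in EMU: 914400 per inch, so 9525 per pixel at 96 DPI.
function emuToPixels(int $emu): int
{
    return (int) round($emu / 9525);
}

// w:sz stores the font size in half-points, so a value of 24 means 12pt.
function halfPointsToPoints(int $halfPoints): float
{
    return $halfPoints / 2;
}
```

For example, an image extent of 952500 EMU corresponds to 100 pixels, and a `w:sz` of 21 corresponds to 10.5pt.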

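A more common approach nowadays is to use the Composer package phpoffice/phpword, a pure PHP library that can convert Word documents to HTML and back without relying on any external tools.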
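A minimal sketch of that approach, assuming the package was installed with `composer require phpoffice/phpword` and that `vendor/autoload.php` sits next to the script. The wrapper name `convertDocxWithPhpWord` is made up for this example; `IOFactory::load` and `IOFactory::createWriter` are the library's reader and writer entry points.

```php
<?php
// Sketch: convert a .docx to an HTML file with phpoffice/phpword.
// Assumption: Composer's autoloader lives next to this script.
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require_once __DIR__ . '/vendor/autoload.php';
}

// convertDocxWithPhpWord() is a hypothetical wrapper name.
function convertDocxWithPhpWord(string $docxPath, string $htmlPath): void
{
    if (!class_exists(\PhpOffice\PhpWord\IOFactory::class)) {
        throw new RuntimeException('phpoffice/phpword is not installed');
    }
    // 'Word2007' is the reader used for .docx files
    $phpWord = \PhpOffice\PhpWord\IOFactory::load($docxPath, 'Word2007');
    // The 'HTML' writer serializes the loaded document as an HTML page
    $writer = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
    $writer->save($htmlPath);
}
```

A call such as `convertDocxWithPhpWord('upload.docx', 'upload.html');` writes a complete HTML page; how faithfully styles and images survive depends on the document.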

