How to extract bold text from docx

Question

I want to extract bold text from a Word .docx file using PHP. I copy the .docx to a .zip file and extract it, then read document.xml. In the XML, the presence of <w:b/> indicates that the text is bold.

sample.docx:

Create zip and extract

<?php
$docname = "sample";
// A .docx file is a zip archive: copy it to a .zip name, then extract it.
echo copy($docname . ".docx", $docname . ".zip");

$zip = new ZipArchive;
if ($zip->open($docname . ".zip") === TRUE) {
    $zip->extractTo($docname . "/");
    $zip->close();
} else {
    echo 'failed';
}
?>
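
As a side note, the copy-and-extract step can probably be skipped: ZipArchive opens a .docx directly, and getFromName() returns a single archive entry as a string. A minimal sketch, assuming the zip extension is loaded; readDocumentXml() is a hypothetical helper name, not something from the original post:

```php
<?php
// Hypothetical helper (not from the original post): returns the contents of
// word/document.xml from a .docx file, or null if the archive or the entry
// cannot be read.
function readDocumentXml(string $docxPath): ?string
{
    $zip = new ZipArchive();
    // open() returns true on success and an integer error code otherwise.
    if ($zip->open($docxPath) !== true) {
        return null;
    }
    // getFromName() returns the entry contents, or false if it is missing.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    return $xml === false ? null : $xml;
}
```

Calling readDocumentXml('sample.docx') would then give the same XML that the extracted sample/word/document.xml contains, without leaving a .zip copy and an extracted folder on disk.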

Extract bold words to an array (reference: search-bold):

<?php
// https://www.jackreichert.com/2012/11/how-to-convert-docx-to-html/
$folder = "sample";
$xmlFile = $folder . "/word/document.xml";
$reader = new XMLReader;
$reader->open($xmlFile);
$bold_words = [];
while ($reader->read()) {
    // For each paragraph, parse its XML with a second reader.
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'w:p') {
        $paragraph = new XMLReader;
        $p = $reader->readOuterXML();
        $paragraph->xml($p);
        while ($paragraph->read()) {
            if ($paragraph->nodeType == XMLReader::ELEMENT && $paragraph->name === 'w:r') {
                $node = trim($paragraph->readInnerXML());
                // strstr() searches for the first occurrence of a string inside another string
                if (strstr($node, '<w:b/>')) {
                    $bold_words[] = $node;
                }
            }
        }
    }
}
echo "<pre>";
var_dump($bold_words);
echo "</pre>";
?>

The result shows:

array(1) {
          [0]=>string(364) "Title content"
         }

There should be 5 bold words in the result, but there is only one. I have checked document.xml, and <w:b/> appears only once.

How can I list the bold-formatted text in document.xml?

Answer 1

Your code works as expected. There might be a problem with your XML. Check the code, the XML file, and the result below:

<?php
// https://www.jackreichert.com/2012/11/how-to-convert-docx-to-html/
$folder = "sample";
$xmlFile = "document.xml";
$reader = new XMLReader;
$reader->open($xmlFile);
$bold_words = [];
while ($reader->read()) {
    // For each paragraph, parse its XML with a second reader.
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'w:p') {
        $paragraph = new XMLReader;
        $p = $reader->readOuterXML();
        $paragraph->xml($p);
        while ($paragraph->read()) {
            if ($paragraph->nodeType == XMLReader::ELEMENT && $paragraph->name === 'w:r') {
                $node = trim($paragraph->readInnerXML());
                // strstr() searches for the first occurrence of a string inside another string
                if (strstr($node, '<w:b/>')) {
                    $bold_words[] = $node;
                }
            }
        }
    }
}
echo "<pre>";
var_dump($bold_words);
echo "</pre>";
?>

XML File:

<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:w10="urn:schemas-microsoft-com:office:word"
            xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
            xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
            xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
            xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office"
            xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" mc:Ignorable="w14">
    <w:background w:color="FFFFFF"/>
    <w:body>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Body A"/>
            </w:pPr>
        </w:p>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Title"/>
                <w:rPr>
                    <w:sz w:val="24"/>
                    <w:szCs w:val="24"/>
                </w:rPr>
            </w:pPr>
            <w:r>
                <w:rPr>
                    <w:b/>
                    <w:sz w:val="24"/>
                    <w:szCs w:val="24"/>
                    <w:rtl w:val="0"/>
                    <w:lang w:val="it-IT"/>
                </w:rPr>
                <w:t>Hello World</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Default"/>
                <w:spacing w:line="280" w:lineRule="atLeast"/>
                <w:rPr>
                    <w:sz w:val="24"/>
                    <w:szCs w:val="24"/>
                </w:rPr>
            </w:pPr>
        </w:p>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Default"/>
                <w:spacing w:line="280" w:lineRule="atLeast"/>
            </w:pPr>
            <w:r>
                <w:rPr>
                    <w:b/>
                    <w:sz w:val="24"/>
                    <w:szCs w:val="24"/>
                    <w:rtl w:val="0"/>
                    <w:lang w:val="en-US"/>
                </w:rPr>
                <w:t xml:space="preserve">This is a </w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:b/>
                    <w:b w:val="1"/>
                    <w:bCs w:val="1"/>
                    <w:sz w:val="24"/>
                    <w:szCs w:val="24"/>
                    <w:rtl w:val="0"/>
                    <w:lang w:val="en-US"/>
                </w:rPr>
                <w:t>very short</w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:sz w:val="24"/>
                    <w:szCs w:val="24"/>
                    <w:rtl w:val="0"/>
                    <w:lang w:val="en-US"/>
                </w:rPr>
                <w:t xml:space="preserve"> paragraph. It only contains </w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:i w:val="1"/>
                    <w:iCs w:val="1"/>
                    <w:sz w:val="24"/>
                    <w:szCs w:val="24"/>
                    <w:rtl w:val="0"/>
                    <w:lang w:val="en-US"/>
                </w:rPr>
                <w:t>three</w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:sz w:val="24"/>
                    <w:szCs w:val="24"/>
                    <w:rtl w:val="0"/>
                    <w:lang w:val="en-US"/>
                </w:rPr>
                <w:t xml:space="preserve"> sentences. This is the </w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:sz w:val="24"/>
                    <w:szCs w:val="24"/>
                    <w:u w:val="single"/>
                    <w:rtl w:val="0"/>
                    <w:lang w:val="en-US"/>
                </w:rPr>
                <w:t>third sentence</w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:sz w:val="24"/>
                    <w:szCs w:val="24"/>
                    <w:rtl w:val="0"/>
                    <w:lang w:val="en-US"/>
                </w:rPr>
                <w:t>.</w:t>
            </w:r>
        </w:p>
        <w:sectPr>
            <w:headerReference w:type="default" r:id="rId4"/>
            <w:footerReference w:type="default" r:id="rId5"/>
            <w:pgSz w:w="12240" w:h="15840" w:orient="portrait"/>
            <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="864"/>
            <w:bidi w:val="0"/>
        </w:sectPr>
    </w:body>
</w:document>

Result:

array(3) {
  [0]=>
  string(403) "
                Hello World"
  [1]=>
  string(423) "
                This is a "
  [2]=>
  string(478) "
                very short"
}
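
Note that each entry in the result above is the run's inner XML (hence string lengths like 403), and that the literal search for '<w:b/>' would miss a run written as <w:b w:val="1"/>. A more tolerant variant could query the document with XPath and collect only the <w:t> text. This is a minimal sketch rather than a definitive implementation: extractBoldText() and the W_NS constant are hypothetical names, and it looks only at formatting set directly on the run, so bold inherited from a paragraph or character style in word/styles.xml would still not be found.

```php
<?php
// WordprocessingML namespace, as declared on <w:document> in the sample XML.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: returns the text of every run whose run properties
// contain a <w:b> element that is not switched off (w:val of 0/false/off).
function extractBoldText(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return [];
    }

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('w', W_NS);

    // Matches <w:b/> and <w:b w:val="1"/>, but skips <w:b w:val="0"/>.
    $runs = $xpath->query(
        "//w:r[w:rPr/w:b[not(@w:val='0' or @w:val='false' or @w:val='off')]]"
    );

    $bold = [];
    foreach ($runs as $run) {
        // Concatenate the text nodes of this run only.
        $text = '';
        foreach ($xpath->query('w:t', $run) as $t) {
            $text .= $t->textContent;
        }
        if ($text !== '') {
            $bold[] = $text;
        }
    }

    return $bold;
}
```

Called as extractBoldText(file_get_contents('document.xml')) on the sample file above, this should return the same three runs ("Hello World", "This is a ", "very short") as plain text instead of raw XML. If a document shows bold words that this still misses, the style definitions in word/styles.xml are the next place to check.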

solution1
0 2022-08-18 08:16:42
