I want to extract bold text from a Word .docx file using PHP. I copy the .docx file to a .zip file and extract it, then read document.xml. In the XML, the presence of <w:b/>
shows that the text is bold.
sample.docx:
Create the zip file and extract it:
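<?php
$docname = "sample";
// Copy the .docx to a .zip file; copy() returns true on success, so this prints 1.
echo copy($docname . ".docx", $docname . ".zip");
$zip = new ZipArchive;
if ($zip->open($docname . ".zip") === TRUE) {
    // Unpack the archive into the "sample/" folder.
    $zip->extractTo($docname . "/");
    $zip->close();
} else {
    echo 'failed';
}
?>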
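As an aside, the copy to .zip and the extraction to disk are not strictly required: ZipArchive opens an archive regardless of its file extension, and ZipArchive::getFromName() returns a single entry as a string. The following is a minimal sketch under that assumption; the helper name readDocumentXml is hypothetical and not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): returns the raw
// contents of word/document.xml, or null if the archive or the entry
// cannot be read.
function readDocumentXml(string $docxPath): ?string
{
    $zip = new ZipArchive;
    // open() returns true on success and an integer error code otherwise.
    if ($zip->open($docxPath) !== true) {
        return null;
    }
    // getFromName() returns the entry contents, or false if it is missing.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? null : $xml;
}
```

With this helper, $reader->xml(readDocumentXml('sample.docx')) could stand in for $reader->open($xmlFile), provided the return value is checked for null first.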
Extract the bold words to an array (reference: search-bold):
<?php
//https://www.jackreichert.com/2012/11/how-to-convert-docx-to-html/
$folder = "sample";
$xmlFile = $folder . "/word/document.xml";
$reader = new XMLReader;
$reader->open($xmlFile);
$bold_words = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'w:p') {
        $paragraph = new XMLReader;
        $p = $reader->readOuterXML();
        $paragraph->xml($p);
        while ($paragraph->read()) {
            if ($paragraph->nodeType == XMLReader::ELEMENT && $paragraph->name === 'w:r') {
                $node = trim($paragraph->readInnerXML());
                //strstr() function searches for the first occurrence of a string inside another string
                if (strstr($node, '<w:b/>')) {
                    $bold_words[] = $node;
                }
            }
        }
    }
}
echo "<pre>";
var_dump($bold_words);
echo "</pre>";
?>
The result shows:
array(1) {
[0]=>string(364) "Title content"
}
There should be 5 bold words in the result, but there is only one. I have checked document.xml, and <w:b/>
appears only once.
How can I list the bold-formatted text in document.xml?
Your code works as expected. There might be a problem with your XML. Check the code block below and its result:
<?php
//https://www.jackreichert.com/2012/11/how-to-convert-docx-to-html/
$folder = "sample";
$xmlFile = "document.xml";
$reader = new XMLReader;
$reader->open($xmlFile);
$bold_words = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'w:p') {
        $paragraph = new XMLReader;
        $p = $reader->readOuterXML();
        $paragraph->xml($p);
        while ($paragraph->read()) {
            if ($paragraph->nodeType == XMLReader::ELEMENT && $paragraph->name === 'w:r') {
                $node = trim($paragraph->readInnerXML());
                //strstr() function searches for the first occurrence of a string inside another string
                if (strstr($node, '<w:b/>')) {
                    $bold_words[] = $node;
                }
            }
        }
    }
}
echo "<pre>";
var_dump($bold_words);
echo "</pre>";
?>
XML File:
<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" mc:Ignorable="w14">
<w:background w:color="FFFFFF"/>
<w:body>
<w:p>
<w:pPr>
<w:pStyle w:val="Body A"/>
</w:pPr>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="Title"/>
<w:rPr>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:b/>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:rtl w:val="0"/>
<w:lang w:val="it-IT"/>
</w:rPr>
<w:t>Hello World</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="Default"/>
<w:spacing w:line="280" w:lineRule="atLeast"/>
<w:rPr>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
</w:rPr>
</w:pPr>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="Default"/>
<w:spacing w:line="280" w:lineRule="atLeast"/>
</w:pPr>
<w:r>
<w:rPr>
<w:b/>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:rtl w:val="0"/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:t xml:space="preserve">This is a </w:t>
</w:r>
<w:r>
<w:rPr>
<w:b/>
<w:b w:val="1"/>
<w:bCs w:val="1"/>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:rtl w:val="0"/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:t>very short</w:t>
</w:r>
<w:r>
<w:rPr>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:rtl w:val="0"/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:t xml:space="preserve"> paragraph. It only contains </w:t>
</w:r>
<w:r>
<w:rPr>
<w:i w:val="1"/>
<w:iCs w:val="1"/>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:rtl w:val="0"/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:t>three</w:t>
</w:r>
<w:r>
<w:rPr>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:rtl w:val="0"/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:t xml:space="preserve"> sentences. This is the </w:t>
</w:r>
<w:r>
<w:rPr>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:u w:val="single"/>
<w:rtl w:val="0"/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:t>third sentence</w:t>
</w:r>
<w:r>
<w:rPr>
<w:sz w:val="24"/>
<w:szCs w:val="24"/>
<w:rtl w:val="0"/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:t>.</w:t>
</w:r>
</w:p>
<w:sectPr>
<w:headerReference w:type="default" r:id="rId4"/>
<w:footerReference w:type="default" r:id="rId5"/>
<w:pgSz w:w="12240" w:h="15840" w:orient="portrait"/>
<w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="864"/>
<w:bidi w:val="0"/>
</w:sectPr>
</w:body>
</w:document>
Result:
array(3) {
[0]=>
string(403) "
Hello World"
[1]=>
string(423) "
This is a "
[2]=>
string(478) "
very short"
}
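The asker's own document.xml is not shown, so the following is only a guess at why the literal search finds a single match. WordprocessingML can also write bold with an explicit value, as in the <w:b w:val="1"/> element in the XML above, and the string '<w:b/>' does not match that form. Bold can also be inherited from a paragraph or character style, in which case the run carries no w:b element at all. A minimal sketch that handles the first case with DOMXPath instead of a string search might look like this; the function name extractBoldRuns is hypothetical, and bold that comes from a style is deliberately not detected.

```php
<?php
// Hypothetical helper (not from the original post): returns the text of
// every run whose own run properties switch bold on. Bold inherited from
// a style in styles.xml is NOT detected by this sketch.
function extractBoldRuns(string $documentXml): array
{
    $dom = new DOMDocument;
    if (!$dom->loadXML($documentXml)) {
        return [];
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    // A w:b element means "bold on" unless its w:val is 0, false or off.
    $query = "//w:r[w:rPr/w:b[not(@w:val) or "
           . "(@w:val != '0' and @w:val != 'false' and @w:val != 'off')]]";

    $bold = [];
    foreach ($xpath->query($query) as $run) {
        // Join the text of all w:t children of the run.
        $text = '';
        foreach ($xpath->query('w:t', $run) as $t) {
            $text .= $t->textContent;
        }
        $bold[] = $text;
    }
    return $bold;
}
```

Unlike readInnerXML(), this returns only the text of each run, without the surrounding markup. That markup is why var_dump reports string lengths such as 403 for "Hello World" in the result above: the tags are part of the string but are not displayed by the browser.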