[英]reading MS word file in php
I read MS word files by first converting it to zip and then getting its XML. 我先读取MS Word文件,方法是将其转换为zip,然后获取其XML。
But it removes the newline characters and it is bothering me. 但是它删除了换行符,这很困扰我。 what should I do? 我该怎么办?
I use this code: 我使用以下代码:
function get_docx_content($filename) {
//Check for extension
$ext = end(explode('.', $filename));
//if its docx file
if($ext == 'docx')
$dataFile = "word/document.xml";
//else it must be odt file
else
$dataFile = "content.xml";
//Create a new ZIP archive object
$zip = new ZipArchive;
// Open the archive file
if (true === $zip->open($filename)) {
// If successful, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// Index found! Now read it to a string
$text = $zip->getFromIndex($index);
// Load XML from a string
// Ignore errors and warnings
$xml = DOMDocument::loadXML($text, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
// Remove XML formatting tags and return the text
return strip_tags($xml->saveXML());
}
//Close the archive file
$zip->close();
}
}
Try This if it helps you. 如果可以,请尝试使用此功能。
<?php
/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/
function parseWord($userDoc)
{
$fileHandle = fopen($userDoc, "r");
$line = @fread($fileHandle, filesize($userDoc));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
return $outtext;
}
$userDoc = "cv.doc";
$text = parseWord($userDoc);
echo $text;
?>
If you want more for study it then look into 如果您想学习更多,请查看
http://www.blogs.zeenor.com/it/read-ms-word-docx-ms-word-2007-file-document-using-php.html http://www.blogs.zeenor.com/it/read-ms-word-docx-ms-word-2007-file-document-using-php.html
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.