
Reading DOC file in PHP

I'm trying to read .doc and .docx files in PHP. Everything works fine, but on the last line I get garbage characters. Please help me. Here is the code, which was written by someone else.

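function parseWord($userDoc)
{
    // Open in binary mode: a .doc file is not plain text.
    $fileHandle = fopen($userDoc, "rb");
    $line = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    // Split on carriage returns, which this code treats as paragraph ends.
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty chunks and chunks that contain NUL bytes (binary data).
        $pos = strpos($thisline, chr(0x00));
        if ($pos === false && strlen($thisline) > 0) {
            $outtext .= $thisline . " ";
        }
    }
    // Drop every character outside a small ASCII whitelist.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}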

$userDoc = "k.doc";

Here is a screenshot.

You can read .docx files in PHP this way, but not .doc files. Here is the code to read .docx files:

function read_file_docx($filename){

    $striped_content = '';
    $content = '';

    if(!$filename || !file_exists($filename)) return false;

    $zip = zip_open($filename);

    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        // Only the main document part is needed; skip other entries before opening them.
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }// end while

    zip_close($zip);

    //echo $content;
    //echo "<hr>";
    //file_put_contents('1.xml', $content);

    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = strip_tags($content);

    return $striped_content;
}
$filename = "filepath";// or /var/www/html/file.docx

$content = read_file_docx($filename);
if($content !== false) {

    echo nl2br($content);
}
else {
    echo 'Couldn\'t read the file. Please check that file.';
}
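
The procedural zip_open() / zip_read() functions used above are deprecated as of PHP 8.0, so on a newer PHP the same idea can be sketched with the ZipArchive class. This is only a minimal sketch: readDocxText() is a hypothetical name introduced here, and the tag handling is copied from the function above rather than being a full WordprocessingML parser.

```php
<?php
// Sketch: read the main document part of a .docx with ZipArchive.
// readDocxText() is a hypothetical helper name, not part of any library.
function readDocxText(string $filename)
{
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false; // missing file or not a readable ZIP archive
    }

    // word/document.xml holds the body text of a .docx file.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }

    // Same replacements as above: table cell ends become spaces,
    // paragraph ends become line breaks, then all tags are removed.
    $xml = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $xml);
    $xml = str_replace('</w:r></w:p>', "\r\n", $xml);
    return strip_tags($xml);
}
```

It is called the same way as read_file_docx(): it returns the extracted text, or false when the archive or the document part cannot be read.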

DOC files are not plain text.
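
One way to see this is to look at the first bytes of the file. Legacy .doc files are OLE2 compound files and start with an 8-byte binary signature, while .docx files are ZIP archives and start with "PK". The sketch below only classifies the container, so other OLE2 or ZIP based formats (such as .xls or .xlsx) will match too; detectWordFormat() is a hypothetical helper name.

```php
<?php
// Sketch: classify a file by its leading bytes.
// detectWordFormat() is a hypothetical helper introduced for illustration.
function detectWordFormat(string $filename): string
{
    $fh = @fopen($filename, 'rb');
    if ($fh === false) {
        return 'unreadable';
    }
    $magic = (string) fread($fh, 8);
    fclose($fh);

    // OLE2 compound file signature, used by legacy .doc files.
    if ($magic === "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1") {
        return 'ole2';
    }
    // ZIP local file header signature, used by .docx files.
    if (substr($magic, 0, 4) === "PK\x03\x04") {
        return 'zip';
    }
    return 'unknown';
}
```

A result of 'ole2' suggests using a .doc reader or an external converter, and 'zip' suggests a .docx reader.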

Try a library such as PHPWord (old CodePlex site).

N.B.: This answer has been updated multiple times as PHPWord has changed hosting and functionality.

I am using this function and it works well for me :) Try it:

function read_doc_file($filename) {
    if(file_exists($filename))
    {
        // Open in binary mode; the file is not plain text
        if(($fh = fopen($filename, 'rb')) !== false )
        {
            $headers = fread($fh, 0xA00);

            // 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n1 = ( ord($headers[0x21C]) - 1 );

            // 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );

            // 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );

            // 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );

            // Total length of text in the document
            $textLength = ($n1 + $n2 + $n3 + $n4);

            $extracted_plaintext = fread($fh, $textLength);
            fclose($fh);

            // simple print character stream without new lines
            //echo $extracted_plaintext;

            // if you want to see your paragraphs in a new line, do this
            // (need more spacing after each paragraph? use another nl2br)
            return nl2br($extracted_plaintext);
        }
    }
    // File missing or could not be opened
    return false;
}

Decoding in pure PHP never worked for me, so here is my solution: http://wvware.sourceforge.net/

Install the package:

sudo apt-get install wv elinks

Use it in PHP:

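// Replace only the trailing .doc extension to get the output path
$output = preg_replace('/\.doc$/i', '.txt', $filename);
// Escape both paths so spaces or shell metacharacters in the name are safe
shell_exec('/usr/bin/wvText ' . escapeshellarg($filename) . ' ' . escapeshellarg($output));
$text = file_get_contents($output);
# Convert to UTF-8 if needed
if(!mb_detect_encoding($text, 'UTF-8', true))
{
    // utf8_encode() is deprecated as of PHP 8.2; this call is its equivalent
    $text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}
unlink($output);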

I also used it, but for accents (and single quotes like ') it would output a garbled character instead, so my PDO MySQL insert didn't like it. I finally figured it out by adding:

mb_convert_encoding($extracted_plaintext,'UTF-8');

So the final version should read:

function getRawWordText($filename) {
    if(file_exists($filename)) {
        if(($fh = fopen($filename, 'r')) !== false ) {
            $headers = fread($fh, 0xA00);
            $n1 = ( ord($headers[0x21C]) - 1 );// 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );// 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );// 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );// 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $textLength = ($n1 + $n2 + $n3 + $n4);// Total length of text in the document
            $extracted_plaintext = fread($fh, $textLength);
            $extracted_plaintext = mb_convert_encoding($extracted_plaintext,'UTF-8');
             // if you want to see your paragraphs in a new line, do this
             // return nl2br($extracted_plaintext);
             return ($extracted_plaintext);
        } else {
            return false;
        }
    } else {
        return false;
    }  
}

This works fine for reading Word .doc files into a utf8_general_ci MySQL database :)
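
When no source encoding is passed, mb_convert_encoding() assumes the input is already in the internal encoding, so bytes that are not valid there are substituted rather than translated. If accented characters still come out wrong, naming the source encoding explicitly may help. The sketch below assumes the extracted bytes are Windows-1252, which is a guess that will not hold for every .doc file (some store text as UTF-16LE); toUtf8() is a hypothetical helper name.

```php
<?php
// Sketch: convert extracted bytes to UTF-8 with an explicit source encoding.
// Assumption: the text is Windows-1252 unless the caller says otherwise.
// toUtf8() is a hypothetical helper introduced for illustration.
function toUtf8(string $bytes, string $sourceEncoding = 'Windows-1252'): string
{
    return mb_convert_encoding($bytes, 'UTF-8', $sourceEncoding);
}
```

For example, toUtf8($extracted_plaintext) could replace the single-argument conversion above, and toUtf8($extracted_plaintext, 'UTF-16LE') covers the two-bytes-per-character case.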

Hope this helps someone else.

I'm using soffice to convert the .doc to .txt and then reading the converted .txt file:

soffice --convert-to txt test.doc
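
Calling the converter from PHP could look roughly like the sketch below. It assumes soffice is on the PATH and that the web server user is allowed to run it; buildSofficeCommand() and convertDocToText() are hypothetical helper names, and --headless and --outdir are added so the conversion runs without a GUI and writes to a known directory.

```php
<?php
// Sketch: run a headless LibreOffice conversion and read the result.
// Both function names are hypothetical; 'soffice' is assumed to be on the PATH.
function buildSofficeCommand(string $docPath, string $outDir): string
{
    // escapeshellarg() keeps spaces and shell metacharacters in paths safe.
    return 'soffice --headless --convert-to txt --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

function convertDocToText(string $docPath, string $outDir)
{
    exec(buildSofficeCommand($docPath, $outDir), $output, $exitCode);
    if ($exitCode !== 0) {
        return false;
    }
    // The converted file keeps the base name and gets a .txt extension.
    $txtPath = $outDir . '/' . pathinfo($docPath, PATHINFO_FILENAME) . '.txt';
    return is_file($txtPath) ? file_get_contents($txtPath) : false;
}
```

A call such as convertDocToText('test.doc', sys_get_temp_dir()) returns the text, or false if the conversion failed.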

You can see more here.
