Is there any way to format a docx file converted to text in PHP?

Hello, I want to nicely format the text I get when I convert a doc or docx file in PHP. The class below is what I use to convert a docx file to text.

class DocxConversion{
    private $filename;

    public function __construct($filePath) {
        $this->filename = $filePath;
    }

    private function read_doc() {
        $fileHandle = fopen($this->filename, "r");
        $line = @fread($fileHandle, filesize($this->filename));   
        $lines = explode(chr(0x0D),$line);
        $outtext = "";
        foreach($lines as $thisline)
          {
            $pos = strpos($thisline, chr(0x00));
            if (($pos !== FALSE)||(strlen($thisline)==0))
              {
              } else {
                $outtext .= $thisline." ";
              }
          }
         $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
        return $outtext;
    }

    private function read_docx(){

        $striped_content = '';
        $content = '';

        // The procedural zip_* functions are deprecated as of PHP 8.0, so use ZipArchive.
        $zip = new ZipArchive;

        if ($zip->open($this->filename) !== true) return false;

        // word/document.xml holds the main document body.
        $content = $zip->getFromName("word/document.xml");

        $zip->close();

        if ($content === false) return false;

        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $striped_content = strip_tags($content);

        return $striped_content;
    }

 /************************excel sheet************************************/

function xlsx_to_text($input_file){
    $xml_filename = "xl/sharedStrings.xml"; //content file name
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
            // loadXML() is an instance method; calling it statically fails on PHP 8.
            $xml_handle = new DOMDocument();
            $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text = strip_tags($xml_handle->saveXML());
        }else{
            $output_text .="";
        }
        $zip_handle->close();
    }else{
    $output_text .="";
    }
    return $output_text;
}

/*************************power point files*****************************/
function pptx_to_text($input_file){
    $zip_handle = new ZipArchive;
    $output_text = "";
    if(true === $zip_handle->open($input_file)){
        $slide_number = 1; //loop through slide files
        while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
            $xml_datas = $zip_handle->getFromIndex($xml_index);
            // loadXML() is an instance method; calling it statically fails on PHP 8.
            $xml_handle = new DOMDocument();
            $xml_handle->loadXML($xml_datas, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            $output_text .= strip_tags($xml_handle->saveXML());
            $slide_number++;
        }
        if($slide_number == 1){
            $output_text .="";
        }
        $zip_handle->close();
    }else{
    $output_text .="";
    }
    return $output_text;
}


    public function convertToText() {

        if(isset($this->filename) && !file_exists($this->filename)) {
            return "File Not exists";
        }

        $fileArray = pathinfo($this->filename);
        $file_ext  = $fileArray['extension'];
        if($file_ext == "doc" || $file_ext == "docx" || $file_ext == "xlsx" || $file_ext == "pptx")
        {
            if($file_ext == "doc") {
                return $this->read_doc();
            } elseif($file_ext == "docx") {
                return $this->read_docx();
            } elseif($file_ext == "xlsx") {
                return $this->xlsx_to_text($this->filename);
            } elseif($file_ext == "pptx") {
                return $this->pptx_to_text($this->filename);
            }
        } else {
            return "Invalid File Type";
        }
    }

}

The output is one run of text with no line breaks, no indentation, no headings, and so on. This is a sample of what I get when the conversion succeeds.

ENTOURAGE LIST Officiating Pastor Pastor Ron EgeGroom's Parents Mr. Mario Cabunoc Jr.Mrs. Susana CabunocBride's Parents: Mr. Edilberto Marucut (Deceased)Mrs. Yolanda MarucutPrincipal Sponsors: Capt. Nemesio Desales III Mr. Edwin GinesMr. Valentino CabunocMr. Felipe MarucutMr.Nilo CabunocMr. Froilan Dulce Mr. Jose Fabie CabunocMr. Ramon Navarro Mr. Alfonso Fernandez Mr. Isagani CabunocMr. Allan CabunocMr.Julius OrpillaMrs.Rhodora DesalesMrs. Clarita Alonzo Mrs. Niña CabunocMrs. Robelita Ana Mrs. Marife CabunocMrs. Juvy Dulce Mrs. Imelda de GuiaMs. Imelda FuraggananMrs.Madamoiselle Granada Mrs. Mayeth Hidalgo Mrs. Analyn Cabida Mrs. Luz Ignacio Best Man Mario Cabunoc III Maid of Honor Marivic MarucutGroomsman Warren Van CabunocBridesmaid Cristhel Joy CabunocSecondary Sponsors Candles Christian Paulo DivinaAlanis Joyce AlbisoVeil Vincent Allen FernandezShiela May CabunocCord Kurt Jayson AlbisoCyrille Allyssa LimpinCoin Bearer Achilles Ronil Rain FacunlaBible Bearer Ralph Jacob Dulce Ring Bearer Caleb Joshua MarucutFlowergirlsShekinah Irish CabunocYurie Ysabelle MarucutElisha Bernice Cajandig

The text below is the format I want, just like in the docx file.

ENTOURAGE LIST

Officiating Pastor Pastor Ron Ege

Groom's Parents Mr. Mario Cabunoc Jr. Mrs. Susana Cabunoc Bride's Parents: Mr. Edilberto Marucut (Deceased) Mrs. Yolanda Marucut

Principal Sponsors:

Capt. Nemesio Desales III Mr. Edwin Gines Mr. Valentino Cabunoc Mr. Felipe Marucut Mr. Nilo Cabunoc Mr. Froilan Dulce Mr. Jose Fabie Cabunoc Mr. Ramon Navarro Mr. Alfonso Fernandez Mr. Isagani Cabunoc Mr. Allan Cabunoc Mr. Julius Orpilla Mrs. Rhodora Desales Mrs. Clarita Alonzo Mrs. Niña Cabunoc Mrs. Robelita Ana Mrs. Marife Cabunoc Mrs. Juvy Dulce Mrs. Imelda de Guia Ms. Imelda Furagganan Mrs. Madamoiselle Granada Mrs. Mayeth Hidalgo Mrs. Analyn Cabida Mrs. Luz Ignacio

I want to control the format of the text output from the docx file in PHP. Can anyone help? Thank you in advance!

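This part of read_docx():

$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);

removes all the paragraph (p) and line-break (br) tags.

You need to protect them first: swap them for a placeholder, strip the remaining tags, and then replace the placeholder with a line break.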

$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "#BR#", $content);
$content = str_replace('<w:br/>', "#BR#", $content);
$content = str_replace('<w:jc w:val="center"/>', "#BR##BR#", $content);
$content = str_replace('<w:jc w:val="both"/>', "#BR##BR#", $content);
$content = str_replace('<w:jc w:val="left"/>', "#BR##BR#", $content);
$content = str_replace('<w:jc w:val="right"/>', "#BR##BR#", $content);
$striped_content = strip_tags($content);
$striped_content = str_replace("#BR#","\r\n", $striped_content);
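
For reference, the placeholder idea can be wrapped in a small function and tried on a string without opening a real file. This is only a sketch: the function name docxXmlToText() and the sample XML are made up for illustration, and real word/document.xml files carry much more markup (and may write the same tags with different spacing or attributes, which plain string matching will miss).

```php
<?php
// Hypothetical helper (not part of the original class): turns the raw XML of
// word/document.xml into plain text using the placeholder technique.
function docxXmlToText(string $xml): string
{
    // A table cell boundary becomes a single space.
    $xml = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $xml);
    // Paragraph ends and explicit line breaks become a placeholder.
    $xml = str_replace(['</w:r></w:p>', '<w:br/>'], '#BR#', $xml);
    // Drop every remaining tag, then turn the placeholders into line breaks.
    $text = strip_tags($xml);
    return str_replace('#BR#', "\r\n", $text);
}

// Made-up fragment for illustration only.
$sample = '<w:body>'
    . '<w:p><w:r><w:t>ENTOURAGE LIST</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>Officiating Pastor</w:t><w:br/><w:t>Pastor Ron Ege</w:t></w:r></w:p>'
    . '</w:body>';

$text = docxXmlToText($sample);
echo $text;
```

Each paragraph end and each <w:br/> comes out as its own line, so the heading and the two name lines end up on three separate lines.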

I use a separate replace for each kind of paragraph alignment because you may also want to replace '<w:jc w:val="center"/>' with something like "#CENTER#" and then turn that into extra spaces, which is what I did. If you don't need that, it is better to combine them into one replace.
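
A rough sketch of both options together, under some assumptions: the function name docxXmlToSpacedText() and the $indent parameter are invented here, preg_replace() is swapped in to cover the non-centered alignments in one call, and the four-space indent is an arbitrary stand-in for "extra spaces". Real files may use other alignment values or write the tag slightly differently, so treat the patterns as a starting point.

```php
<?php
// Hypothetical helper: same placeholder technique, with one combined
// replacement for left/right/both and a separate marker for centered text.
function docxXmlToSpacedText(string $xml, string $indent = '    '): string
{
    $xml = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $xml);
    // Centered paragraphs get a blank line plus a #CENTER# marker.
    $xml = str_replace('<w:jc w:val="center"/>', '#BR##BR##CENTER#', $xml);
    // The other alignments are joined into a single replacement.
    $xml = preg_replace('/<w:jc w:val="(?:both|left|right)"\/>/', '#BR##BR#', $xml);
    $xml = str_replace(['</w:r></w:p>', '<w:br/>'], '#BR#', $xml);
    $text = strip_tags($xml);
    // Placeholders become line breaks; the center marker becomes leading spaces.
    return str_replace(['#BR#', '#CENTER#'], ["\r\n", $indent], $text);
}

// Made-up fragment for illustration only.
$sample = '<w:p><w:pPr><w:jc w:val="center"/></w:pPr><w:r><w:t>ENTOURAGE LIST</w:t></w:r></w:p>'
    . '<w:p><w:pPr><w:jc w:val="left"/></w:pPr><w:r><w:t>Principal Sponsors:</w:t></w:r></w:p>';

$spaced = docxXmlToSpacedText($sample, '    ');
echo $spaced;
```

The centered heading is pushed in by the indent and every aligned paragraph is preceded by a blank line. True centering would need a known line width (for example with str_pad() and STR_PAD_BOTH), which this sketch does not attempt.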
