
file_get_contents() returning invalid characters for uploaded Word document

I'm trying to get the first 1,000 characters from an uploaded text file. I'm doing:

if($file->simpletype=="document"){
    //get first 1000 chars in here
    $snippet = file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 1000);
    file_put_contents('/var/www/my_logs/log.log', $snippet);
    $file->snippet = $snippet;
}

This works fine for a .txt file, and I can open and read the log.log file with gedit. However, for .doc, .docx, .odt and .pdf files, file_get_contents() returns gibberish such as: PK\\00\\00\\00\\

I have tried another solution I found on Stack Overflow:

function file_get_contents_utf8() {
    $content = file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 1000);
    return mb_convert_encoding($content, 'UTF-8',
             mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

But I get the same results. Any ideas? Thanks!

You are trying to read text from files that don't use plain text formatting.
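The "PK" at the start of the gibberish is the signature of a ZIP archive, which is the container that .docx and .odt files use. As a rough sketch of how the upload could be checked before treating it as text, the hypothetical helper below (sniff_document_type is not part of PHP or of the original code) compares the leading bytes against a few well-known signatures. Plain text has no signature, so it falls through to 'unknown'.

```php
<?php
// Hypothetical helper: guess the container format from the first bytes of a file.
function sniff_document_type(string $bytes): string
{
    if (strncmp($bytes, "PK\x03\x04", 4) === 0) {
        return 'zip';   // .docx and .odt are ZIP archives of XML parts
    }
    if (strncmp($bytes, "%PDF-", 5) === 0) {
        return 'pdf';   // PDF files start with a "%PDF-" header
    }
    if (strncmp($bytes, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'ole';   // legacy binary formats such as .doc
    }
    return 'unknown';   // possibly plain text; no signature to check
}
```

It would be called with the first few bytes of the upload, for example sniff_document_type(file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 8)).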

To read doc/docx files, you will need to use a tool like PHPDocx or http://phpword.codeplex.com.

For parsing PDFs, refer to the answer to this question.

This will never work with files that are not plain text. You first need to extract plain text from the doc/pdf/odt document, and only then can you manipulate that text. To see why, open any of these documents in a simple text editor like Notepad and look at their raw contents.

For Word documents you may start with http://phpword.codeplex.com/. Also look for other libraries that you can use to extract the contents of these files.
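If only a short snippet is needed, a .docx file can also be read without one of those libraries, because it is a ZIP archive whose main text lives in word/document.xml. The following is a minimal sketch of that alternative, not the approach of the libraries named above. It assumes the zip and mbstring extensions are installed, docx_snippet is a hypothetical name, and it ignores headers, footnotes, and tables. It does not handle .doc or .pdf files.

```php
<?php
// Sketch: pull the first $length characters of text out of a .docx file.
// Assumes ext-zip and ext-mbstring; docx_snippet() is a hypothetical helper.
function docx_snippet(string $path, int $length = 1000): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable ZIP archive
    }
    // The main document body is stored as XML inside the archive.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null; // a ZIP archive, but not a .docx
    }
    // End each paragraph with a newline, then drop all remaining tags.
    $xml = str_replace('</w:p>', "\n", $xml);
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return mb_substr($text, 0, $length, 'UTF-8');
}
```

In the upload handler from the question, this could be used as $file->snippet = docx_snippet($_FILES['upload']['tmp_name']); with a fallback to file_get_contents() for .txt uploads when it returns null.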
