file_get_contents() returning invalid characters for uploaded word document

Question

I'm trying to get the first 1,000 characters from an uploaded text file. I'm doing:

if($file->simpletype=="document"){
    //get first 1000 chars in here
    // Offset 0 reads from the start of the file; since PHP 7.1 a negative offset counts from the end.
    $snippet = file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 1000);
    file_put_contents('/var/www/my_logs/log.log', $snippet);
    $file->snippet = $snippet;
}

This works fine for a .txt file, and I can open and read the log.log file with gedit. However, for .doc, .docx, .odt and .pdf files, file_get_contents() returns gibberish such as: PK\\00\\00\\00\\

I have tried another solution I found on Stack Overflow:

function file_get_contents_utf8() {
    $content = file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 1000);
    return mb_convert_encoding($content, 'UTF-8',
             mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

But I get the same results. Any ideas? Thanks!

Answer 1

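You are trying to read text from files that are not plain text. .docx and .odt files are ZIP archives of XML (which is why the output starts with PK, the ZIP signature), .doc is a binary format, and PDF usually stores its text in compressed streams. file_get_contents() returns the raw bytes in every case, so converting the character encoding cannot turn them into readable text.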
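
To make the point concrete, here is a minimal sketch that guesses the container format from the first bytes of a file. The function name sniff_format() is hypothetical, and the check only looks at well-known signatures; it does not validate the file or tell .docx apart from .odt or any other ZIP archive.

```php
<?php
// Hypothetical helper: guess the container format from a file's leading bytes.
function sniff_format(string $head): string
{
    if (strncmp($head, "PK\x03\x04", 4) === 0) {
        return 'zip';     // .docx and .odt are ZIP archives
    }
    if (strncmp($head, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'ole';     // legacy .doc (OLE compound file)
    }
    if (strncmp($head, "%PDF-", 5) === 0) {
        return 'pdf';
    }
    return 'unknown';     // possibly plain text, possibly something else
}
```

With an upload, the first 8 bytes are enough: sniff_format(file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 8)). Anything other than 'unknown' needs a format-specific parser before a text snippet can be taken.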

To read doc/docx files, you will need to use a tool like PHPDocx or http://phpword.codeplex.com.

For parsing PDFs, refer to the answer to this question.

Answer 2

This will never work with files that are not plain text. You need to extract the plain text from doc/pdf/odt documents first, and then you can manipulate that text. Open any of these documents in a simple text editor like Notepad to see what their raw contents look like.

For Word documents you may start with http://phpword.codeplex.com/. Also look for other libraries that you can use to extract the contents of these files.
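
As a rough illustration of "get plain text first", the sketch below pulls a snippet out of a .docx file without a third-party library. It assumes PHP's zip and mbstring extensions are available; the function name docx_snippet() is hypothetical. It only reads the main document body (word/document.xml) and ignores headers, footnotes, tabs and table layout, so treat it as a starting point rather than a full extractor. It does not handle .doc, .odt or .pdf.

```php
<?php
// Hypothetical helper: return up to $limit characters of plain text from a .docx file,
// or null if the file cannot be read as a Word document.
function docx_snippet(string $path, int $limit = 1000): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable ZIP archive
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null; // a ZIP archive, but without a Word document body
    }
    // Turn paragraph ends into newlines so adjacent paragraphs do not run together.
    $xml = str_replace('</w:p>', "\n", $xml);
    // Drop the remaining markup and decode XML entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return mb_substr(trim($text), 0, $limit, 'UTF-8');
}
```

In the upload handler from the question this could be called as docx_snippet($_FILES['upload']['tmp_name']), falling back to the plain file_get_contents() approach only for real text files.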

2 answers

solution1
1 ACCEPTED 2013-05-23 11:49:19

solution2
0 2013-05-23 11:48:52
