
Saving unknown files with cURL in PHP 5.3.x

I'm trying to archive a web-based forum whose users have posted attachments. So far I've used the PHP cURL library to fetch the individual topics and have been able to save the raw pages. However, I now need to figure out a way to archive the attachments hosted on the site.

Here is the problem: since the file type is not consistent, I need to find a way to save each file with the correct extension. Note that I plan to rename the files as I save them so they're organized in a way that makes them easy to find later.

The link to the attached files in a page is in the format:

<a href="https://example.com/get_file?fileId=4342343212223">some file.txt</a>

I've already used preg_match() to get the URLs of the attached files. My biggest problem now is just making sure the fetched file is saved in the correct format.

My question: Is there any way to get the file type efficiently? I'd rather not have to use a regular expression, but I'm not seeing any other way.

Does the server send the correct Content-Type header field when serving the files? If so, you can intercept it by setting CURLOPT_HEADER, or with file_get_contents plus $http_response_header.

http://www.php.net/manual/en/reserved.variables.httpresponseheader.php
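For example (a minimal sketch; the output filename and the MIME-to-extension map are illustrative assumptions, not part of the original question):

<?php
// Fetch one attachment and choose an extension based on the
// Content-Type response header reported by the server.
$url = 'https://example.com/get_file?fileId=4342343212223';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects to the actual file
$body = curl_exec($ch);

// e.g. "text/plain; charset=UTF-8"
$contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);

// Minimal MIME-to-extension map; extend it for whatever your forum serves.
$extensions = array(
    'text/plain'      => 'txt',
    'image/jpeg'      => 'jpg',
    'image/png'       => 'png',
    'application/pdf' => 'pdf',
);
$mime = strtok((string) $contentType, ';'); // strip any "; charset=..." suffix
$ext  = isset($extensions[$mime]) ? $extensions[$mime] : 'bin';

file_put_contents('attachment-4342343212223.' . $ext, $body);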

I would look into

http://www.php.net/manual/en/book.fileinfo.php

to see if you can automatically detect the file type once you have the file's contents.
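A rough sketch of that approach, assuming $body already holds the downloaded attachment bytes from a cURL fetch like the one above:

<?php
// Detect the MIME type from the file contents themselves, independent
// of anything the server claims in its HTTP headers.
$finfo = new finfo(FILEINFO_MIME_TYPE);
$mime  = $finfo->buffer($body); // e.g. "image/png"

// You would still map $mime to an extension, as in the earlier sketch.
echo $mime;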

You can use DOMDocument and DOMXPath to extract the URLs and filenames safely.

$doc = new DOMDocument();
@$doc->loadHTML($content); // @ suppresses warnings on malformed forum HTML
$xpath = new DOMXPath($doc);
// Query examples:
foreach ($xpath->query('//a') as $node)
    echo $node->nodeValue;  // the link text, e.g. "some file.txt"
foreach ($xpath->query('//a/@href') as $node)
    echo $node->nodeValue;  // the href value, e.g. the get_file URL
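Since the link text in the markup above is the original filename, you can also pair each href with an extension taken from the link text (a sketch; it assumes $content holds the fetched topic page and that attachment links always contain "get_file", per the example markup in the question):

<?php
$doc = new DOMDocument();
@$doc->loadHTML($content); // @ silences warnings from malformed forum HTML
$xpath = new DOMXPath($doc);

// Only anchors that point at the attachment handler.
foreach ($xpath->query('//a[contains(@href, "get_file")]') as $node) {
    $href = $node->getAttribute('href');
    $ext  = pathinfo($node->nodeValue, PATHINFO_EXTENSION); // "txt" for "some file.txt"
    if ($ext === '') {
        $ext = 'bin'; // link text had no extension; fall back to a generic one
    }
    echo $href . ' => ' . $ext . "\n";
}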
