
Saving unknown files with cURL in PHP 5.3.x

I'm trying to archive a web-based forum whose users have posted attachments. So far I've used the PHP cURL library to fetch the individual topics and have been able to save the raw pages. However, I now need to figure out a way to archive the attachments hosted on the site.

Here is the problem: since the file type is not consistent, I need to find a way to save each file with the correct extension. Note that I plan to rename the files as I save them so they're organized in a way that makes them easy to find later.

The link to the attached files in a page is in the format:

<a href="https://example.com/get_file?fileId=4342343212223">some file.txt</a>

I've already used preg_match() to get the URLs of the attached files. My biggest problem now is just making sure the fetched file is saved in the correct format.

My question: Is there any way to get the file type efficiently? I'd rather not have to use a regular expression, but I'm not seeing any other way.

Does the server send the correct Content-Type header field when serving the files? If so, you can intercept it by setting CURLOPT_HEADER, or with file_get_contents plus $http_response_header.

http://www.php.net/manual/en/reserved.variables.httpresponseheader.php
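For example (a minimal sketch; the output filename and the MIME-to-extension map are illustrative assumptions, not part of the original question):

<?php
// Fetch one attachment and choose an extension based on the
// Content-Type response header reported by the server.
$url = 'https://example.com/get_file?fileId=4342343212223';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects to the actual file
$body = curl_exec($ch);

// e.g. "text/plain; charset=UTF-8"
$contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);

// Minimal MIME-to-extension map; extend it for whatever your forum serves.
$extensions = array(
    'text/plain'      => 'txt',
    'image/jpeg'      => 'jpg',
    'image/png'       => 'png',
    'application/pdf' => 'pdf',
);
$mime = strtok((string) $contentType, ';'); // strip any "; charset=..." suffix
$ext  = isset($extensions[$mime]) ? $extensions[$mime] : 'bin';

file_put_contents('attachment-4342343212223.' . $ext, $body);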

I would look into

http://www.php.net/manual/en/book.fileinfo.php

to see if you can automatically detect the file type once you have the file's contents.
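A rough sketch of that approach, assuming $body already holds the downloaded attachment bytes from a cURL fetch like the one above:

<?php
// Detect the MIME type from the file contents themselves, independent
// of anything the server claims in its HTTP headers.
$finfo = new finfo(FILEINFO_MIME_TYPE);
$mime  = $finfo->buffer($body); // e.g. "image/png"

// You would still map $mime to an extension, as in the earlier sketch.
echo $mime;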

You can use DOMDocument and DOMXPath to extract the URLs and filenames safely.

$doc = new DOMDocument();
@$doc->loadHTML($content); // @ suppresses warnings on malformed forum HTML
$xpath = new DOMXPath($doc);
// Query examples:
foreach ($xpath->query('//a') as $node)
    echo $node->nodeValue;  // the link text, e.g. "some file.txt"
foreach ($xpath->query('//a/@href') as $node)
    echo $node->nodeValue;  // the href value, e.g. the get_file URL
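Since the link text in the markup above is the original filename, you can also pair each href with an extension taken from the link text (a sketch; it assumes $content holds the fetched topic page and that attachment links always contain "get_file", per the example markup in the question):

<?php
$doc = new DOMDocument();
@$doc->loadHTML($content); // @ silences warnings from malformed forum HTML
$xpath = new DOMXPath($doc);

// Only anchors that point at the attachment handler.
foreach ($xpath->query('//a[contains(@href, "get_file")]') as $node) {
    $href = $node->getAttribute('href');
    $ext  = pathinfo($node->nodeValue, PATHINFO_EXTENSION); // "txt" for "some file.txt"
    if ($ext === '') {
        $ext = 'bin'; // link text had no extension; fall back to a generic one
    }
    echo $href . ' => ' . $ext . "\n";
}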
