
Saving unknown files with curl w/ PHP 5.3.x

I'm trying to archive a web-based forum that has attachments posted by users. So far, I have used the PHP cURL library to fetch the individual topics and have been able to save the raw pages. However, I now need to figure out a way to archive the attachments hosted on the site.

Here is the problem: since the file type is not consistent, I need a way to save each file with the correct extension. Note that I plan to rename each file when I save it, so that the archive is organized and files can be found easily later.

The links to the attached files in a page have this format:

<a href="https://example.com/get_file?fileId=4342343212223">some file.txt</a>

I've already used preg_match() to get the URLs of the attached files. My biggest problem now is making sure each fetched file is saved in the correct format.

My question: is there an efficient way to get the file type? I'd rather not use a regular expression, but I'm not seeing any other way.

Does the server add the correct Content-Type header field when serving the files? If it does, you can read that header by setting CURLOPT_HEADER, or by using file_get_contents() together with $http_response_header.

http://www.php.net/manual/en/reserved.variables.httpresponseheader.php
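
A minimal sketch of the Content-Type approach, assuming the server reports a meaningful type for each attachment. Instead of parsing the raw headers that CURLOPT_HEADER returns, it reads the type from curl_getinfo() with CURLINFO_CONTENT_TYPE. The function names extensionFromContentType() and saveAttachment(), the lookup table, and the 'bin' fallback are all hypothetical and would need adjusting to the types the forum actually serves.

```php
<?php
// Hypothetical lookup table: extend it with the types the forum really serves.
function extensionFromContentType($contentType, $default = 'bin')
{
    $map = array(
        'text/plain'      => 'txt',
        'text/html'       => 'html',
        'image/jpeg'      => 'jpg',
        'image/png'       => 'png',
        'image/gif'       => 'gif',
        'application/pdf' => 'pdf',
        'application/zip' => 'zip',
    );
    // Drop parameters such as "; charset=UTF-8" and normalise the case.
    $parts = explode(';', (string) $contentType);
    $mime  = strtolower(trim($parts[0]));
    return isset($map[$mime]) ? $map[$mime] : $default;
}

// Downloads $url and saves it as $baseName plus the detected extension.
// Returns the path written, or false if the download or the write failed.
function saveAttachment($url, $baseName)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    // Content-Type as reported by the server for this response.
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);
    if ($body === false) {
        return false;
    }
    $path = $baseName . '.' . extensionFromContentType($type);
    return file_put_contents($path, $body) === false ? false : $path;
}
```

A call might look like saveAttachment('https://example.com/get_file?fileId=4342343212223', 'topic-12-attachment-1'), where the base name is whatever naming scheme the archive uses. Servers that send a generic type such as application/octet-stream will end up with the fallback extension, so this is only as good as the headers.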

I would look into the Fileinfo extension

http://www.php.net/manual/en/book.fileinfo.php

to see whether you can detect the file type automatically once you have the file.
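
As a rough sketch of the Fileinfo idea, assuming the fileinfo extension is enabled: the finfo class can inspect the downloaded bytes and report a MIME type without relying on what the server claims. The helper name detectMimeType() is hypothetical.

```php
<?php
// Returns the MIME type that Fileinfo detects from the content itself,
// e.g. the body string returned by curl_exec() with CURLOPT_RETURNTRANSFER.
function detectMimeType($bytes)
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    return $finfo->buffer($bytes);
}
```

The detected type still has to be mapped to an extension, and content sniffing can only tell apart formats with recognisable signatures, so it is probably best used as a fallback when the server's headers are unhelpful.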

You can use DOMDocument and DOMXPath to extract the URLs and file names safely.

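$doc = new DOMDocument();
// Keep libxml from emitting warnings on malformed real-world HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
// Query examples: print the text of every link (here, the file name)...
foreach ($xpath->query('//a') as $node) {
    echo $node->nodeValue, "\n";
}
// ...and the href attribute of every link (the download URL).
foreach ($xpath->query('//a/@href') as $node) {
    echo $node->nodeValue, "\n";
}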
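
Building on the queries above, here is a sketch that pairs each attachment URL with its file name and extension. It assumes, as in the sample link from the question, that the link text holds the original file name and that attachment URLs contain "get_file?fileId=". The function name findAttachments() is hypothetical.

```php
<?php
// Returns a list of array('url' => ..., 'name' => ..., 'ext' => ...) entries,
// one per attachment link found in $html.
function findAttachments($html)
{
    $doc = new DOMDocument();
    // Collect parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $found = array();
    // Only keep links that point at the attachment endpoint.
    foreach ($xpath->query('//a[contains(@href, "get_file?fileId=")]') as $a) {
        $name = trim($a->nodeValue);
        $found[] = array(
            'url'  => $a->getAttribute('href'),
            'name' => $name,
            // Empty string when the link text has no extension.
            'ext'  => strtolower(pathinfo($name, PATHINFO_EXTENSION)),
        );
    }
    return $found;
}
```

When the link text carries no extension, the 'ext' entry is empty and one of the header-based or content-based checks would have to fill the gap.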
