Saving unknown files using curl w/ PHP 5.3.x

Question

I'm trying to archive a web-based forum that has attachments posted by users. So far, I've used the PHP cURL library to fetch the individual topics, and I've been able to save the raw pages. However, I now need to figure out a way to archive the attachments hosted on the site.

Here is the problem: since the file type is not consistent, I need a way to save each file with the correct extension. Note that I plan to rename each file when I save it, so that the archive is organized and the files are easy to find later.

Links to attached files in a page have this format:

<a href="https://example.com/get_file?fileId=4342343212223">some file.txt</a>

I've already used preg_match() to get the URLs of the attached files. My biggest problem now is making sure each fetched file is saved in the correct format.

My question: is there an efficient way to get the file type? I'd rather not use a regular expression, but I'm not seeing any other way.

Answer 1

Does the server add the correct Content-Type header field when serving the files? If so, you can read it by setting CURLOPT_HEADER, or by using file_get_contents() together with $http_response_header.

http://www.php.net/manual/en/reserved.variables.httpresponseheader.php
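A minimal sketch of the header-based approach, assuming the server sends a meaningful Content-Type. The function names (mimeToExtension(), contentTypeFromHeaders(), fetchWithHeaders()), the small MIME-to-extension map, and the 'bin' fallback are all illustrative assumptions, not part of the forum software or of cURL itself.

```php
<?php
// Hypothetical lookup table; extend it for the attachment types you actually see.
function mimeToExtension($mime)
{
    $map = array(
        'text/plain'      => 'txt',
        'text/html'       => 'html',
        'application/pdf' => 'pdf',
        'application/zip' => 'zip',
        'image/jpeg'      => 'jpg',
        'image/png'       => 'png',
        'image/gif'       => 'gif',
    );
    $mime = strtolower(trim($mime));
    return isset($map[$mime]) ? $map[$mime] : 'bin'; // 'bin' is an arbitrary fallback
}

// Return the media type from a raw header block, without parameters such as
// charset. If redirects produced several header sets, the last one wins.
function contentTypeFromHeaders($rawHeaders)
{
    if (preg_match_all('/^Content-Type:[ \t]*([^;\r\n]+)/mi', $rawHeaders, $m)) {
        return strtolower(trim(end($m[1])));
    }
    return null;
}

// Fetch a URL with cURL; returns array(rawHeaders, body), or false on failure.
function fetchWithHeaders($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response as a string
    curl_setopt($ch, CURLOPT_HEADER, true);         // include response headers in it
    $response = curl_exec($ch);
    if ($response === false) {
        curl_close($ch);
        return false;
    }
    // The header block's length tells us where the body starts.
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);
    return array(substr($response, 0, $headerSize), substr($response, $headerSize));
}
```

With these in place, something like `list($headers, $body) = fetchWithHeaders($url);` followed by `mimeToExtension(contentTypeFromHeaders($headers))` gives an extension to append to whatever name you choose. If only the type is needed, `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)` reports the same header without manual parsing.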

Answer 2

I would look into

http://www.php.net/manual/en/book.fileinfo.php

to see if you can automatically detect the file type once you have the file.
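As a rough sketch of that idea, the fileinfo extension can inspect the downloaded bytes directly, which helps when the server's headers are missing or generic. The helper names (detectMimeType(), saveWithDetectedExtension()), the extension map, and the 'bin' fallback are assumptions made for illustration.

```php
<?php
// Detect the MIME type from the bytes themselves.
// Returns null when the fileinfo extension is not loaded or detection fails.
function detectMimeType($data)
{
    if (!class_exists('finfo')) {
        return null;
    }
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime  = $finfo->buffer($data);
    return $mime === false ? null : $mime;
}

// Save $data as "$dir/$baseName.<ext>", choosing <ext> from the detected type.
// Returns the path written, or false if the write failed.
function saveWithDetectedExtension($data, $dir, $baseName)
{
    // Hypothetical map; extend it for the types your forum actually serves.
    $map = array(
        'text/plain'      => 'txt',
        'application/pdf' => 'pdf',
        'application/zip' => 'zip',
        'image/jpeg'      => 'jpg',
        'image/png'       => 'png',
        'image/gif'       => 'gif',
    );
    $mime = detectMimeType($data);
    $ext  = ($mime !== null && isset($map[$mime])) ? $map[$mime] : 'bin';
    $path = rtrim($dir, '/') . '/' . $baseName . '.' . $ext;
    return file_put_contents($path, $data) === false ? false : $path;
}
```

Because detection works on content, it does not depend on the link text or on what the server claims. The trade-off is that container formats (for example, office documents that are really ZIP archives) may be reported as their container type.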

Answer 3

You can use DOMDocument and DOMXPath to extract the URLs and file names safely.

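$doc = new DOMDocument();
// Forum HTML is rarely valid; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
// Query examples:
// Link text (the displayed file name, e.g. "some file.txt")
foreach ($xpath->query('//a') as $node) {
    echo $node->nodeValue, "\n";
}
// Link targets (the href attribute values)
foreach ($xpath->query('//a/@href') as $node) {
    echo $node->nodeValue, "\n";
}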
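Building on the queries above, here is a sketch that pairs each attachment URL with its link text and takes the extension from that text. The function name extractAttachments() is hypothetical, and the XPath filter assumes every attachment link contains "get_file?fileId=", as in the question's example.

```php
<?php
// Collect attachment links as array('url' => ..., 'name' => ..., 'ext' => ...).
function extractAttachments($html)
{
    if ($html === '') {
        return array(); // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid forum markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $files = array();
    // Assumption: attachment links all point at "get_file?fileId=...".
    foreach ($xpath->query('//a[contains(@href, "get_file?fileId=")]') as $a) {
        $name = trim($a->nodeValue);
        $files[] = array(
            'url'  => $a->getAttribute('href'),
            'name' => $name,
            // Empty string when the link text has no extension.
            'ext'  => strtolower(pathinfo($name, PATHINFO_EXTENSION)),
        );
    }
    return $files;
}
```

The link text is user-supplied, so the extension it carries may be missing or wrong; it is probably best treated as a hint to cross-check against the Content-Type header or a fileinfo lookup, as the other answers suggest.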

3 answers

Answer 1
1 2011-06-10 05:36:59

Answer 2
0 2011-06-10 03:42:27

Answer 3
0 2011-06-10 03:53:20
