
Using PHP, get headers of large file URL

I am using PHP to pull data from one of our sites to another using the database. Part of this is to move the files as I find them in the HTML.

One aspect of this is checking whether that file exists and whether it is not HTML (meaning there is an actual file sitting at the end of an [...]).

Using get_headers() takes a long time on a 2.2 MB PDF. I tried to do the same with the following cURL request:

    public function getHeaders( $url ){
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_URL, $url );
        //curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
        //curl_setopt( $ch, CURLOPT_VERBOSE, 0 );
        //curl_setopt( $ch, CURLOPT_HEADER, 1 );
        curl_setopt( $ch, CURLOPT_CUSTOMREQUEST, 'HEAD' );
        curl_exec( $ch );
        $info = curl_getinfo( $ch );
        curl_close( $ch );
        return $info;
    }

The issue is that this also takes a long time (20+ seconds) to bring back just the headers. Once I know the URL is a file and returns a 200, I will go back, download it, and insert it into my new database.

Any thoughts on how to just get the headers nice and quick? Thanks.

====== Edit 10:30a CDT 4/20/2015 ======

Example code using the suggested methods:

<?php

//$file = 'http://www.pmi.org/Certification/~/media/PDF/Certifications/pdc_pmphandbook.ashx';
$file = 'https://www.projectmanagement-training.net/download/book_project_management.pdf';

print( 'Starting CURL Method : ' );
$time_start = microtime( true ); 
$headers = getHeaders( $file );
$execution_time = round( ( microtime( true ) - $time_start )/60, 8 );
print ( $execution_time . ' seconds <br />' );
print( '<pre>' . print_r( $headers, true ) . '</pre>' );



print( 'Starting get_headers() Method : ' );
$time_start = microtime( true ); 
$headers = get_headers( $file );
$execution_time = round( ( microtime( true ) - $time_start )/60, 8 );
print ( $execution_time . ' seconds <br />' );
print( '<pre>' . print_r( $headers, true ) . '</pre>' );



print( 'Starting get_headers() with context type Method : ' );
$time_start = microtime( true ); 
stream_context_set_default( array( 'http' => array( 'method' => 'HEAD', 'ignore_errors' => true ) ) );
$headers = get_headers( $file );
$execution_time = round( ( microtime( true ) - $time_start )/60, 8 );
print ( $execution_time . ' seconds <br />' );
print( '<pre>' . print_r( $headers, true ) . '</pre>' );



print( 'Starting file_get_contents Method : ' );
$time_start = microtime( true ); 
$context = stream_context_create( array( 'http' => array( 'method' => 'HEAD', 'ignore_errors' => true ) ) );
$file = file_get_contents( $file, false, $context );
$execution_time = round( ( microtime( true ) - $time_start )/60, 8 );
print ( $execution_time . ' seconds <br />' );
print( '<pre>' . print_r( $http_response_header, true ) . '</pre>' );

function getHeaders( $url ){
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_URL, $url );
    //curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
    //curl_setopt( $ch, CURLOPT_VERBOSE, 0 );
    //curl_setopt( $ch, CURLOPT_HEADER, 1 );
    curl_setopt( $ch, CURLOPT_CUSTOMREQUEST, 'HEAD' );
    curl_exec( $ch );
    $info = curl_getinfo( $ch );
    curl_close( $ch );
    return $info;
}

?>

Outputs:

Starting CURL Method : 0.01373608 seconds 
Array
(
    [url] => https://www.projectmanagement-training.net/download/book_project_management.pdf
    [content_type] => 
    [http_code] => 0
    [header_size] => 0
    [request_size] => 0
    [filetime] => -1
    [ssl_verify_result] => 1
    [redirect_count] => 0
    [total_time] => 0.202
    [namelookup_time] => 0
    [connect_time] => 0.124
    [pretransfer_time] => 0
    [size_upload] => 0
    [size_download] => 0
    [speed_download] => 0
    [speed_upload] => 0
    [download_content_length] => -1
    [upload_content_length] => -1
    [starttransfer_time] => 0
    [redirect_time] => 0
    [redirect_url] => 
    [primary_ip] => 81.169.145.64
    [certinfo] => Array
        (
        )

    [primary_port] => 443
    [local_ip] => 127.0.0.1
    [local_port] => 62741
)
Starting get_headers() Method : 0.03559045 seconds 
Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Mon, 20 Apr 2015 15:28:28 GMT
    [2] => Server: Apache/2.2.29 (Unix)
    [3] => X-Powered-By: PHP/5.3.29
    [4] => Content-Disposition: attachment; filename="book_project_management.pdf"
    [5] => Content-Type: application/pdf
    [6] => Connection: close
)
Starting get_headers() with context type Method : 0.03277322 seconds 
Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Mon, 20 Apr 2015 15:28:30 GMT
    [2] => Server: Apache/2.2.29 (Unix)
    [3] => X-Powered-By: PHP/5.3.29
    [4] => Content-Disposition: attachment; filename="book_project_management.pdf"
    [5] => Content-Type: application/pdf
    [6] => Connection: close
)
Starting file_get_contents Method : 0.04345868 seconds 
Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Mon, 20 Apr 2015 15:28:33 GMT
    [2] => Server: Apache/2.2.29 (Unix)
    [3] => X-Powered-By: PHP/5.3.29
    [4] => Content-Disposition: attachment; filename="book_project_management.pdf"
    [5] => Content-Type: application/pdf
    [6] => Connection: close
)

If your goal is to only get the headers with this function, why not use the PHP built-in? :)

http://php.net/manual/en/function.get-headers.php

file_get_contents() might be a quicker way of doing it, as the stream context options let you request only the header information:

<?php
    $url = "http://static.adzerk.net/Advertisers/831a088cf67e42c580e407e2d91c8ce6.jpg";

    $options = [
        'http' => [
            'method' => "HEAD",
            'ignore_errors' => 1
        ]
    ];

    $context = stream_context_create($options);
    $file = file_get_contents($url, false, $context);
    print_r($http_response_header);
?>

Although, as mentioned, PHP's stock function get_headers() (http://php.net/manual/en/function.get-headers.php) probably does the trick :)

Check these times in your $info array. These will tell you where the time is being spent:

CURLINFO_NAMELOOKUP_TIME
CURLINFO_CONNECT_TIME
CURLINFO_PRETRANSFER_TIME
CURLINFO_STARTTRANSFER_TIME
CURLINFO_SPEED_DOWNLOAD
CURLINFO_TOTAL_TIME
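
To make those numbers easier to read, here is a minimal sketch that turns the array returned by curl_getinfo() into per-phase durations. It relies on the cURL timers being cumulative (each one is measured from the start of the request), so each phase is the difference between two neighbouring timers. The function name timingBreakdown(), the phase labels, and the sample values are made up for illustration and are not from the original post.

```php
<?php
/**
 * Split a curl_getinfo() result into per-phase durations, in seconds.
 * Hypothetical helper for illustration only.
 */
function timingBreakdown(array $info): array
{
    // cURL reports each timer as "seconds since the request started".
    $dns     = (float) ($info['namelookup_time'] ?? 0);
    $connect = (float) ($info['connect_time'] ?? 0);
    $pre     = (float) ($info['pretransfer_time'] ?? 0);
    $start   = (float) ($info['starttransfer_time'] ?? 0);
    $total   = (float) ($info['total_time'] ?? 0);

    // Subtract neighbouring timers; clamp at zero because a failed
    // request can leave later timers unset (reported as 0).
    return [
        'dns_lookup'    => $dns,
        'tcp_connect'   => max(0.0, $connect - $dns),
        'tls_and_setup' => max(0.0, $pre - $connect),
        'server_wait'   => max(0.0, $start - $pre),
        'transfer'      => max(0.0, $total - $start),
    ];
}

// Made-up sample values: a fast connection followed by a long transfer phase.
$sample = [
    'namelookup_time'    => 0.05,
    'connect_time'       => 0.12,
    'pretransfer_time'   => 0.30,
    'starttransfer_time' => 0.45,
    'total_time'         => 20.5,
];
print_r(timingBreakdown($sample));
```

If nearly all of the time lands in the last phase, the request is waiting on the body or on the connection closing rather than on DNS or the connect step.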

Test the link at these two sites:

http://www.webpagetest.org/ and
http://gtmetrix.com/


If you are using get_headers(), set its defaults with stream_context_set_default().

get_headers() uses the default stream context, so this is a valid option.

    stream_context_set_default(
        array(
            'http' => array(
                'method' => 'HEAD'
            )
        )
    );
    $headers = get_headers('http://example.com');
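
Once the header lines are available, the check the question asks for (a 200 response that is not HTML) could be sketched roughly as below. The helper name looksLikeDownloadableFile() and its definition of "a file" are assumptions for illustration. It accepts the array shape returned by get_headers() without the associative flag, or the contents of $http_response_header.

```php
<?php
/**
 * Decide from raw header lines whether a URL points at a non-HTML file.
 * Hypothetical helper, not part of the original post.
 */
function looksLikeDownloadableFile(array $headerLines): bool
{
    $status = 0;
    $contentType = '';

    foreach ($headerLines as $line) {
        // A status line starts a new response (for example after a redirect),
        // so forget whatever was collected for the previous response.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#i', $line, $m)) {
            $status = (int) $m[1];
            $contentType = '';
            continue;
        }
        if (stripos($line, 'Content-Type:') === 0) {
            $contentType = strtolower(trim(substr($line, strlen('Content-Type:'))));
        }
    }

    // Assumption: "a file" means a final 200 response with a non-HTML type.
    return $status === 200
        && $contentType !== ''
        && strpos($contentType, 'text/html') !== 0
        && strpos($contentType, 'application/xhtml+xml') !== 0;
}

// Header lines taken from the output shown in the question.
$sample = [
    'HTTP/1.1 200 OK',
    'Content-Disposition: attachment; filename="book_project_management.pdf"',
    'Content-Type: application/pdf',
    'Connection: close',
];
var_dump(looksLikeDownloadableFile($sample)); // bool(true)
```

Since stream_context_set_default() changes the default for every later stream call in the script, it may be worth resetting the method afterwards or, on PHP 7.1 and later, passing a context to get_headers() directly.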

RE: curl

You are not going to get the header with this line commented out:

//curl_setopt( $ch, CURLOPT_HEADER, 1 );

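Also, you are not retrieving the data where the response header is located. Set these options (CURLOPT_RETURNTRANSFER is needed so that curl_exec() returns the response instead of printing it):

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);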

You need to add timeouts, and enable fail on error:

curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
curl_setopt($ch, CURLOPT_FAILONERROR,true);
curl_setopt($ch, CURLOPT_ENCODING,"");



$data = curl_exec($ch);

if (curl_errno($ch)) {
    $info['error'] = curl_error($ch);
}
else {
    // The response header occupies the first CURLINFO_HEADER_SIZE bytes of $data.
    $skip = intval(curl_getinfo($ch, CURLINFO_HEADER_SIZE));
    $responseHeader = substr($data, 0, $skip);
    $info = curl_getinfo($ch);
    $info['responseHeader'] = $responseHeader;
}
return $info;
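
One likely reason a request with CURLOPT_CUSTOMREQUEST set to 'HEAD' stalls is that libcurl still waits for a response body, which a HEAD response never carries. Setting CURLOPT_NOBODY tells cURL to issue a HEAD request and not expect a body. The sketch below combines that with the timeouts suggested above. The function names headRequestOptions() and fetchHeadInfo() are hypothetical, and whether this removes the delay for a particular server is an assumption to verify.

```php
<?php
/**
 * Build the cURL options for a headers-only request.
 * Hypothetical helper for illustration only.
 */
function headRequestOptions(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_NOBODY         => true, // send HEAD and do not wait for a body
        CURLOPT_HEADER         => true, // include response headers in the result
        CURLOPT_RETURNTRANSFER => true, // return the result instead of printing it
        CURLOPT_CONNECTTIMEOUT => 10,   // seconds allowed for connecting
        CURLOPT_TIMEOUT        => 10,   // seconds allowed for the whole request
    ];
}

/**
 * Run the request and return curl_getinfo() plus the raw response header,
 * or an array with an 'error' key if the transfer failed.
 */
function fetchHeadInfo(string $url): array
{
    $ch = curl_init();
    curl_setopt_array($ch, headRequestOptions($url));
    $raw = curl_exec($ch);

    if ($raw === false) {
        return ['error' => curl_error($ch)];
    }

    $info = curl_getinfo($ch);
    $info['responseHeader'] = $raw; // with NOBODY set, this is headers only
    return $info;                   // the handle is released when $ch goes out of scope
}

// Usage (performs a network request):
// $info = fetchHeadInfo('http://example.com');
// echo $info['http_code'] ?? $info['error'];
```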


 