
Load All TXT Files in an External Directory

I need to load all the .txt files listed at http://orcahub.com/unchecked-proxy-list/ as a single .txt file and save it on my own server, which is a different server from Orcahub.

For some reason it won't work. I can't even get the HTML back, so there is nothing to run a regex against.

What I tried:

<?php

$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, 'http://orcahub.com/unchecked-proxy-list'); 
curl_setopt($ch, CURLOPT_HEADER, FALSE); 
curl_setopt($ch, CURLOPT_NOBODY, FALSE); // FALSE keeps the body in the response
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); 
$st = curl_exec($ch); 
//curl_close($ch); 

//preg_match_all("/(.*\.txt)/", $st, $out);

var_dump($ch);
?>
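
A note on the attempt above: `var_dump($ch)` dumps the cURL handle rather than the response stored in `$st`, so the HTML never appears in the output. Below is a minimal sketch of fetching the page and surfacing cURL's own error message when the request fails. The function name `fetch_page` is made up for illustration, and following redirects is only an assumption about how the server treats the URL without its trailing slash.

```php
<?php

// Hypothetical helper (not from the original post): fetch a URL and
// return the body, or throw with cURL's error message on failure.
function fetch_page(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, trim($url));      // trim stray whitespace/newlines
    curl_setopt($ch, CURLOPT_HEADER, false);        // do not include response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // assumption: the server may redirect
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() explains why the transfer failed.
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    // The handle is released when $ch goes out of scope.
    return $body;
}
```

Calling `var_dump(fetch_page('http://orcahub.com/unchecked-proxy-list/'));` would then show the HTML string, and a failed request would raise an exception that says why.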

UPDATE: New issue: I get a Server Error 500 when I use the following script. UPDATE 2: Found out this issue was caused by a newline after the URL.

<?php

function disguise_curl($url) {

    //Prepare Curl;
    $curl = curl_init();

    //Setup Headers (Firefox 2.0.0.6);
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,"; 
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"; 
    $header[] = "Cache-Control: max-age=0"; 
    $header[] = "Connection: keep-alive"; 
    $header[] = "Keep-Alive: 300"; 
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
    $header[] = "Accept-Language: en-us,en;q=0.5"; 
    $header[] = "Pragma: ";

    //Setup Curl;
    curl_setopt($curl, CURLOPT_URL, $url); 
    curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)'); 
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header); 
    curl_setopt($curl, CURLOPT_REFERER, 'http://orcahub.com/unchecked-proxy-list/'); 
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate'); 
    curl_setopt($curl, CURLOPT_AUTOREFERER, true); 
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt($curl, CURLOPT_TIMEOUT, 60); 

    //Execute Curl;
    $html = curl_exec($curl);

    //End Curl;
    curl_close($curl);

    //Output the HTML;
    return $html;

}

function rem_href($x) { return substr(strstr($x, '>'), strlen('>')); }

$response = disguise_curl('http://orcahub.com/unchecked-proxy-list/'); 
preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $response, $matches, PREG_SET_ORDER );

foreach($matches as $value) { 
    $proxylists[] = 'http://orcahub.com/unchecked-proxy-list/'.rem_href($value[0]);
};

echo $proxylists[0];

$response = disguise_curl($proxylists[0]);
//Server Error 500 Here;
echo $response;

?>
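
Since the 500 error was traced to a newline after the URL, trimming each link before requesting it is one way to guard against it. Note also that, as written, `rem_href($value[0])` returns the link text followed by the closing `</a>`, because `$value[0]` is the whole matched element. The sketch below captures only the `href` value instead. The function name `extract_txt_links` is hypothetical, and it assumes the listing uses relative file names in its links.

```php
<?php

// Hypothetical helper: collect absolute URLs of the .txt links found in a
// directory-listing page. Assumes the hrefs are relative file names.
function extract_txt_links(string $html, string $baseUrl): array
{
    // Capture only the href value, not the whole <a>...</a> element.
    preg_match_all('/<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);

    $links = [];
    foreach ($matches[1] as $href) {
        $href = trim($href); // drop stray whitespace or newlines
        // Keep only links whose name ends in .txt (case-insensitive).
        if (strcasecmp(substr($href, -4), '.txt') === 0) {
            $links[] = rtrim($baseUrl, '/') . '/' . $href;
        }
    }
    return $links;
}

// Made-up sample markup, including an href with a stray newline.
$sample = "<a href=\"list1.txt\n\">list1.txt</a> <a href='../'>Parent</a>";
print_r(extract_txt_links($sample, 'http://orcahub.com/unchecked-proxy-list/'));
```

Each returned URL can then be passed to `disguise_curl()` without the trailing whitespace that triggered the error.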

I came across a function on php.net that adds headers to disguise the call; I added a regex for parsing the response:

function disguise_curl($url) 
{ 
  $curl = curl_init(); 

  // Setup headers - I used the same headers from Firefox version 2.0.0.6 
  // below was split up because php.net said the line was too long. :/ 
  $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,"; 
  $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"; 
  $header[] = "Cache-Control: max-age=0"; 
  $header[] = "Connection: keep-alive"; 
  $header[] = "Keep-Alive: 300"; 
  $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7"; 
  $header[] = "Accept-Language: en-us,en;q=0.5"; 
  $header[] = "Pragma: "; // browsers keep this blank. 

  curl_setopt($curl, CURLOPT_URL, $url); 
  curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)'); 
  curl_setopt($curl, CURLOPT_HTTPHEADER, $header); 
  curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com'); 
  curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate'); 
  curl_setopt($curl, CURLOPT_AUTOREFERER, true); 
  curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
  curl_setopt($curl, CURLOPT_TIMEOUT, 10); 

  $html = curl_exec($curl); // execute the curl command 
  curl_close($curl); // close the connection 

  return $html; // and finally, return $html 
} 

$response = disguise_curl('http://orcahub.com/unchecked-proxy-list/'); 
preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $response, $matches, PREG_SET_ORDER );

foreach($matches as $value) { 
    var_dump($value);
}; 
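
To get from the matched links to the single combined file the question asks for, each list still has to be downloaded and appended to one local file. The following is only a sketch of that last step: the name `merge_lists` and the output file name are made up, and the downloader is passed in as a callable so that `disguise_curl()` from above (or any other fetch function) can be used.

```php
<?php

// Hypothetical helper: download each list and append it to one local file.
// $fetch is any callable that takes a URL and returns the body as a string
// (or false on failure), e.g. the disguise_curl() function above.
function merge_lists(array $urls, string $outFile, callable $fetch): int
{
    file_put_contents($outFile, ''); // start from an empty file
    $merged = 0;

    foreach ($urls as $url) {
        $body = $fetch(trim($url)); // trim guards against stray newlines
        if ($body === false || $body === '') {
            continue; // skip lists that could not be downloaded
        }
        // Normalise the trailing newline so lists do not run together.
        file_put_contents($outFile, rtrim($body) . "\n", FILE_APPEND);
        $merged++;
    }
    return $merged; // number of lists written
}
```

With a list of .txt URLs in `$urls`, a call such as `merge_lists($urls, 'proxies.txt', 'disguise_curl');` would write the combined file next to the script and return how many lists were merged.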
