Load All TXT Files in an External Directory
I need to load all the txt files in http://orcahub.com/unchecked-proxy-list/ as one txt file onto my own server, which is a different server from Orcahub.
For some reason it won't work. I can't even get the HTML, so I can't run a regex on it.
What I tried:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://orcahub.com/unchecked-proxy-list');
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_NOBODY, FALSE); // FALSE keeps the body in the response
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$st = curl_exec($ch);
//curl_close($ch);
//preg_match_all("/(.*\.txt)/", $st, $out);
var_dump($ch);
?>
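One thing worth noting about the attempt above: `var_dump($ch)` prints the cURL handle, while the downloaded page (or `false` on failure) is in `$st`. Here is a minimal sketch of a fetch that returns the body and reports errors. The helper name `fetch_page` and the timeout values are my own placeholders, and following redirects is an assumption on my part, in case the URL without the trailing slash redirects.

```php
<?php
// Hypothetical helper: fetch a page and return its body as a string.
function fetch_page(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, trim($url));      // trim stray whitespace or newlines
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // assumption: the server may redirect
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // placeholder timeouts
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    $body = curl_exec($ch); // string on success, false on failure
    if ($body === false) {
        // Surface the reason instead of silently getting nothing back.
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}
```

With this, `var_dump(fetch_page('http://orcahub.com/unchecked-proxy-list/'));` should dump the HTML string (or throw with the cURL error message) rather than the handle.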
UPDATE: New issue: I get a Server Error 500 when I use the script below. (Later update: this turned out to be caused by a newline after the URL.)
<?php
function disguise_curl($url) {
    // Prepare cURL
    $curl = curl_init();
    // Set up headers (Firefox 2.0.0.6)
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";
    // Set up cURL
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_REFERER, 'http://orcahub.com/unchecked-proxy-list/');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 60);
    // Execute cURL
    $html = curl_exec($curl);
    // End cURL
    curl_close($curl);
    // Return the HTML
    return $html;
}

function rem_href($x) {
    return substr(strstr($x, '>'), strlen('>'));
}

$response = disguise_curl('http://orcahub.com/unchecked-proxy-list/');
preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $response, $matches, PREG_SET_ORDER );
foreach ($matches as $value) {
    $proxylists[] = 'http://orcahub.com/unchecked-proxy-list/'.rem_href($value[0]);
}
echo $proxylists[0];
$response = disguise_curl($proxylists[0]);
// Server Error 500 here
echo $response;
?>
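Regarding the newline problem: as far as I can tell, `rem_href($value[0])` returns everything after the first `>` of the whole `<a ...>...</a>` match, so the built URL can carry extra characters beyond the file name. A possible alternative is to capture only the `href` value and `trim()` it. The sketch below assumes the listing uses relative links to the `.txt` files; `extract_txt_links` is a made-up name and the simplified regex is mine, not the one from the script.

```php
<?php
// Hypothetical helper: collect absolute URLs of the .txt links in a listing page.
// Assumes the hrefs are relative file names such as "list1.txt".
function extract_txt_links(string $html, string $baseUrl): array
{
    $links = [];
    // Capture only the href value, not the whole <a>...</a> element.
    if (preg_match_all('/<a\s[^>]*?href\s*=\s*["\']([^"\']*)["\']/i', $html, $matches)) {
        foreach ($matches[1] as $href) {
            $href = trim($href); // drop stray spaces and newlines
            if (preg_match('/\.txt$/i', $href)) {
                $links[] = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
            }
        }
    }
    return $links;
}
```

Each returned URL can then be passed to `disguise_curl()` without a trailing newline sneaking into the request.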
I came across a function on php.net that adds headers to disguise the call, and I added a regex to parse the response:
function disguise_curl($url)
{
    $curl = curl_init();
    // Setup headers - I used the same headers from Firefox version 2.0.0.6
    // below was split up because php.net said the line was too long. :/
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($curl); // execute the curl command
    curl_close($curl); // close the connection
    return $html; // and finally, return $html
}

$response = disguise_curl('http://orcahub.com/unchecked-proxy-list/');
preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $response, $matches, PREG_SET_ORDER );
foreach ($matches as $value) {
    var_dump($value);
}
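To get from the matched links to the single txt file the question asks for, something like the following might work. It is only a sketch: `combine_lists` is a hypothetical helper, and `$fetch` stands for any callable that takes a URL and returns the body or `false`, for example `'disguise_curl'` from the snippet above. Dropping blank lines and normalizing line endings are my assumptions about what the merged file should look like.

```php
<?php
// Hypothetical helper: download every list and merge them into one string.
// $fetch is any callable that takes a URL and returns the body or false.
function combine_lists(array $urls, callable $fetch): string
{
    $lines = [];
    foreach ($urls as $url) {
        $body = $fetch(trim($url)); // trim guards against a newline after the URL
        if (!is_string($body) || $body === '') {
            continue; // skip lists that failed to download
        }
        // Split on any line ending and keep only non-empty lines.
        foreach (preg_split('/\R/', $body) as $line) {
            $line = trim($line);
            if ($line !== '') {
                $lines[] = $line;
            }
        }
    }
    // One entry per line, with a trailing newline when there is any content.
    return implode("\n", $lines) . (count($lines) > 0 ? "\n" : '');
}
```

The result could then be saved with `file_put_contents('proxies.txt', combine_lists($urls, 'disguise_curl'));`, where `proxies.txt` is just a placeholder file name and `$urls` is the list of `.txt` links found on the page.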