Preg-replace - replace all URLs except a domain and its subdomains

I have a Glype proxy and I do not want it to rewrite external URLs. All URLs on a page are automatically converted to: http://proxy.com/browse.php?u=[URL HERE]. Example: if I visit The Pirate Bay through my proxy, I do not want the following URLs to be rewritten:

ByteLove.com (Not to: http://proxy.com/browse.php?u=http://bytelove.com&b=0)
BayFiles.com (Not to: http://proxy.com/browse.php?u=http://bayfiles.com&b=0)
BayIMG.com (Not to: http://proxy.com/browse.php?u=http://bayimg.com&b=0)
PasteBay.com (Not to: http://proxy.com/browse.php?u=http://pastebay.com&b=0)
Ipredator.com (Not to: http://proxy.com/browse.php?u=https://ipredator.se&b=0)
etc.

Of course I want the internal URLs to stay proxied, so:

thepiratebay.se/browse (To: http://proxy.com/browse.php?u=http://thepiratebay.se/browse&b=0)
thepiratebay.se/top (To: http://proxy.com/browse.php?u=http://thepiratebay.se/top&b=0)
thepiratebay.se/recent (To: http://proxy.com/browse.php?u=http://thepiratebay.se/recent&b=0)
etc.

Is there a preg_replace that replaces all URLs except thepiratebay.se and its subdomains (as in the example)? Another function is also welcome, such as DOMDocument, QueryPath, substr or strpos. Not str_replace, because then I would have to define every URL.

I've found something, but I'm not familiar with preg_replace:

$exclude = '.thepiratebay.se';
$pattern = '(https?\:\/\/.*?\..*?)(?=\s|$)';
$message= preg_replace("~(($exclude)?($pattern))~i", '$2<a href="$4" target="_blank">$5</a>$6', $message);

You can use preg_replace_callback() to execute a callback function for every match. In that function you can decide whether the matched string should be converted or not.

<?php
$string = 'http://foobar.com/baz and http://example.org/bumm';
$pattern = '#(https?\:\/\/.*?\..*?)(?=\s|$)#i';
$string = preg_replace_callback($pattern, function($match) {
    if (stripos($match[0], 'example.org/') !== false) {
        // exclude all URLs containing example.org
        return $match[0];
    } else {
        return 'http://proxy.com/?u=' . urlencode($match[0]);
    }
}, $string);

echo $string, "\n";

(The example uses PHP 5.3 closure notation.)
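
The substring test in the example above would also treat a URL such as http://other.com/example.org/ as a match, and it does not cover subdomains explicitly. Here is a minimal sketch of a host-based check, assuming the goal from the question (proxy a domain and its subdomains, leave everything else alone). The helper names is_same_or_subdomain and proxify_internal_urls, the URL pattern, and the trailing &b=0 are assumptions for illustration, not part of Glype:

```php
<?php
// Hypothetical helper: true if $host is $domain itself or one of its subdomains.
function is_same_or_subdomain($host, $domain) {
    $host = strtolower($host);
    $domain = strtolower($domain);
    if ($host === $domain) {
        return true;
    }
    $suffix = '.' . $domain;
    return substr($host, -strlen($suffix)) === $suffix;
}

// Hypothetical helper: rewrite only URLs whose host belongs to $domain.
function proxify_internal_urls($text, $domain, $proxyBase) {
    $pattern = '#https?://[^\s"\'<>]+#i';
    return preg_replace_callback($pattern, function ($match) use ($domain, $proxyBase) {
        $host = parse_url($match[0], PHP_URL_HOST);
        if (is_string($host) && is_same_or_subdomain($host, $domain)) {
            // Internal URL: send it through the proxy.
            return $proxyBase . urlencode($match[0]) . '&b=0';
        }
        // External URL: leave it untouched.
        return $match[0];
    }, $text);
}

$text = 'http://thepiratebay.se/top http://static.thepiratebay.se/a.png http://bayimg.com/';
echo proxify_internal_urls($text, 'thepiratebay.se', 'http://proxy.com/browse.php?u='), "\n";
```

Comparing the parsed host against the domain with a leading dot keeps look-alike hosts such as notthepiratebay.se from being treated as subdomains.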

I guess you would need to provide a whitelist that tells which domains should be proxied:

$whitelist = array();
$whitelist[] = "internal1.se";
$whitelist[] = "internal2.no";
$whitelist[] = "internal3.com";
// and so on...

$string = '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fexternal1.com&b=0">External link 1</a><br>';
$string .=  '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Finternal1.se&b=0">Internal link 1</a><br>';
$string .=  '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Finternal3.com&b=0">Internal link 2</a><br>';
$string .=  '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fexternal2.no&b=0">External link 2</a><br>';

//Assuming the URL always is inside '' or "" you can use this pattern:
$pattern = '#(https?://proxy\.org/browse\.php\?u=(https?[^&|\"|\']*)(&?[^&|\"|\']*))#i';

$string = preg_replace_callback($pattern, "my_callback", $string);

// My server only had PHP 5.2 (no closures), so I used a named callback function.
function my_callback($match) {
    global $whitelist;
    // Default: bypass the proxy by returning the decoded target URL.
    $returnstring = urldecode($match[2]);

    foreach ($whitelist as $white) {
        // If the target URL contains a whitelisted domain, keep the proxied URL.
        if (stripos($match[2], $white) !== false) {
            $returnstring = $match[0];
            break;
        }
    }
    return $returnstring;
}

echo "NEW STRING[:\n" . $string . "\n]\n";
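
The whitelist test above looks for the domain anywhere in the encoded URL, so a link like external.com/?ref=internal1.se would count as internal. Below is a rough sketch of a stricter variant that decodes the target, compares only its host, and passes the whitelist in without global. It needs PHP 5.3+ for the closure. The function name unproxy_external_links and the assumption that the only extra parameter is a numeric b are mine, not taken from Glype:

```php
<?php
// Hypothetical helper: strip the proxy wrapper from links whose target host is
// neither a whitelisted domain nor a subdomain of one.
function unproxy_external_links($html, array $whitelist) {
    // Assumes the proxied URL looks like ...browse.php?u=<encoded URL>&b=<digits>.
    $pattern = '#https?://proxy\.org/browse\.php\?u=(https?[^&"\']*)(?:&b=\d+)?#i';
    return preg_replace_callback($pattern, function ($match) use ($whitelist) {
        $target = urldecode($match[1]);
        $host = strtolower((string) parse_url($target, PHP_URL_HOST));
        foreach ($whitelist as $domain) {
            $domain = strtolower($domain);
            $suffix = '.' . $domain;
            if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
                return $match[0]; // internal: keep the proxied URL
            }
        }
        return $target; // external: link straight to the target
    }, $html);
}

$html = '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fexternal1.com&b=0">External</a>'
      . '<a href="http://proxy.org/browse.php?u=http%3A%2F%2Fwww.internal1.se%2Ftop&b=0">Internal</a>';
echo unproxy_external_links($html, array('internal1.se')), "\n";
```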
