Can't remove an ampersand from the end of a URL in PHP

I've been banging my head against the wall trying to figure this out, but here goes. I'm currently scraping some pages that I get from various sources, and the URLs often have Google Analytics crap attached to the end of them, in this fashion:

&utm_medium=something&utm_source=other

And I'm trying to get rid of those from a URL. Since these are appended at the end of a URL, I do this:

 $pattern = "^utm_source.*^";
 $interUrl = preg_replace($pattern, '', $url);

utm_source is a required portion of the URL for Google Analytics. Here's where my problem shows up. For some reason, I can't get the pattern to match an ampersand like so: "^\\&utm_source.*^". Without the ampersand (and its escape), I get matches. So I thought "no biggie, I'll just do a substr" like so:

 $finalUrl = substr($interUrl, 0, strlen($interUrl) - 1);

But nothing happens. I changed the -1 to -3 or even -4, but nothing got cut off, not even the characters after the ampersand. I've also tried str_replace and even rtrim, but neither could filter out the ampersand. This is frustrating since I am left with the wrong URL. Not only that, when I try to curl the page, I get a 404, while if I go to that site via my browser, I get redirected to the right page.

Any ideas on why this is happening?

ANSWER

In your case, adding & in front of the regular expression does the trick: ^&utm_source.*^

<?php 
  $ptn = "^&utm_source.*^";
  $str = "http://someurl.com?index.php&utm_medium=something&utm_source=other";
  $rpltxt = "";
  echo preg_replace($ptn, $rpltxt, $str); // http://someurl.com?index.php&utm_medium=something
?>
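Note that the ^ characters in that pattern act as regex delimiters, not as start-of-string anchors, so it should behave the same as /&utm_source.*/. A rough sketch to check this, assuming the URL contains a plain & rather than an HTML-encoded one (the variable names are only for illustration):

```php
<?php
// Sample URL taken from the answer above.
$url = "http://someurl.com?index.php&utm_medium=something&utm_source=other";

// "^" used as the delimiter: PCRE accepts any non-alphanumeric,
// non-backslash, non-whitespace character as a delimiter.
$withCaret = preg_replace("^&utm_source.*^", "", $url);

// The same pattern with the more conventional "/" delimiter.
$withSlash = preg_replace("/&utm_source.*/", "", $url);

// Both calls strip everything from "&utm_source" to the end of the string.
var_dump($withCaret === $withSlash);
```

Either spelling leaves utm_medium in place when it comes before utm_source, as the output comment above shows.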

I usually use explode() to simplify things, but then again you will need to reassemble the URL.

You may also try parse_url() instead of regular expressions; it might be more appropriate in this case.

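You can use a combination of parse_str and http_build_query:

// parse_str() expects a query string, not a full URL, so extract the query first
parse_str((string) parse_url($url, PHP_URL_QUERY), $vars);

unset($vars['utm_source']); // unset() is a no-op if the key is missing
// unset any other unwanted params the same way...

// http_build_query() returns only the query string; prepend the part before "?"
$finalUrl = strtok($url, '?') . '?' . http_build_query($vars);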

By using parse_url like someone else had suggested:

<?php
$str = 'http://www.mydomain.com/something.php?herp=derp&some=thing&utm_medium=something&utm_source=other';
$url_arr = parse_url($str);
$query_arr = explode('&', $url_arr['query']);
$final_query = array(); // collects the query parameters we keep

for($i=0;$i<count($query_arr);$i++) {
        $tmp_arr = explode('=', $query_arr[$i]);
        if(!preg_match('/^utm_/', $tmp_arr[0])) {
                $final_query[] = $query_arr[$i];
        }
}

echo $finished_url = $url_arr['scheme'] . '://' . $url_arr['host'] . $url_arr['path'] . '?' . implode('&', $final_query);

//output: http://www.mydomain.com/something.php?herp=derp&some=thing
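
The parse_url approaches above can be folded into one small function. This is only a sketch under a few assumptions: strip_utm_params is a hypothetical helper name, every parameter starting with utm_ is treated as tracking noise, and the URL holds plain & separators. Also keep in mind that parse_str() rewrites dots and spaces in parameter names to underscores, so the rebuilt query may not be byte-for-byte identical to the original.

```php
<?php
// Hypothetical helper (not from the original answers): removes every
// query parameter whose name starts with "utm_".
function strip_utm_params(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url; // malformed URL or no query string to clean
    }

    // Decode the query string into an associative array.
    parse_str($parts['query'], $params);

    // Keep only the parameters that are not utm_* tracking fields.
    $kept = array_filter(
        $params,
        static function ($name): bool {
            return strpos((string) $name, 'utm_') !== 0;
        },
        ARRAY_FILTER_USE_KEY
    );

    // Everything before the first "?" is scheme, host and path.
    $base = strstr($url, '?', true);
    $query = http_build_query($kept);
    $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';

    // Omit the "?" entirely when no parameters are left.
    return $base . ($query !== '' ? '?' . $query : '') . $fragment;
}

echo strip_utm_params('http://www.mydomain.com/something.php?herp=derp&some=thing&utm_medium=something&utm_source=other');
// http://www.mydomain.com/something.php?herp=derp&some=thing
```

Unlike the regex versions, this does not depend on the utm_ parameters being at the end of the URL.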

While all the answers were nice and technical, I kept trying shit with the regex until I figured something out. The URLs were, for some reason (probably my retrieval method), being encoded so I ended up tweaking the regex like so:

$pattern = "/&amp;utm_source.*/";

And it works.

Why didn't I catch it earlier? I'm running my app on Laravel, and whenever I use the logging system, it seems to show an actual ampersand instead of &amp;, so it seemed like all was well.

At one point, I went to check the database to see what was happening and noticed that my URLs were ending with &amp; instead of with & (which is how it showed up in my view).

Thanks everyone!
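
For anyone hitting the same symptom, here is a minimal sketch of two ways to cope with a URL that may contain &amp; instead of &. The sample URL is made up for illustration, and decoding with html_entity_decode() first is an alternative suggested here rather than something from the answer above.

```php
<?php
// Sample value as it might come out of scraped HTML (assumed for illustration).
$url = 'http://www.mydomain.com/something.php?herp=derp&amp;utm_medium=something&amp;utm_source=other';

// Option 1: accept either a raw "&" or the entity "&amp;" before utm_source.
$a = preg_replace('/&(?:amp;)?utm_source.*/', '', $url);

// Option 2: decode HTML entities first, then use the plain pattern.
$decoded = html_entity_decode($url);
$b = preg_replace('/&utm_source.*/', '', $decoded);

echo $a, "\n"; // remaining separators are still "&amp;"
echo $b, "\n"; // remaining separators are plain "&"
```

Option 2 is probably the better fit if the URL is going to be passed to curl afterwards, since the remaining separators come out as real ampersands.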
