I've been banging my head against the wall trying to figure this out, but here goes. I'm currently scraping some pages that I get from various sources, and the URLs often have Google Analytics junk attached to the end, in this fashion:
&utm_medium=something&utm_source=other
And I'm trying to get rid of those from a URL. Since these are appended at the end of a URL, I do this:
$pattern = "^utm_source.*^";
$interUrl = preg_replace($pattern, '', $url);
utm_source is a required parameter for Google Analytics. Here's where my problem shows up. For some reason, I can't get the pattern to match an ampersand, like so: "^\\&utm_source.*^". Without the ampersand (and its escape), I get matches. So I thought "no biggie, I'll just do a substr", like so:
$finalUrl = substr($interUrl, 0, strlen($interUrl) - 1);
But nothing happens. I changed the -1 to -3 or even -4, but nothing got cut off, not even characters after the ampersand. I've also tried str_replace and even rtrim, but none of them could filter out the ampersand. This is frustrating since I am left with the wrong URL. Not only that, when I try to curl the page, I get a 404, while if I go to that site via my browser, I get redirected to the right page.
Any ideas on why this is happening?
ANSWER
In your case, adding & in front of the regular expression does the trick: ^&utm_source.*^
<?php
$ptn = "^&utm_source.*^";
$str = "http://someurl.com?index.php&utm_medium=something&utm_source=other";
$rpltxt = "";
echo preg_replace($ptn, $rpltxt, $str); // http://someurl.com?index.php&utm_medium=something
?>
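Note that the pattern above only cuts from utm_source onwards, so utm_medium survives when it comes first. A minimal sketch of a regex that drops every utm_ parameter regardless of position could look like the following. The function name remove_utm_with_regex and the sample URL are made up for illustration, and the sketch assumes the URL uses plain & separators and has no #fragment.

```php
<?php
// Hypothetical helper, not part of the answer above.
function remove_utm_with_regex($url)
{
    // Match "utm_<name>=<value>" directly after "?" or "&",
    // plus the "&" that follows it, if there is one.
    $stripped = preg_replace('/(?<=[?&])utm_[^&=]*=[^&]*&?/', '', $url);

    // Removing the last parameters can leave a dangling "?" or "&".
    return rtrim($stripped, '?&');
}

echo remove_utm_with_regex('http://someurl.com/index.php?id=7&utm_medium=something&utm_source=other'), "\n";
// prints: http://someurl.com/index.php?id=7
```

The lookbehind keeps the pattern from touching parameters that merely contain "utm_" somewhere in the middle of their name.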
I usually use explode() to simplify things, but then you need to reassemble the URL yourself. You may want to try parse_url() instead of regular expressions; it might be more appropriate in this case.
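You can use a combination of parse_str and http_build_query:
// parse_str() expects only the query string, so extract it from the URL first
parse_str((string) parse_url($url, PHP_URL_QUERY), $vars);
unset($vars['utm_source']); // unset() is safe even if the key is absent
// unset any other unwanted params the same way...
$query = http_build_query($vars);
// re-attach whatever is left of the query to the part before the "?"
$finalUrl = strtok($url, '?') . ($query === '' ? '' : '?' . $query);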
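To avoid unsetting each tracking key by hand, the same parse_str / http_build_query idea can be wrapped in a small function that drops every key starting with utm_. This is only a sketch: strip_utm_params and the example URL are hypothetical names, and any #fragment on the URL is discarded.

```php
<?php
// Hypothetical helper: removes every query parameter whose name starts with "utm_".
function strip_utm_params($url)
{
    $query = (string) parse_url($url, PHP_URL_QUERY);
    if ($query === '') {
        return $url; // no query string, nothing to strip
    }

    parse_str($query, $vars);
    foreach (array_keys($vars) as $key) {
        if (strpos((string) $key, 'utm_') === 0) {
            unset($vars[$key]);
        }
    }

    // Everything before the first "?" is scheme, host and path.
    $base = substr($url, 0, strpos($url, '?'));
    $rebuilt = http_build_query($vars);

    return $rebuilt === '' ? $base : $base . '?' . $rebuilt;
}

echo strip_utm_params('http://www.example.com/page.php?id=7&utm_medium=email&utm_source=newsletter'), "\n";
// prints: http://www.example.com/page.php?id=7
```

Keep in mind that http_build_query re-encodes the remaining values, so the rebuilt query can differ slightly in encoding from the original even though the parameters are the same.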
Using parse_url, as someone else suggested:
<?php
$str = 'http://www.mydomain.com/something.php?herp=derp&some=thing&utm_medium=something&utm_source=other';
$url_arr = parse_url($str);
$query_arr = explode('&', $url_arr['query']);
$final_query = array(); // query pairs to keep
for ($i = 0; $i < count($query_arr); $i++) {
    // Split "key=value" and skip any parameter whose name starts with utm_
    $tmp_arr = explode('=', $query_arr[$i]);
    if (!preg_match('/^utm_/', $tmp_arr[0])) {
        $final_query[] = $query_arr[$i];
    }
}
echo $finished_url = $url_arr['scheme'] . '://' . $url_arr['host'] . $url_arr['path'] . '?' . implode('&', $final_query);
// output: http://www.mydomain.com/something.php?herp=derp&some=thing
?>
While all the answers were nice and technical, I kept trying stuff with the regex until I figured something out. The URLs were, for some reason (probably my retrieval method), being HTML-encoded, so each ampersand was actually stored as &amp;. I ended up tweaking the regex like so:
$pattern = "/&amp;utm_source.*/";
And it works.
Why didn't I catch it earlier? I'm running my app on Laravel, and whenever I use the logging system, it seems to write an actual ampersand instead of &amp;, so it seemed like all was well.
At one point, I went to check the database to see what was happening and noticed that my URLs were ending with &amp; instead of with & (it showed up as a plain & in my view).
Thanks everyone!
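An alternative to matching the encoded ampersand in the pattern is to decode the URL first and then strip the tracking parameters. The following is a rough sketch, assuming the URL was HTML-encoded somewhere during retrieval and that the utm_ parameters sit at the end of the query string. The sample URL is made up, and the pattern is widened from utm_source to any utm_ parameter.

```php
<?php
// Hypothetical scraped URL in which the ampersands arrived HTML-encoded.
$url = 'http://www.example.com/page.php?id=7&amp;utm_medium=email&amp;utm_source=newsletter';

// Turn "&amp;" back into "&" so later string functions see a plain URL.
$decoded = html_entity_decode($url, ENT_QUOTES, 'UTF-8');

// Cut everything from the first "&utm_" parameter onwards.
$clean = preg_replace('/&utm_.*/', '', $decoded);

echo $clean, "\n";
// prints: http://www.example.com/page.php?id=7
```

Decoding up front also means substr, rtrim and curl all see the same URL a browser would request.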