
How to tell if a site has Google Analytics when there is no tracking code in source

I have a scraper that takes a domain and uses regular expressions to tell whether the site has Google Analytics and/or Google Tag Manager installed:

    function checkUA($domain) {
        try {
            $input = file_get_contents($domain);
            if ($input !== false) {
                $trackingPrefixes = ['UA', 'YT', 'MO', 'G', 'DC', 'AW'];

                if (preg_match_all(
                    '/\b
                        (?:
                            (?:' . implode('|', $trackingPrefixes) . ')-[A-Z\d]{4,10}(?:-[1-9]\d{0,3})?   # Tracker IDs
                            |
                            GTM-[A-Z\d]+                                            # Google Tag Manager IDs
                            |
                            googleanalytics_get_script
                        )
                    \b/x',
                    $input,
                    $matches
                )) {
                    return array_unique($matches[0]);
                } else {
                    // If no match is found, let us know
                    return 'no match found';
                }
            } else {
                return 'Site is blocked from crawlers';
            }
        } catch (Exception $ex) {
            return 'Site is blocked from crawlers';
        }
    }

My problem is that some sites implement Google Analytics through Google Tag Manager, so the tracking code never appears in the site's source and my script can't pick it up.

I'm guessing that tools like Google Tag Assistant and sites like https://builtwith.com/ use some other method to determine whether Google Analytics is active on a site, perhaps some kind of response headers, instead of what I'm doing above.

Is there any way using PHP to tell if a site has Google Analytics active, without using regular expressions to read the source code?

Theoretically, you could keep doing it the way you are, but it's not practical and may be unjustifiably complex: you would end up writing your own JS interpreter in PHP, or parsing the whole gtm.js library.
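
Short of interpreting the container, one cheap partial check is to pull the GTM container ID out of the HTML, download that container's script, and scan it for literal tracker IDs. The sketch below assumes the container is served from the standard `https://www.googletagmanager.com/gtm.js?id=...` URL and that IDs appear in it as plain strings. The helper names (`findContainerIds`, `containerScriptUrl`, `findTrackerIds`, `trackerIdsViaGtm`) are made up for illustration. Treat it as a heuristic: a hit only shows that an ID is configured in the container, not that the tag actually fires, and IDs supplied through variables will be missed.

```php
<?php
// Heuristic sketch; all function names here are hypothetical.

/** Return the distinct GTM container IDs (GTM-XXXX) found in a page's HTML. */
function findContainerIds(string $html): array
{
    preg_match_all('/\bGTM-[A-Z\d]+\b/', $html, $matches);
    return array_values(array_unique($matches[0]));
}

/** Build the URL a browser would load the container script from. */
function containerScriptUrl(string $containerId): string
{
    return 'https://www.googletagmanager.com/gtm.js?id=' . rawurlencode($containerId);
}

/** Return the distinct UA-/G- style IDs that appear literally in a script body. */
function findTrackerIds(string $script): array
{
    preg_match_all('/\b(?:UA-\d{4,10}-\d{1,4}|G-[A-Z\d]{4,12})\b/', $script, $matches);
    return array_values(array_unique($matches[0]));
}

/** Fetch every container referenced by $html and collect tracker IDs (needs network access). */
function trackerIdsViaGtm(string $html): array
{
    $found = [];
    foreach (findContainerIds($html) as $containerId) {
        // @ suppresses the warning file_get_contents() emits when the fetch fails.
        $script = @file_get_contents(containerScriptUrl($containerId));
        if ($script !== false) {
            $found = array_merge($found, findTrackerIds($script));
        }
    }
    return array_values(array_unique($found));
}
```

You would call `trackerIdsViaGtm($input)` with the HTML your scraper already fetched and merge the result with your existing matches; an empty array means nothing was found, not that analytics is absent.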

Note that GTM is not the only thing you miss by parsing HTML. You're also not capturing other tag management systems (TMSes) like Matomo, Adobe Launch, Tealium or Ensighten. Finally, you're not capturing custom implementations written in plain JS, or front-end implementations that send hits directly through the Measurement Protocol.

Whenever what you're parsing depends on what JS does, you want to actually render the fetched page, executing all the JS that is meant to run, and look at the DOM and the JS context rather than just an HTML string.

JS makes rendering the DOM trivial, and on the backend Node does the same: there are multiple packages for properly rendering the DOM. Either that, or just run Selenium on the backend.

Extensions like the Tag Assistant obviously run on the front-end, so they rely on JS completely. The best analytics extensions directly monitor the dataLayer and the network requests a page sends to see whether analytics-related requests are among them.

Selenium would be able to report the network requests back to PHP, but it is quite a poor solution for mass usage since a single instance takes far too many resources to run, even in headless mode. I imagine Node's DOM rendering libs would be the fastest option.
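
Whichever renderer you pick, the PHP side then only has to decide which of the captured request URLs are analytics-related. Here is a minimal sketch of that step, assuming the renderer hands you a plain array of URL strings. The function names and the host and path lists are illustrative assumptions and are certainly not exhaustive; for instance, they cover only Google's own endpoints, not other vendors or server-side proxies.

```php
<?php
// Sketch only; function names and the endpoint lists are assumptions.

/** Label a request URL as GTM/GA traffic, or return null if it looks unrelated. */
function classifyRequest(string $url): ?string
{
    // parse_url() yields null/false for missing or malformed parts; cast to ''.
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    $path = (string) parse_url($url, PHP_URL_PATH);

    if ($host === 'www.googletagmanager.com') {
        if ($path === '/gtm.js') {
            return 'gtm-container';
        }
        if ($path === '/gtag/js') {
            return 'gtag-library';
        }
    }

    $gaHosts = [
        'www.google-analytics.com',
        'ssl.google-analytics.com',
        'analytics.google.com',
        'region1.google-analytics.com',
    ];
    if (in_array($host, $gaHosts, true)) {
        // Matches /collect, /j/collect, /g/collect and similar hit endpoints.
        if (preg_match('#^/(?:[a-z]/)?collect$#', $path)) {
            return 'ga-hit';
        }
        if ($path === '/analytics.js' || $path === '/ga.js') {
            return 'ga-library';
        }
    }
    return null;
}

/** Count analytics-related requests per label in a list of captured URLs. */
function summarizeRequests(array $urls): array
{
    $counts = [];
    foreach ($urls as $url) {
        $kind = classifyRequest($url);
        if ($kind !== null) {
            $counts[$kind] = ($counts[$kind] ?? 0) + 1;
        }
    }
    return $counts;
}
```

A non-empty result from `summarizeRequests()` containing `ga-hit` is a much stronger signal than an ID found in the source, because it means a hit was actually sent while the page was rendered.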
