
HTTP header User-Agent

I am trying to get a list of browsers from the User-Agent strings in HTTP headers. In many of the strings, the browser info is the second entry, as in the following:

(compatible;.MSIE.8.0;.Windows.NT.5.1;.Trident/4.0)

But in some of the strings, there is either no browser info, or the info comes as the third entry, as in the following:

(Macintosh;.Intel.Mac.OS.X.10_6_1;.U;.so)
(Macintosh;.Intel.Mac.OS.X.10_6_1;.so)

How should I approach this? Is there anything in Python for handling HTTP header fields? Many thanks.

I wrote a User-Agent analyzer in PHP some time back, so it might be a bit out of date, but I hope it helps. I extracted the browser info, operating system and language, but I will only include the browser info here.

All major browser names are included in the UA string, but Mozilla is in almost every one, so to detect Firefox, match the string Firefox instead. So create a mapping with the content:

browserList = {'Opera': 'Opera',
    'Internet Explorer': 'MSIE',
    'Firefox': 'Firefox',
    'Chrome': 'Chrome',
    'Not specified': ''}

Then try to match these against the UA string. You can add more browsers if you want to expand your statistics. As for the version number, in most cases it occurs right after the browser name, so try extracting the first number-dot-number sequence after the index where you found the browser name.
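The matching and version extraction described above could look roughly like this in Python. This is only a sketch: the names `BROWSER_TOKENS`, `VERSION_RE` and `detect_browser` are made up for illustration, and checking Opera before MSIE is an assumption based on some older Opera UA strings also containing MSIE.

```python
import re

# Display name -> token to look for, checked in this order.
# Assumption: Opera is tested first because some older Opera
# User-Agent strings also contain "MSIE".
BROWSER_TOKENS = [
    ("Opera", "Opera"),
    ("Internet Explorer", "MSIE"),
    ("Firefox", "Firefox"),
    ("Chrome", "Chrome"),
]

# A "number-dot-number" run such as "8.0" or "3.6.13".
VERSION_RE = re.compile(r"\d+(?:\.\d+)+")


def detect_browser(user_agent):
    """Return (browser_name, version or None) for a User-Agent string."""
    for name, token in BROWSER_TOKENS:
        index = user_agent.find(token)
        if index == -1:
            continue
        # Look for the version only in the text after the browser token.
        match = VERSION_RE.search(user_agent, index + len(token))
        version = match.group(0) if match else None
        return name, version
    # No known token was found in the string.
    return "Not specified", None
```

Calling `detect_browser` on a string with no known token, such as the Macintosh examples in the question, falls through to "Not specified". Note that the regular expression takes the first version-like number anywhere after the token, which is a looser rule than "immediately after".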

Your visitor might be a crawler (a bot, like Google's). You can find these by matching against this list:

nuhk, Googlebot, Yammybot, Openbot, Slurp, MSNBot, Ask Jeeves/Teoma, ia_archiver
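A crawler check against that list might be sketched in Python as follows. The names `CRAWLER_TOKENS` and `is_crawler` are hypothetical, and the case-insensitive comparison is my assumption, since bots do not always capitalize their names the same way.

```python
# Crawler tokens copied from the list above.
CRAWLER_TOKENS = [
    "nuhk", "Googlebot", "Yammybot", "Openbot", "Slurp",
    "MSNBot", "Ask Jeeves/Teoma", "ia_archiver",
]


def is_crawler(user_agent):
    """Return True if the User-Agent contains any known crawler token."""
    lowered = user_agent.lower()
    # Assumption: compare case-insensitively (e.g. "msnbot" vs "MSNBot").
    return any(token.lower() in lowered for token in CRAWLER_TOKENS)
```

Running this check before the browser match keeps bots out of the browser statistics.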

Hope this helps.
