My problem is a bit hard to explain. I'm analyzing an Apache log file; the following is one line from it.
112.135.128.20 - [13/May/2013:23:55:04 +0530] "GET /SVRClientWeb/ActionController HTTP/1.1" 302 2 "https://www.example.com/sample" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10B329" GET /SVRClientWeb/ActionController - HTTP/1.1 www.example.com
Some parts of my code:
df = df.rename(columns={'%>s': 'Status', '%b':'Bytes Returned',
'%h':'IP', '%l':'Username', '%r': 'Request', '%t': 'Time', '%u': 'Userid', '%{Referer}i': 'Referer', '%{User-Agent}i': 'Agent'})
df.index = pd.to_datetime(df.pop('Time'))
test = df.groupby(['IP', 'Agent']).size()
test = test.sort_values()
print(test[-20:])
I read the log file into a DataFrame and get the following output, showing the hit count for each IP and user agent.
IP Agent
74.86.158.106 Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/) 369
203.81.107.103 Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20100101 Firefox/21.0 388
173.199.120.155 Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/) 417
124.43.84.242 Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31 448
112.135.196.223 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36 454
124.43.155.138 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0 461
124.43.104.198 Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20100101 Firefox/21.0 467
Then I want to get the
At the very least, could you please explain how to solve the above problems?
To do the first part you could just sort the DataFrame (by count) and take the top three rows:
In [11]: df.sort_values('Count', ascending=False).head(3)
Out[11]:
IP Agent Count
6 124.43.104.198 Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20... 467
5 124.43.155.138 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) G... 461
4 112.135.196.223 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3... 454
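The snippet above works on a flat DataFrame with a Count column, whereas the question's groupby(...).size() returns a Series indexed by (IP, Agent). A minimal sketch of one way to get from one to the other, assuming a tiny made-up log (the names log, hits and counts, and the shortened agent strings, are illustrative and not from the original data):

```python
import pandas as pd

# Hypothetical miniature of the parsed log: one row per request.
log = pd.DataFrame({
    "IP": ["124.43.104.198"] * 3 + ["124.43.155.138"] * 2 + ["74.86.158.106"],
    "Agent": ["Firefox/21.0"] * 5 + ["UptimeRobot/2.0"],
})

# Same grouping as in the question; size() gives a Series
# whose index is the (IP, Agent) pair.
hits = log.groupby(["IP", "Agent"]).size()

# Turn the index levels back into columns and name the counts "Count".
counts = hits.reset_index(name="Count")

# Sort by count and keep the three busiest (IP, Agent) pairs.
top = counts.sort_values("Count", ascending=False).head(3)
print(top)
```

With the real data, the counts frame here would play the role of df in the answer's examples.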
To test whether there are multiple rows (Agents) for a single IP you can use groupby:
In [12]: g = df.groupby('IP')
In [13]: repeated = g.count().Count != 1
In [14]: repeated
Out[14]:
IP
112.135.196.223 False
124.43.104.198 False
124.43.155.138 False
124.43.84.242 False
173.199.120.155 False
203.81.107.103 False
74.86.158.106 False
Name: Count, dtype: bool
In [15]: repeated[repeated]
Out[15]: Series([], dtype: bool)
There are none in this example.
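To see what a positive result would look like, here is a small sketch with invented rows in which one IP does appear with two agents. The IPs, agents and counts are made up for illustration, and nunique is swapped in for count so that only distinct agents are counted:

```python
import pandas as pd

# Hypothetical (IP, Agent, Count) rows; 10.0.0.1 uses two different agents.
counts = pd.DataFrame({
    "IP": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "Agent": ["Firefox/21.0", "Chrome/27.0", "Firefox/21.0"],
    "Count": [5, 3, 7],
})

# Number of distinct agents seen for each IP.
agents_per_ip = counts.groupby("IP")["Agent"].nunique()

# Keep only the IPs that appear with more than one agent.
repeated = agents_per_ip[agents_per_ip > 1]
print(repeated)

# Show the full rows for those IPs.
print(counts[counts["IP"].isin(repeated.index)])
```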
To avoid sorting the entire DataFrame, it's possible to use heapq. I expected this to be more efficient (update: it's not), and at the time of writing I didn't think pandas had an nlargest:
In [21]: from heapq import nlargest
In [22]: top_3 = nlargest(3, df.iterrows(), key=lambda x: x[1]['Count'])
In [23]: pd.DataFrame([row for _, row in top_3])
Out[23]:
IP Agent Count
6 124.43.104.198 Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20... 467
5 124.43.155.138 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) G... 461
4 112.135.196.223 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3... 454
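Newer pandas releases do provide DataFrame.nlargest, which selects the top rows by a column without an explicit full sort. A minimal sketch, assuming a frame shaped like the one above (the agent strings are shortened here for readability):

```python
import pandas as pd

# Hypothetical frame with the same columns as the answer's df.
counts = pd.DataFrame({
    "IP": ["124.43.84.242", "112.135.196.223", "124.43.155.138", "124.43.104.198"],
    "Agent": ["Chrome/26.0", "Chrome/27.0", "Firefox/21.0", "Firefox/21.0"],
    "Count": [448, 454, 461, 467],
})

# Three rows with the largest Count, in descending order.
top_3 = counts.nlargest(3, "Count")
print(top_3)
```

Whether this is faster than sorting depends on the size of the data, so it is worth timing both on the real log.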