简体   繁体   中英

Analyse python list with algorithm for counting occurences over date ranges

The following shows the structure of some data I have (format: a list of lists)

data = 
[ 
  [1,2008-12-01],
  [1,2008-12-01],
  [2,2008-12-01]
  ... (the lists continue)
]

The dates range from 2008-12-01 to 2008-12-25.

The first field identifies a user by id, the second field (a date field) shows when this user visited a page on my site.

I need to analyse this data so that i get the following results

25 users visted on 1 day
100 users visted on 2 days
300 users visted on 4 days
... upto 25 days

I am using python and don't know where to start !

EDIT

I'm sorry it seems I wasnt clear enough about what I needed as a few people have given answers that are not what I'm looking for.

I need to find out how many users visited on all the days eg
10 users visited on 25 days (or every day)

Then I'd like the to list the same for each frequency of days from 1 - 25. So as per my original example above
25 users visited for only one day (out of the 25)
100 users visited on 2 days (out of the 25)
etc

I DONT need to know how many visited on each day
thanks

Your result is a dictionary, right?

{ userNumber: setOfDays }

How about this to get started.

from collections import defaultdict
visits = defaultdict(set)
for user, date in someList:
    visits[user].add(date)

This gives you a dictionary with a set of dates on which they visited.

counts = defaultdict(int)
for user in visits:
    v= len(visits[user])
    count[v] += 1

This gives you a dictionary of # visits, # of users with that many visits.

Is that the kind of thing you're looking for?

Rewriting S.Lott's answer in SQL as an exercise, just to check that I got the requirements right...

SELECT * FROM someList;

 userid |    date    
--------+------------
      1 | 2008-12-01
      1 | 2008-12-02
      1 | 2008-12-03
      1 | 2008-12-04
      1 | 2008-12-05
      2 | 2008-12-03
      2 | 2008-12-04
      2 | 2008-12-05
      3 | 2008-12-04
      4 | 2008-12-04
      5 | 2008-12-05
      5 | 2008-12-05

SELECT countdates, COUNT(userid) AS nusers
FROM ( SELECT userid, COUNT (DISTINCT date) AS countdates
             FROM someList
             GROUP BY userid ) AS visits
GROUP BY countdates
HAVING countdates <= 25
ORDER BY countdates;

 countdates | nusers 
------------+--------
          1 |      3
          3 |      1
          5 |      1

This is probably not the most pythonic or efficient or smartest or whatever way of doing this. But maybe you can confirm if I've understood the requirements correctly:

>>> log=[[1, '2008-12-01'], [1, '2008-12-01'],[2, '2008-12-01'],[2, '2008-12-03'], [1, '2008-12-04'], [3, '2008-12-04'], [4, '2008-12-04']]
>>> all_dates = sorted(set([d for d in [x[1] for x in log]]))
>>> for i in range(0, len(all_dates)):
...     log_slice = [d for d in log if d[1] <= all_dates[i]]
...     num_users = len(set([u for u in [x[0] for x in log_slice]]))
...     print "%d users visited in %d days" % (num_users, i + 1)
... 
2 users visited in 1 days
2 users visited in 2 days
4 users visited in 3 days
>>> 

First, I should mention that you NEED to store the date as a string. Currently, it would do arithmetic on your current entry. So, if you format data like this, it will work better:

data = 
[ 
  [1,"2008-12-01"],
  [1,"2008-12-01"],
  [2,"2008-12-01"]
]

Next, we can do something like this to get the number for each day:

result = {}
for (id, date) in data:
    if date not in result:
        result[date] = 1
    else:
        result[date] += 1

Now you can get the number of users for a specific date by doing something like this:

print result[some_date]

It is unclear what exactly your requirement are. Here's my take:

#!/usr/bin/env python
from collections import defaultdict

data = [ 
  [1,'2008-12-01'],
  [3,'2008-12-25'],
  [1,'2008-12-01'],
  [2,'2008-12-01'],
]

d = defaultdict(set)
for id, day in data:
    d[day].add(id)

for day in sorted(d):
    print('%d user(s) visited on %s' % (len(d[day]), day))

It prints:

2 user(s) visited on 2008-12-01
1 user(s) visited on 2008-12-25

How about this: this gives you set of days as well as count:

In [39]: from itertools import groupby ##itertools is a part of the standard library.

In [40]: l=[[1, '2008-12-01'],
   ....:  [1, '2008-12-01'],
   ....:  [2, '2008-12-01'],
   ....:  [1, '2008-12-01'],
   ....:  [3, '3008-12-04']]

In [41]: l.sort()

In [42]: l
Out[42]: 
[[1, '2008-12-01'],
 [1, '2008-12-01'],
 [1, '2008-12-01'],
 [2, '2008-12-01'],
 [3, '3008-12-04']]

In [43]: for key, group in groupby(l, lambda x: x[0]):
   ....:     group=list(group)
   ....:     print key,' :: ', len(group), ' :: ', group
   ....:     
   ....:     
1  ::  3  ::  [[1, '2008-12-01'], [1, '2008-12-01'], [1, '2008-12-01']]
2  ::  1  ::  [[2, '2008-12-01']]
3  ::  1  ::  [[3, '3008-12-04']]

user::number of visits :: visit dates

Here the user -1 visits on 2008-12-01 3 times, if you are looking to count only distinct dates then

for key, group in groupby(l, lambda x: x[0]):
    group=list(group)
    print key,' :: ', len(set([(lambda y: y[1])(each) for each  in group])), ' :: ', group
   ....:     
   ....:     
1  ::  1  ::  [[1, '2008-12-01'], [1, '2008-12-01'], [1, '2008-12-01']]
2  ::  1  ::  [[2, '2008-12-01']]
3  ::  1  ::  [[3, '3008-12-04']]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM