How to find the most frequently occurring timestamp and its frequency in a file with multiple entries per user and time in Python?

Question

I have a file with the following input:

    ID    time count
100000458   18  1
100000458   18  1
100000458   18  1
100000458   18  1
100000458   18  1
100000458   17  1
100000458   17  1
100000458   17  1
100000458   17  1
100005361   00  1
100005361   10  1
100005361   10  1
100005361   10  1

What I want is output that shows, for each ID, the most frequently occurring time along with its frequency, e.g.:

[100000458 18 5]
[100005361 10 3]

and so on. If there is a tie, print both times along with the frequency.

I believe a dictionary is the best way to go in Python, but I have been unable to implement a nested dictionary. The other option is a list, but I am not sure how well that will scale to large datasets. Any help will be much appreciated.

Answer 1

If the input is already grouped by id and time, as in the example in your question, then you could use itertools.groupby() to compute the statistics on the fly:

#!/usr/bin/env python
import sys
from itertools import groupby

file = sys.stdin
next(file) # skip header line

lines = (line.split() for line in file if line.strip())
for id, same_id in groupby(lines, key=lambda x: x[0]): # by id
    max_time, max_count = None, 0
    for time, same_time in groupby(same_id, key=lambda x: x[1]): # by time
        count = sum(int(c) for _, _, c in same_time)
        if count > max_count:
            max_time, max_count = time, count
    print("{} {} {}".format(id, max_time, max_count))

Output

100000458 18 5
100005361 10 3
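
The loop above keeps only the first time that reaches the highest count, so a tie is not reported. A minimal sketch of one way to keep every tied time, still assuming the rows arrive grouped by id and time; the function name most_common_times and the sample rows are made up for illustration:

```python
from itertools import groupby

def most_common_times(rows):
    """Yield (id, [tied times], count) for rows grouped by id and time.

    `rows` is an iterable of (id, time, count) string triples.
    """
    for user_id, same_id in groupby(rows, key=lambda r: r[0]):
        best_times, best_count = [], 0
        for time, same_time in groupby(same_id, key=lambda r: r[1]):
            count = sum(int(c) for _, _, c in same_time)
            if count > best_count:
                # a new maximum replaces whatever was collected so far
                best_times, best_count = [time], count
            elif count == best_count:
                # same count as the current maximum: record the tie
                best_times.append(time)
        yield user_id, best_times, best_count

# Hypothetical rows: id "1" has a tie between 18 and 17.
rows = [
    ("1", "18", "1"), ("1", "18", "1"),
    ("1", "17", "1"), ("1", "17", "1"),
    ("2", "00", "1"), ("2", "10", "1"), ("2", "10", "1"),
]
results = list(most_common_times(rows))
for user_id, times, count in results:
    print(user_id, ",".join(times), count)
```

Because it is a generator over groupby(), this variant still holds only one id's worth of state at a time.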

Answer 2

This could be a really simple solution. Let's say the input string is in the variable inpStr.

result = dict()
# skip the header line (assumes inpStr starts with it, as in the question)
for line in inpStr.splitlines()[1:]:
    if not line.strip():
        continue  # ignore blank lines
    id, time, count = line.split()
    # If it is the first time that I see id
    if id not in result:
        result[id] = dict()
    # this is the key line. I create a dictionary of dictionaries
    result[id][time] = result[id].get(time, 0) + int(count)

# Once I finished looping through the lines I need to find the maximum
# occurring time of a particular id (every tied time is printed)
for id in result:
    max_count = max(result[id].values())  # compute once per id
    for time in result[id]:
        if result[id][time] == max_count:
            print(id, time, result[id][time])
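
Since the question mentions trouble building a nested dictionary, it may help to see what the intermediate structure looks like. The sketch below builds the same id -> time -> count mapping with collections.defaultdict; the name build_counts and the shortened sample text are assumptions for illustration, not part of the answer:

```python
from collections import defaultdict

def build_counts(text):
    """Return {id: {time: total_count}} for whitespace-separated rows."""
    counts = defaultdict(lambda: defaultdict(int))
    rows = (line.split() for line in text.splitlines() if line.strip())
    next(rows, None)  # skip the header row, if there is one
    for user_id, time, count in rows:
        counts[user_id][time] += int(count)
    # convert to plain dicts so the result prints and compares cleanly
    return {user_id: dict(times) for user_id, times in counts.items()}

# Shortened, made-up sample in the question's format.
sample = """ID time count
100000458 18 1
100000458 18 1
100000458 17 1
100005361 00 1
100005361 10 1
100005361 10 1
"""
nested = build_counts(sample)
print(nested)
```

The times stay strings here, which preserves values such as 00 exactly as they appear in the file.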

Answer 3

Another Counter()-based solution:

#!/usr/bin/env python
import sys
from collections import Counter, defaultdict

file = sys.stdin
next(file) # skip header line

# collect counts
counts = defaultdict(Counter) # ID -> (time, count) mapping
for id, time, count in (line.split() for line in file if line.strip()):
    counts[id] += Counter({time: int(count)})

# print most common timestamps
for id, c in counts.items():
    time, count = c.most_common(1)[0]
    print("{id} {time} {count}".format(**vars()))

Output

100005361 10 3
100000458 18 5
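
Note that most_common(1) returns a single entry, so a tie between two times is not visible in the output above. A possible variation, sketched under the assumption that the header line has already been removed, collects every time whose total equals the per-id maximum; top_times and the sample rows are hypothetical names and data:

```python
from collections import Counter, defaultdict

def top_times(lines):
    """Map each id to (sorted list of tied top times, top count)."""
    counts = defaultdict(Counter)
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        user_id, time, count = line.split()
        counts[user_id][time] += int(count)
    result = {}
    for user_id, per_time in counts.items():
        best = max(per_time.values())
        tied = sorted(t for t, n in per_time.items() if n == best)
        result[user_id] = (tied, best)
    return result

# Made-up rows in the question's format; they need not be grouped.
sample = [
    "100000458 18 1", "100000458 17 1",
    "100000458 18 1", "100000458 17 1",
    "100005361 00 1", "100005361 10 1", "100005361 10 1",
]
top = top_times(sample)
for user_id in sorted(top):
    times, count = top[user_id]
    print(user_id, " ".join(times), count)
```

Unlike the groupby() approach, this does not require the input to be sorted or grouped, at the cost of keeping all counts in memory.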

Answer 4

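A pandas version (pandas must be installed):

import pandas

# read whitespace-separated columns; keep time as text so 00 stays 00
d = pandas.read_csv('test.txt', sep=r'\s+', dtype={'time': str})
# total count for every (ID, time) pair
totals = d.groupby(['ID', 'time'], as_index=False)['count'].sum()
# keep the rows whose total equals the per-ID maximum (ties are kept)
perid = totals[totals['count'] == totals.groupby('ID')['count'].transform('max')]
print(perid)

If you want the output to look exactly like the one in the question, a little more work is needed:

for _, row in perid.iterrows():
    print("[{} {} {}]".format(row['ID'], row['time'], row['count']))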

4 answers

Answer 1: 3, ACCEPTED, 2014-03-03 02:43:24
Answer 2: 0, 2014-03-03 02:18:00
Answer 3: 0, 2014-03-03 02:23:01
Answer 4: 0, 2014-03-03 03:08:04
