
How to find the most commonly occurring timestamp and its frequency for each user, from a file with multiple entries per user and time, in Python?

I have a file with the following input:

    ID    time count
100000458   18  1
100000458   18  1
100000458   18  1
100000458   18  1
100000458   18  1
100000458   17  1
100000458   17  1
100000458   17  1
100000458   17  1
100005361   00  1
100005361   10  1
100005361   10  1
100005361   10  1

What I want to achieve is output that prints the most frequently occurring time for each ID along with its frequency, e.g.:

[100000458 18 5]
[100005361 10 3]

and so on. If there is a tie, then print both times along with the frequency.

I believe using a dictionary in Python would be the best way to go, but I have been unable to implement a nested dictionary. The other option is to use a list, but I am not sure how well it would scale for large datasets. Any help will be much appreciated.

If the input is already grouped by id and time, as in the example in your question, then you could use itertools.groupby() to compute the statistics on the fly:

#!/usr/bin/env python
import sys
from itertools import groupby

file = sys.stdin
next(file) # skip header line

lines = (line.split() for line in file if line.strip())
for id, same_id in groupby(lines, key=lambda x: x[0]): # by id
    max_time, max_count = None, 0
    for time, same_time in groupby(same_id, key=lambda x: x[1]): # by time
        count = sum(int(c) for _, _, c in same_time)
        if count > max_count:
            max_time, max_count = time, count
    print("{} {} {}".format(id, max_time, max_count))

Output

100000458 18 5
100005361 10 3
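
The question also asks for every tied time to be reported, which the loop above does not do (it keeps only the first time that reaches the highest count). The following is a minimal sketch of one way to extend the groupby() approach, not a definitive implementation. The function name `most_common_times` and the sample rows are made up for illustration, and the input is still assumed to be grouped by id and then by time, with no header line.

```python
from itertools import groupby


def most_common_times(lines):
    """Yield (id, time, count) for every time that ties for the top count of an id.

    Assumes `lines` holds data lines only (no header) that are already
    grouped by id and then by time, as in the question's sample.
    """
    rows = (line.split() for line in lines if line.strip())
    for id_, same_id in groupby(rows, key=lambda r: r[0]):
        # Total count per time for this id, kept in input order.
        totals = [(time, sum(int(r[2]) for r in same_time))
                  for time, same_time in groupby(same_id, key=lambda r: r[1])]
        best = max(count for _, count in totals)
        for time, count in totals:
            if count == best:
                yield id_, time, count


# Hypothetical sample with a tie for the first id.
sample = """\
100000458   18  2
100000458   17  2
100005361   00  1
100005361   10  3
""".splitlines()

for id_, time, count in most_common_times(sample):
    print(id_, time, count)
```

Only the per-time totals of a single id are held in memory at once, so the streaming character of the groupby() solution is mostly preserved.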

This could be a really simple solution. Let's say the input (the data lines only, without the header line) is in the variable inpStr.

result = dict()
for line in inpStr.splitlines():
    id, time, count = line.split()
    # First time this id is seen: create its inner dictionary
    if id not in result:
        result[id] = dict()
    # This is the key line: a dictionary of dictionaries, id -> time -> total count
    result[id][time] = result[id].get(time, 0) + int(count)

# Once the loop is finished, find the maximum occurring time(s) of each id
for id in result:
    max_count = max(result[id].values())  # computed once per id
    for time in result[id]:
        # every time that ties for the maximum is printed
        if result[id][time] == max_count:
            print(id, time, result[id][time])
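
The nested dictionary can also be built with collections.defaultdict, which removes the explicit "first time I see this id" check. This is only a sketch of that variation: the function name `tally`, the variable `inp_str` and the sample rows are hypothetical, and the input is again assumed to contain data lines only.

```python
from collections import defaultdict


def tally(inp_str):
    """Build id -> time -> total count from whitespace-separated data lines."""
    result = defaultdict(lambda: defaultdict(int))
    for line in inp_str.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        id_, time, count = line.split()
        result[id_][time] += int(count)
    return result


# Hypothetical sample input (no header line).
inp_str = """\
100000458   18  1
100000458   18  1
100000458   17  1
100005361   10  1
"""

result = tally(inp_str)
# Convert to plain dicts for a readable printout.
print({id_: dict(times) for id_, times in result.items()})
```

The finished structure can be searched for the maximum exactly as in the loop above, because a defaultdict behaves like a regular dict when it is read.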

Another solution, based on collections.Counter:

#!/usr/bin/env python
import sys
from collections import Counter, defaultdict

file = sys.stdin
next(file) # skip header line

# collect counts
counts = defaultdict(Counter) # ID -> Counter mapping each time to its total count
for id, time, count in (line.split() for line in file if line.strip()):
    counts[id] += Counter({time: int(count)})

# print most common timestamps
for id, c in counts.items():
    time, count = c.most_common(1)[0]
    print("{id} {time} {count}".format(**vars()))

Output (the order of the IDs depends on the dictionary's iteration order)

100005361 10 3
100000458 18 5
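
Note that most_common(1) returns a single entry, so a tie between two times is not reported. A small helper along the following lines could return all tied entries. It is a sketch under stated assumptions: the name `top_times` and the example counts are invented for illustration.

```python
from collections import Counter


def top_times(counter):
    """Return every (time, count) pair that shares the highest count."""
    if not counter:
        return []
    ranked = counter.most_common()  # sorted by count, highest first
    best = ranked[0][1]
    return [(time, count) for time, count in ranked if count == best]


# Hypothetical counts for one id, with a tie between "18" and "17".
c = Counter({"18": 4, "17": 4, "09": 1})
print(top_times(c))
```

In the solution above, this helper would replace the `c.most_common(1)[0]` lookup, with one line printed per returned pair.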

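A pandas version (requires pandas to be installed):

import pandas

d = pandas.read_csv('test.txt', sep=r'\s+', dtype={'time': str})  # keep "00" as text
# total count per (ID, time), then keep the rows matching each ID's maximum (ties included)
totals = d.groupby(['ID', 'time'], as_index=False)['count'].sum()
perid = totals[totals['count'] == totals.groupby('ID')['count'].transform('max')]
print(perid)

If you want the output to look exactly like you said, you need a little more work:

for ID, time, count in perid.itertuples(index=False):
    print([ID, time, count])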
