I have a file with the following input:
ID time count
100000458 18 1
100000458 18 1
100000458 18 1
100000458 18 1
100000458 18 1
100000458 17 1
100000458 17 1
100000458 17 1
100000458 17 1
100005361 00 1
100005361 10 1
100005361 10 1
100005361 10 1
What I want is output that prints, for each ID, the time that occurs most often together with its frequency, e.g.
[100000458 18 5]
[100005361 10 3]
and so on. If there is a tie, print both times along with the frequency.
I believe a dictionary in Python is the best way to go, but I have been unable to implement a nested dictionary. The other option is a list, but I am not sure how well that will scale to large datasets. Any help would be much appreciated.
If the input is already grouped by ID and time, as in the example in your question, then you could use itertools.groupby() to compute the statistics on the fly:
#!/usr/bin/env python
import sys
from itertools import groupby

file = sys.stdin
next(file)  # skip header line
lines = (line.split() for line in file if line.strip())
for id, same_id in groupby(lines, key=lambda x: x[0]):  # by id
    max_time, max_count = None, 0
    for time, same_time in groupby(same_id, key=lambda x: x[1]):  # by time
        count = sum(int(c) for _, _, c in same_time)
        if count > max_count:
            max_time, max_count = time, count
    print("{} {} {}".format(id, max_time, max_count))
Output:
100000458 18 5
100005361 10 3
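The loop above keeps only the first time that reaches the highest count, so a tie is not reported. A minimal sketch of a tie-aware variant follows, assuming the same "id time count" line layout. The function name most_frequent_times and the sample rows are made up for illustration. Note that groupby() only merges consecutive items with equal keys, so this still relies on the input being grouped by ID and then by time.

```python
from itertools import groupby


def most_frequent_times(lines):
    """Yield (id, time, count) for each time tied for the highest count.

    Assumes `lines` is already grouped by id and then by time, because
    groupby() only merges consecutive items with equal keys.
    """
    rows = (line.split() for line in lines if line.strip())
    for id_, same_id in groupby(rows, key=lambda row: row[0]):
        # Total count per time for this id, in input order.
        totals = [
            (time, sum(int(row[2]) for row in same_time))
            for time, same_time in groupby(same_id, key=lambda row: row[1])
        ]
        best = max(count for _, count in totals)
        for time, count in totals:
            if count == best:
                yield id_, time, count


# Hypothetical sample: id 7 has a tie between times 09 and 10.
sample = ["7 09 1", "7 09 1", "7 10 1", "7 10 1", "8 05 1"]
groupby_result = list(most_frequent_times(sample))
for id_, time, count in groupby_result:
    print(id_, time, count)
```

Only the totals for one ID are held in memory at a time, which is what makes this approach attractive for large files.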
This could be a really simple solution. Let's say the input string is in the variable inpStr.
result = dict()
for line in inpStr.splitlines():
    if line.startswith("ID") or not line.strip():
        continue  # skip the header line and blank lines
    id, time, count = line.split()
    # If it is the first time that I see id
    if id not in result:
        result[id] = dict()
    # this is the key line. I create a dictionary of dictionaries
    result[id][time] = result[id].get(time, 0) + int(count)

# Once I finished looping through the list I need to find the maximum
# occurring time of a particular id
for id in result:
    best = max(result[id].values())  # highest total for this id
    for time in result[id]:
        if result[id][time] == best:  # every time tied for the maximum is printed
            print(id, time, result[id][time])
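Since the question mentions large datasets, the same nested-dictionary idea can be fed line by line instead of from one big string. This is only a sketch: busiest_times is a made-up name, io.StringIO stands in for an open file, and the header is assumed to start with "ID" as in the question. Memory use grows with the number of distinct (id, time) pairs, not with the number of lines.

```python
import io
from collections import defaultdict


def busiest_times(lines):
    """Return {id: [(time, total), ...]} keeping every time tied for the top."""
    totals = defaultdict(lambda: defaultdict(int))  # id -> time -> summed count
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[0] == "ID":  # skip blanks and the header
            continue
        id_, time, count = fields
        totals[id_][time] += int(count)
    result = {}
    for id_, per_time in totals.items():
        best = max(per_time.values())
        result[id_] = [(t, c) for t, c in per_time.items() if c == best]
    return result


# io.StringIO stands in for an open file; any iterable of lines works.
sample_file = io.StringIO("ID time count\n1 18 1\n1 18 1\n1 17 1\n2 10 1\n")
nested_result = busiest_times(sample_file)
print(nested_result)
```

With a real file you would pass the file object itself, e.g. inside a with open(...) block, so the lines are read lazily.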
Another solution, based on Counter():
#!/usr/bin/env python
import sys
from collections import Counter, defaultdict

file = sys.stdin
next(file)  # skip header line

# collect counts
counts = defaultdict(Counter)  # ID -> {time: count} mapping
for id, time, count in (line.split() for line in file if line.strip()):
    counts[id] += Counter({time: int(count)})

# print most common timestamps
for id, c in counts.items():
    time, count = c.most_common(1)[0]
    print("{id} {time} {count}".format(**vars()))
Output (the order of the IDs may differ):
100005361 10 3
100000458 18 5
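most_common(1) returns a single entry, so when two times share the highest count only one of them is printed. One possible way to report every tied time is sketched below. The helper tied_most_common and the sample rows are hypothetical and not part of the collections module.

```python
from collections import Counter, defaultdict


def tied_most_common(counter):
    """Return every (key, count) pair that shares the highest count."""
    if not counter:
        return []
    best = max(counter.values())
    return [(key, count) for key, count in counter.items() if count == best]


# Hypothetical sample rows: (id, time, count); id 7 has a tie.
rows = [("7", "09", 2), ("7", "10", 1), ("7", "10", 1), ("8", "05", 1)]
counts = defaultdict(Counter)
for id_, time, count in rows:
    counts[id_][time] += count

counter_result = {id_: tied_most_common(c) for id_, c in counts.items()}
print(counter_result)
```

This scans each Counter twice (once for the maximum, once to filter), which is cheap compared with reading the input.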
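A pandas version (requires pandas). Sum the counts for each (ID, time) pair, then keep the time(s) with the largest total for each ID:
import pandas

# dtype keeps times such as "00" as text instead of turning them into 0
d = pandas.read_table('test.txt', sep=r'\s+', dtype={'time': str})
# total count for each (ID, time) pair
totals = d.groupby(['ID', 'time'], as_index=False)['count'].sum()
# keep the row(s) whose total equals the maximum for that ID; ties are kept
perid = totals[totals['count'] == totals.groupby('ID')['count'].transform('max')]
print(perid)
If you want the output to look exactly like you said, you need a little more work:
for _, row in perid.iterrows():
    print([row['ID'], row['time'], row['count']])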