简体   繁体   中英

Optimal/fastest way to count dates in a python list

I have a list of dates and the goal is to count the occurrences of each date while maintaining the order in which they appear in the original list. Consider the following example:

The list only_dates looks like this:

[datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]

I am trying to use groupby :

import itertools
day_wise_counts = [(k, len(list(g))) for k, g in itertools.groupby(only_dates)]
print(str(day_wise_counts))

This prints

[(datetime.date(2017, 3, 10), 1), (datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 1), (datetime.date(2017, 3, 11), 1)]

I understand this is happening because ultimately each date object is being treated as a different one while grouping.

I was expecting the output to be:

[(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]

I am not necessarily looking for a list of tuples. A dictionary output will also suffice as long as the original order of dates is maintained. ( OrderedDict maybe).

How can I achieve this?

Update: There are possible multiple approaches being suggested which all work well. But I should have mentioned that I'll be doing this operation for a large amount data. So it would be great if your solution is optimal one in terms of running time. Please edit your answer/comment accordingly if you can.

Update 2: The size of data can be as large as 1 million rows.

Indeed, you could use an OrderedDict :

from collections import OrderedDict
import datetime

inp = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10),
       datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]

odct = OrderedDict()
for item in inp:
    try:
        odct[item] += 1
    except KeyError:
        odct[item] = 1

print(odct)

which prints:

OrderedDict([(datetime.date(2017, 3, 9), 1),
             (datetime.date(2017, 3, 10), 2),
             (datetime.date(2017, 3, 11), 1)])

You also asked for timings, so here they are:

from collections import OrderedDict, Counter
import datetime
import random

# Functions

def ordereddict(inp):
    odct = OrderedDict()
    for item in inp:
        try:
            odct[item] += 1
        except KeyError:
            odct[item] = 1
    return odct


def dawg(inp):
    cnts=Counter(inp)
    seen=set()
    return [(e, cnts[e]) for e in inp if not (e in seen or seen.add(e))]


def chris1(inp):
    return [(item, inp.count(item)) for item in list(OrderedDict.fromkeys(inp))]


def chris2(inp):
    c = Counter(inp)
    return [(item,c[item]) for item in list(OrderedDict.fromkeys(inp))]


# Taken from answer: https://stackoverflow.com/a/23747652/5393381
class OrderedCounter(Counter, OrderedDict):  
    'Counter that remembers the order elements are first encountered'

    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))

    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)


# Timing setup
timings = {ordereddict: [], dawg: [], chris1: [], chris2: [], OrderedCounter: []}
sizes = [2**i for i in range(1, 20)]

# Timing
for size in sizes:
    func_input = [datetime.date(2017, random.randint(1, 12), random.randint(1, 28)) for _ in range(size)]
    for func in timings:
        res = %timeit -o func(func_input)   # if you use IPython, otherwise use the "timeit" module
        timings[func].append(res)

and plotted:

%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(1)
ax = plt.subplot(111)

for func in timings:
    ax.plot([2**i for i in range(1, 20)], 
            [time.best for time in timings[func]], 
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()

在此输入图像描述

I timed it on Python-3.5. The approaches using Counter will likely be a bit slower on python-2.x ( Counter was optimized for python-3.x). Also the chris2 and dawg approach overlap each other (because there is almost no time difference between them).

So except for the first approach of @Chris_Rands and the OrderedCounter - the approaches perform very similar and mostly depend on the number of duplicates in your list.

It's mostly a factor of 1.5-2 difference. I couldn't find any real time difference for 1 million items bwteen the 3 "fast" approaches.

You could use list.count() with aa list comprehension iterating over a list derived from an OrderedDict of unique ordered dates:

import datetime
from collections import OrderedDict

lst = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]

[(item,lst.count(item)) for item in list(OrderedDict.fromkeys(lst))]
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]

Or similarly using a collections.Counter instead of list.count :

from collections import Counter

c = Counter(lst)

[(item,c[item]) for item in list(OrderedDict.fromkeys(lst))]
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]

Or use an OrderedCounter .

EDIT: see the excellent benchmark by @MSeifert .

You can use a Counter to count then uniqify the original list to maintain order while adding the count.

Given:

>>> dates=[datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]  

You can do:

from collections import Counter

cnts=Counter(dates)
seen=set()
>>> [(e, cnts[e]) for e in dates if not (e in seen or seen.add(e))]
[(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]

Update

You can also sort a Counter back into the order of the original list by using a key function to get the index of the first entry of date(X) in that list:

sorted([(k,v) for k,v in Counter(dates).items()], key=lambda t: dates.index(t[0])) 

(The speed of this is correlated to how ordered or unordered your list is...)


Did someone say timeit!

Here are some timings with a larger example (400,000 dates):

from __future__ import print_function
import datetime
from collections import Counter
from collections import OrderedDict

def dawg1(dates):
    seen=set()
    cnts=Counter(dates)
    return [(e, cnts[e]) for e in dates if not (e in seen or seen.add(e))]

def od_(dates):    
    odct = OrderedDict()
    for item in dates:
        try:
            odct[item] += 1
        except KeyError:
            odct[item] = 1
    return odct

def lc_(lst):
    return [(item,lst.count(item)) for item in list(OrderedDict.fromkeys(lst))]    

def dawg2(dates):
    return sorted([(k,v) for k,v in Counter(dates).items()], key=lambda t: dates.index(t[0]))    

if __name__=='__main__':
    import timeit  
    dates=[datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]*100000
    for f in (dawg, od_, lc_,sort_):
        print("   {:^10s}{:.4f} secs {}".format(f.__name__, timeit.timeit("f(dates)", setup="from __main__ import f, dates", number=100),f(dates))) 

Prints (on Python 2.7):

 dawg1   10.7253 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
  od_    21.8186 secs OrderedDict([(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)])
  lc_    17.0879 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
 dawg2   8.6058 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]0000)]

PyPy:

 dawg1   7.1483 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
  od_    4.7551 secs OrderedDict([(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)])
  lc_    27.8438 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
 dawg2   4.7673 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]

Python 3.6:

 dawg1   3.4944 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
  od_    4.6541 secs OrderedDict([(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)])
  lc_    2.7440 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
 dawg2   2.1330 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]

Best.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM