Optimal/fastest way to count dates in a Python list
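I have a list of dates, and my goal is to count the occurrences of each date while preserving the order in which they appear in the original list. Consider the following example:
The list only_dates looks like this: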
[datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]
I am trying to use groupby:
import itertools
day_wise_counts = [(k, len(list(g))) for k, g in itertools.groupby(only_dates)]
print(str(day_wise_counts))
This prints:
[(datetime.date(2017, 3, 10), 1), (datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 1), (datetime.date(2017, 3, 11), 1)]
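I understand this happens because each date object ends up being treated as a different date object during grouping.
I was expecting the output to be: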
[(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]
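I am not necessarily looking for a list of tuples. A dictionary output would also be fine, as long as the original date order is preserved (perhaps an OrderedDict).
How can I achieve this?
UPDATE: Several of the suggested approaches will probably work. I should mention, though, that I will be doing this on a large amount of data, so it would be great if your solution is optimal in terms of running time. If you can, please edit your answers/comments accordingly.
UPDATE 2: The data size can be up to 1 million rows.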
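A note on the groupby attempt above: itertools.groupby only merges consecutive equal keys, so equal dates that are not adjacent end up in separate groups. The following is a minimal sketch of that behavior, assuming a hypothetical list named shuffled whose order matches the printed output in the question (the name and the ordering are assumptions, not taken from the asker's data).

```python
import datetime
import itertools

# Hypothetical input: the two 2017-03-10 entries are not adjacent.
shuffled = [
    datetime.date(2017, 3, 10),
    datetime.date(2017, 3, 9),
    datetime.date(2017, 3, 10),
    datetime.date(2017, 3, 11),
]

# groupby starts a new group every time the key changes,
# so it counts runs of equal neighbours, not global occurrences.
runs = [(k, len(list(g))) for k, g in itertools.groupby(shuffled)]
print(runs)
```

Each run has length 1 here because the equal dates are separated. Sorting first would make groupby count correctly, but it would discard the original order.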
You could indeed use an OrderedDict:
from collections import OrderedDict
import datetime
inp = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10),
       datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]
odct = OrderedDict()
for item in inp:
    try:
        odct[item] += 1
    except KeyError:
        odct[item] = 1
print(odct)
which prints:
OrderedDict([(datetime.date(2017, 3, 9), 1),
(datetime.date(2017, 3, 10), 2),
(datetime.date(2017, 3, 11), 1)])
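On Python 3.7 and later the built-in dict is guaranteed to preserve insertion order, so the same loop can probably be written without OrderedDict. This is a minimal sketch under that assumption; the function name count_in_order is hypothetical and does not come from the answers in this thread.

```python
import datetime

def count_in_order(items):
    """Count occurrences, keeping first-seen order.

    Relies on dict preserving insertion order (guaranteed from Python 3.7).
    """
    counts = {}
    for item in items:
        # dict.get avoids the try/except while keeping a single pass.
        counts[item] = counts.get(item, 0) + 1
    return counts

inp = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10),
       datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]
result = count_in_order(inp)
print(list(result.items()))
```

On older interpreters, where dict order is not guaranteed, the OrderedDict version remains the safer choice.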
You also asked for timings, so here they are:
from collections import OrderedDict, Counter
import datetime
import random
# Functions
def ordereddict(inp):
    odct = OrderedDict()
    for item in inp:
        try:
            odct[item] += 1
        except KeyError:
            odct[item] = 1
    return odct

def dawg(inp):
    cnts = Counter(inp)
    seen = set()
    return [(e, cnts[e]) for e in inp if not (e in seen or seen.add(e))]

def chris1(inp):
    return [(item, inp.count(item)) for item in list(OrderedDict.fromkeys(inp))]

def chris2(inp):
    c = Counter(inp)
    return [(item, c[item]) for item in list(OrderedDict.fromkeys(inp))]

# Taken from answer: https://stackoverflow.com/a/23747652/5393381
class OrderedCounter(Counter, OrderedDict):
    'Counter that remembers the order elements are first encountered'
    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))
    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)

# Timing setup
timings = {ordereddict: [], dawg: [], chris1: [], chris2: [], OrderedCounter: []}
sizes = [2**i for i in range(1, 20)]

# Timing
for size in sizes:
    func_input = [datetime.date(2017, random.randint(1, 12), random.randint(1, 28)) for _ in range(size)]
    for func in timings:
        res = %timeit -o func(func_input)  # if you use IPython, otherwise use the "timeit" module
        timings[func].append(res)
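Outside IPython the %timeit line is not valid Python. The following is a rough stand-in based on the standard timeit module. The helper name best_time and its repeat/number defaults are assumptions chosen for illustration. It returns a plain float in seconds rather than the result object that %timeit -o produces, so any code reading .best would use the float directly.

```python
import datetime
import random
import timeit
from collections import Counter

def best_time(func, arg, repeat=3, number=10):
    """Return the best per-call time in seconds over `repeat` batches of `number` calls."""
    totals = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(totals) / number

# Hypothetical sample input, built the same way as func_input above.
sample = [datetime.date(2017, random.randint(1, 12), random.randint(1, 28))
          for _ in range(1000)]

elapsed = best_time(Counter, sample)
print("Counter on 1000 dates: {:.6f} s per call".format(elapsed))
```

Taking the minimum over several batches follows the usual timeit advice, since slower runs mostly reflect interference from other processes.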
and plot them:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(1)
ax = plt.subplot(111)
for func in timings:
    ax.plot([2**i for i in range(1, 20)],
            [time.best for time in timings[func]],
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
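I timed this on Python 3.5. The approaches using Counter will probably be somewhat slower on Python 2.x (Counter was optimized for Python 3.x). The chris2 and dawg approaches also overlap in the plot, because there is almost no time difference between them.
So, except for @Chris_Rands's first approach and OrderedCounter, the approaches perform very similarly, and the result mainly depends on the number of duplicates in your list.
It is mostly a difference of a factor of 1.5 to 2. Among the 3 "fast" approaches I could not find any real time difference for 1 million items.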
You could use list.count() with a list comprehension that iterates over a list derived from an OrderedDict of the unique ordered dates:
import datetime
from collections import OrderedDict
lst = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]
[(item,lst.count(item)) for item in list(OrderedDict.fromkeys(lst))]
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]
Or similarly, using collections.Counter instead of list.count:
from collections import Counter
c = Counter(lst)
[(item,c[item]) for item in list(OrderedDict.fromkeys(lst))]
# [(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]
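Or use an OrderedCounter.
EDIT: See the excellent benchmarks by @MSeifert.
You can use a Counter to do the counting, then uniquify the original list, attaching the counts as you go, to maintain the order.
Given: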
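For reference, here is a minimal sketch of the OrderedCounter option, reusing the class definition that appears in the benchmark code in this thread. The variable name ordered_counts is an assumption added for illustration.

```python
import datetime
from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict):
    """Counter that remembers the order elements are first encountered."""
    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))
    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)

lst = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10),
       datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]

# Counting and ordering happen in one pass over the list.
ordered_counts = list(OrderedCounter(lst).items())
print(ordered_counts)
```

This keeps the Counter API (most_common, arithmetic on counters) while iterating in first-seen order.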
>>> dates=[datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]
You can do:
>>> from collections import Counter
>>> cnts=Counter(dates)
>>> seen=set()
>>> [(e, cnts[e]) for e in dates if not (e in seen or seen.add(e))]
[(datetime.date(2017, 3, 9), 1), (datetime.date(2017, 3, 10), 2), (datetime.date(2017, 3, 11), 1)]
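Update
You can also sort the Counter back into the order of the original list, using a key function that returns the index of the first entry of each date in that list: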
sorted([(k,v) for k,v in Counter(dates).items()], key=lambda t: dates.index(t[0]))
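(The speed of this depends on how sorted or unsorted your list is...)
Someone mentioned timings!
Here are some timings with a larger example (400,000 dates):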
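The dependence on ordering comes from dates.index(...), which scans from the front of the list for every distinct date. One way to avoid those repeated scans is to record the first position of each date in a single pass and sort on that. This is a sketch of that variation, not the answer's own method, and the name first_pos is hypothetical.

```python
import datetime
from collections import Counter

dates = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10),
         datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]

# Remember the index at which each date first appears (single pass).
first_pos = {}
for i, d in enumerate(dates):
    first_pos.setdefault(d, i)

# Sort the counted items by first appearance using a dict lookup
# instead of a linear list.index() scan per distinct date.
ordered = sorted(Counter(dates).items(), key=lambda kv: first_pos[kv[0]])
print(ordered)
```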
from __future__ import print_function
import datetime
from collections import Counter
from collections import OrderedDict
def dawg1(dates):
    seen = set()
    cnts = Counter(dates)
    return [(e, cnts[e]) for e in dates if not (e in seen or seen.add(e))]

def od_(dates):
    odct = OrderedDict()
    for item in dates:
        try:
            odct[item] += 1
        except KeyError:
            odct[item] = 1
    return odct

def lc_(lst):
    return [(item, lst.count(item)) for item in list(OrderedDict.fromkeys(lst))]

def dawg2(dates):
    return sorted([(k, v) for k, v in Counter(dates).items()], key=lambda t: dates.index(t[0]))

if __name__ == '__main__':
    import timeit
    dates = [datetime.date(2017, 3, 9), datetime.date(2017, 3, 10), datetime.date(2017, 3, 10), datetime.date(2017, 3, 11)]*100000
    # Time the functions defined above (names match the printed results)
    for f in (dawg1, od_, lc_, dawg2):
        print(" {:^10s}{:.4f} secs {}".format(f.__name__, timeit.timeit("f(dates)", setup="from __main__ import f, dates", number=100), f(dates)))
Prints (on Python 2.7):
dawg1 10.7253 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
od_ 21.8186 secs OrderedDict([(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)])
lc_ 17.0879 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
dawg2 8.6058 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
PyPy:
dawg1 7.1483 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
od_ 4.7551 secs OrderedDict([(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)])
lc_ 27.8438 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
dawg2 4.7673 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
Python 3.6:
dawg1 3.4944 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
od_ 4.6541 secs OrderedDict([(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)])
lc_ 2.7440 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
dawg2 2.1330 secs [(datetime.date(2017, 3, 9), 100000), (datetime.date(2017, 3, 10), 200000), (datetime.date(2017, 3, 11), 100000)]
Best.