
Looking for a more efficient/pythonic way to sum tuples in a list, and compute an average

I am trying to do some basic computations with data from the web. For this purpose, I found some code that extracts the begin and end years of Rembrandt works and saves them in a list:

date_list = [(work['datebegin'], work['dateend']) for work in rembrandt2_parsed['records']]

date_list is a list containing the tuples with begin and end years for some Rembrandt works in the Harvard Art Museum. For the sake of completeness, it looks like this:

[(0, 0), (1648, 1648), (1637, 1647), (1626, 1636), (0, 0), (1638, 1638), (1635, 1635), (1634, 1634), (0, 0), (0, 0)]

Now I want to sum over this list of tuples and compute the average begin year and end year, ignoring entries that are 0 (i.e. missing). I came up with this solution:

datebegin = 0
date_end = 0
count_begin = 0
count_end = 0

for x, y in date_list:
    if x != 0:
        datebegin += x
        count_begin += 1
    if y != 0:
        date_end += y
        count_end += 1

final_date_begin = datebegin / count_begin  # value = year 1636
final_date_end = date_end / count_end  # value = year 1639

But I think this can be done more efficiently and more Pythonically. First, I seem to need a lot of code for such a simple task; second, I have to initialize four(!) global variables when I do it this way. Could someone show me a more efficient way to solve this?

Non-numpy solution:

lst = [(0, 0), (1648, 1648), (1637, 1647), (1626, 1636), (0, 0), (1638, 1638), (1635, 1635), (1634, 1634), (0, 0), (0, 0)]

print(sum(x[0] for x in lst) / sum(x[0] != 0 for x in lst))
# 1636.3333333333333
print(sum(x[1] for x in lst) / sum(x[1] != 0 for x in lst))
# 1639.6666666666667
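
The denominator above works because `x[0] != 0` evaluates to a bool, and Python sums `True` as 1, so it counts the non-zero entries. As a minimal sketch of the same idea wrapped in a reusable function (the name `mean_nonzero` is hypothetical, and returning `None` for an all-zero column is an assumption of this sketch, not something the question asks for):

```python
def mean_nonzero(values):
    """Average the non-zero entries; return None if there are none."""
    total = 0
    count = 0
    for v in values:
        if v != 0:
            total += v
            count += 1
    # Guard against ZeroDivisionError when every entry is 0
    return total / count if count else None


lst = [(0, 0), (1648, 1648), (1637, 1647), (1626, 1636), (0, 0),
       (1638, 1638), (1635, 1635), (1634, 1634), (0, 0), (0, 0)]

begin_avg = mean_nonzero(x[0] for x in lst)
end_avg = mean_nonzero(x[1] for x in lst)
print(begin_avg, end_avg)
```

This walks each column once instead of twice, at the cost of a few more lines.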

NumPy and list comprehensions are your friends here.

import numpy as np

date_list = [(0, 0), (1648, 1648), (1637, 1647), (1626, 1636), (0, 0),
             (1638, 1638), (1635, 1635), (1634, 1634), (0, 0), (0, 0)]
final_date_begin = np.mean([x for x, y in date_list if x != 0])
final_date_end = np.mean([y for x, y in date_list if y != 0])

In pure Python:

# Keep only the records where both years are non-zero
starts = [s for s, e in date_list if s and e]
ends = [e for s, e in date_list if s and e]

start_avg = sum(starts) / len(starts)
end_avg = sum(ends) / len(ends)
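
Note that the condition `s and e` keeps a record only when both years are non-zero, whereas the loop in the question filters each column on its own. The two agree on the sample data, but they would differ for a record such as `(1640, 0)`. A possible variant that filters the columns independently, assuming the standard-library `statistics.mean` is acceptable here (the names `begins` and `ends_all` are just illustrative):

```python
from statistics import mean

date_list = [(0, 0), (1648, 1648), (1637, 1647), (1626, 1636), (0, 0),
             (1638, 1638), (1635, 1635), (1634, 1634), (0, 0), (0, 0)]

# zip(*...) transposes the list of pairs into two tuples: begins and ends
begins, ends_all = zip(*date_list)

# Drop zeros from each column separately, then average
start_avg = mean([b for b in begins if b])
end_avg = mean([e for e in ends_all if e])
print(start_avg, end_avg)
```

`statistics.mean` raises `StatisticsError` on an empty list, so a column that is entirely zero fails loudly rather than silently.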

You can use numpy to solve this:

import numpy as np

result = list(np.ma.masked_equal(date_list, 0).mean(axis=0))

Here masked_equal converts date_list to an array and masks out the zero values; we then compute the mean along the first axis, which gives one average per column.

For your sample data, we obtain:

>>> list(np.ma.masked_equal(date_list, 0).mean(axis=0))
[1636.3333333333333, 1639.6666666666667]
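
To make the masking step easier to follow, here is the same computation split into its stages, as a rough sketch (the intermediate names `arr`, `masked`, `counts` and `check` are only for illustration). The last lines cross-check the result against a plain sum divided by the count of non-zero entries:

```python
import numpy as np

date_list = [(0, 0), (1648, 1648), (1637, 1647), (1626, 1636), (0, 0),
             (1638, 1638), (1635, 1635), (1634, 1634), (0, 0), (0, 0)]

arr = np.array(date_list)             # 10 rows, 2 columns (begin, end)
masked = np.ma.masked_equal(arr, 0)   # entries equal to 0 are masked
counts = masked.count(axis=0)         # number of unmasked values per column
result = masked.mean(axis=0).tolist() # per-column mean over unmasked values

# Cross-check without masked arrays: zeros add nothing to the sum,
# so dividing by the non-zero count gives the same averages.
check = arr.sum(axis=0) / (arr != 0).sum(axis=0)
print(counts, result, check)
```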

Performance: for a list containing 100,000 2-tuples, generated with:

from random import randint

date_list = [(randint(0, 10), randint(0, 10)) for _ in range(100000)]

we ran this computation (wrapped in a function f) 1,000 times and obtained:

>>> timeit(f, number=1000)
51.31010195999988

So on my machine, this handles a 100,000×2 "matrix" in about 51.3 ms per run.
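
The definition of `f` is not shown in the answer. A timing setup along these lines would reproduce the measurement, assuming `f` simply wraps the masked-mean expression (that wrapper, and the reduced `runs = 10`, are assumptions of this sketch). Timings depend on the machine, so no expected numbers are given:

```python
from random import randint
from timeit import timeit

import numpy as np

date_list = [(randint(0, 10), randint(0, 10)) for _ in range(100000)]


def f():
    # Assumed wrapper around the masked-mean expression from the answer
    return list(np.ma.masked_equal(date_list, 0).mean(axis=0))


runs = 10  # fewer runs than the 1,000 in the answer, to keep this quick
total_seconds = timeit(f, number=runs)
per_run_ms = total_seconds / runs * 1000
print(f"{per_run_ms:.1f} ms per run")
```

Since `date_list` is a plain list, each call to `f` also pays for converting it to an array, so that conversion is part of the measured time in this setup.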


 