
Why is my Python numpy code so slow?

For example, for gr = np.array([5, 4, 3, 5, 2]) and genx = np.array(["femy_gen_m", "my_gen_m", "my_gen_m", "femy_gen_m", "my_gen_m"]), the expected output is {'my_gen_m': 3.0, 'femy_gen_m': 5.0}. Hint: use mean from numpy.

I wrote the function against a unittest already written by my teacher, but the function runs too slowly.

My code is attached below.

from timeit import timeit
import numpy as np


# my code
def mean_by_redneg(gr, genx):
    result = {}
    my_gen_m_sum, femy_gen_m_sum = [], []
    # collect the grades for each group in plain Python lists
    for index, element in enumerate(genx):
        if element == 'my_gen_m':
            my_gen_m_sum.append(gr[index])
        if element == 'femy_gen_m':
            femy_gen_m_sum.append(gr[index])
    # convert each list to an array and take its mean
    result['my_gen_m'] = np.asarray(my_gen_m_sum).mean()
    result['femy_gen_m'] = np.asarray(femy_gen_m_sum).mean()
    return result

# check the function
def test(gr, genx, outp):
    ret = mean_by_redneg(np.array(gr), np.array(genx))
    assert np.isclose(ret['femy_gen_m'], outp['femy_gen_m'])
    assert np.isclose(ret['my_gen_m'], outp['my_gen_m'])


test([5, 4, 3, 5, 2], ["femy_gen_m", "my_gen_m", "my_gen_m", "femy_gen_m", "my_gen_m"], {'my_gen_m': 3.0, 'femy_gen_m': 5.0})
test([1, 0] * 10, ['femy_gen_m', 'my_gen_m'] * 10, {'femy_gen_m': 1, 'my_gen_m': 0})
test(range(100), ['femy_gen_m', 'my_gen_m'] * 50, {'femy_gen_m': 49.0, 'my_gen_m': 50.0})
test(list(range(100)) + [100], ['my_gen_m'] * 100 + ['femy_gen_m'], {'my_gen_m': 49.5, 'femy_gen_m': 100.0})


# reference benchmark: a deliberately slow pure-Python loop
# that the solution must beat by a factor of 10 (see the final assert)
def bm_test(a, b):
    xx = 0
    yy = 0
    im = 0
    fi = 0
    for x, y in zip(a, b):
        if x != y:
            xx += x
            yy += x
            im += 1
            fi += 1
    return xx + yy


N = int(1E5)

gr = np.array([1.1] * N + [2.2] * N)
genx = np.array(['my_gen_m'] * N + ['femy_gen_m'] * N)

bm = timeit("assert np.isclose(mean_by_redneg(gr, genx)['my_gen_m'], 1.1)",
                   "from __main__ import np, mean_by_redneg, gr, genx",
                   number=1)
reference_bm = timeit("bm_test(gr, genx)",
                             "from __main__ import bm_test, gr, genx",
                             number=1)

assert reference_bm > bm * 10, "too slow"

Do you have any idea how to make this faster? PS: thank you for your time.

The vectorized way to do this in numpy is much simpler than your loop-based code. The heart of it would be something like:

out = {}
for gen in ['my_gen_m', 'femy_gen_m']:
    out[gen] = gr[genx == gen].mean()

How this works:

genx == gen resolves to an array of True and False values called a boolean index (or mask). When gr is indexed by it, it returns the values of gr at the positions where the mask is True. So when gen is 'my_gen_m', gr[genx == gen] is the array of grades belonging to 'my_gen_m'. Once you have that array, you call its .mean() method to compute the mean and assign the result to the dictionary.
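A concrete illustration of boolean indexing, using the sample data from the question (runnable as-is):

```python
import numpy as np

gr = np.array([5, 4, 3, 5, 2])
genx = np.array(["femy_gen_m", "my_gen_m", "my_gen_m", "femy_gen_m", "my_gen_m"])

mask = genx == "my_gen_m"   # boolean index: [False, True, True, False, True]
print(gr[mask])             # the grades at the True positions: 4, 3, 2
print(gr[mask].mean())      # their mean: 3.0
```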

This is significantly faster because the iteration and indexing happen inside numpy's compiled C backend instead of in interpreted Python code.

Use the following function:

def mean_by_gender2(grades, genders):
    return { g: grades[genders == g].mean() for g in np.unique(genders) }
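For instance, on the sample data from the question this returns one mean per unique label (note that np.unique sorts the labels, so the key order may differ from the dict shown in the task):

```python
import numpy as np

def mean_by_gender2(grades, genders):
    # one boolean-indexed mean per unique label in genders
    return {g: grades[genders == g].mean() for g in np.unique(genders)}

gr = np.array([5, 4, 3, 5, 2])
genx = np.array(["femy_gen_m", "my_gen_m", "my_gen_m", "femy_gen_m", "my_gen_m"])

result = mean_by_gender2(gr, genx)
# femy_gen_m -> 5.0, my_gen_m -> 3.0
print(result)
```

A side benefit of this version is that it handles any set of labels, not just the two hard-coded ones.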

Comparison of execution times (using %timeit ) shows:

  1. Admittedly, for very short test data (5 items in each array), your solution is faster (yours: 36 µs, mine: 52.9 µs).

  2. But with longer test data (100 items in each array), my solution wins (yours: 99.5 µs, mine: 62.6 µs).

For even longer source data the advantage of the vectorized solution should be even more apparent.
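To reproduce a comparison like this yourself, here is a minimal timeit sketch comparing the two versions; the array size N and the repeat count are illustrative choices, and exact timings will vary by machine:

```python
from timeit import timeit

import numpy as np

def mean_by_redneg(gr, genx):
    # loop-based version from the question
    result = {}
    my_gen_m_sum, femy_gen_m_sum = [], []
    for index, element in enumerate(genx):
        if element == 'my_gen_m':
            my_gen_m_sum.append(gr[index])
        if element == 'femy_gen_m':
            femy_gen_m_sum.append(gr[index])
    result['my_gen_m'] = np.asarray(my_gen_m_sum).mean()
    result['femy_gen_m'] = np.asarray(femy_gen_m_sum).mean()
    return result

def mean_by_gender2(grades, genders):
    # vectorized version from the answer
    return {g: grades[genders == g].mean() for g in np.unique(genders)}

N = 100  # items per array; increase this to see the gap widen
gr = np.arange(N, dtype=float)
genx = np.array(['my_gen_m', 'femy_gen_m'] * (N // 2))

t_loop = timeit(lambda: mean_by_redneg(gr, genx), number=1000)
t_vec = timeit(lambda: mean_by_gender2(gr, genx), number=1000)
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```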
