How to efficiently process a list that is continuously being appended with new items in Python

Objective:

To visualize the population size of a particular organism over a finite time.

Assumptions:

  • The organism has a life span of age_limit days.
  • Only females that have reached the age of day_lay_egg days can lay eggs, and a female may lay eggs at most max_lay_egg times. In each breeding session, at most egg_no eggs can be laid, each with a 50% probability of producing male offspring.
  • The initial population of 3 organisms consists of 2 females and 1 male.

Code Snippets:

Currently, the code below produces the expected output:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


def get_breeding(d,**kwargs):

    if d['lay_egg'] <= kwargs['max_lay_egg'] and d['dborn'] > kwargs['day_lay_egg'] and d['s'] == 1:
            nums = np.random.choice([0, 1], size=kwargs['egg_no'], p=[.5, .5]).tolist()
            npol=[dict(s=x,d=d['d'], lay_egg=0, dborn=0) for x in nums]
            d['lay_egg'] = d['lay_egg'] + 1
            return d,npol

    return d,None



def to_loop_initial_population(**kwargs):

    npol=kwargs['ipol']
    nday = 0
    total_population_per_day = []
    while nday < kwargs['nday_limit']:
        # print(f'Executing day {nday}')

        k = []
        for dpol in npol:
            dpol['d'] += 1
            dpol['dborn'] += 1
            dpol,h = get_breeding(dpol,**kwargs)

            if h is None and dpol['dborn'] <= kwargs['age_limit']:
                # If beyond the age limit, ignore the parent and update only the decedent 
                k.append(dpol)
            elif isinstance(h, list) and dpol['dborn'] <= kwargs['age_limit']:
                # If below age limit, append the parent and its offspring
                h.extend([dpol])
                k.extend(h)

        total_population_per_day.append(dict(nsize=len(k), day=nday))
        nday += 1
        npol = k

    return total_population_per_day


## Some spec and store all  setting in a dict   
numsex=[1,1,0] # 0: Male, 1: Female

# s: sex, d: day, lay_egg: Number of time the female lay an egg, dborn: The organism age
ipol=[dict(s=x,d=0, lay_egg=0, dborn=0) for x in numsex] # The initial population
age_limit = 45 # Age limit for the species
egg_no=3 # Number of eggs
day_lay_egg = 30  # Matured age for egg laying
nday_limit=360
max_lay_egg=2
para=dict(nday_limit=nday_limit,ipol=ipol,age_limit=age_limit,
          egg_no=egg_no,day_lay_egg=day_lay_egg,max_lay_egg=max_lay_egg)


dpopulation = to_loop_initial_population(**para)


### make some plot
df = pd.DataFrame(dpopulation)
sns.lineplot(x="day", y="nsize", data=df)
plt.xticks(rotation=15)
plt.title('Day vs population')
plt.show()

Output:

Problem/Question:

The execution time increases exponentially with nday_limit. I need to improve the efficiency of the code. How can I speed up the running time?

Other Thoughts:

I am tempted to apply joblib as below. To my surprise, the execution time is worse.

from joblib import Parallel, delayed
import itertools


def djob(dpol,k,**kwargs):
    dpol['d'] = dpol['d'] + 1
    dpol['dborn'] = dpol['dborn'] + 1
    dpol,h = get_breeding(dpol,**kwargs)

    if h is None and dpol['dborn'] <= kwargs['age_limit']:
        # If beyond the age limit, ignore the that particular subject
        k.append(dpol)
    elif isinstance(h, list) and dpol['dborn'] <= kwargs['age_limit']:
        # If below age limit, append the parent and its offspring
        h.extend([dpol])
        k.extend(h)

    return k
def to_loop_initial_population(**kwargs):

    npol=kwargs['ipol']
    nday = 0
    total_population_per_day = []
    while nday < kwargs['nday_limit']:


        k = []


        njob=1 if len(npol)<=50 else 4
        if njob==1:
            print(f'Executing day {nday} with single cpu')
            for dpols in npol:
                k=djob(dpols,k,**kwargs)
        else:
            print(f'Executing day {nday} in parallel')
            k=Parallel(n_jobs=-1)(delayed(djob)(dpols,k,**kwargs) for dpols in npol)
            k = list(itertools.chain(*k))


        total_population_per_day.append(dict(nsize=len(k), day=nday))
        nday += 1
        npol = k

    return total_population_per_day

for

nday_limit=365

Your code looks alright overall, but I can see several points that are slowing it down significantly.

Note, though, that you can't entirely avoid the slowdown as nday increases, since the population you need to track keeps growing and you keep re-populating a list to track it. As the number of objects increases, the loops are expected to take longer to complete, but you can reduce the time it takes to complete a single loop.

elif isinstance(h, list) and dpol['dborn'] <= kwargs['age_limit']:

Here you check the type of h on every iteration, after already checking whether it's None. You know for a fact that h is going to be a list; if it weren't, your code would already have raised an error before reaching that line, because the list could not have been created.

Furthermore, you check the age of dpol twice, and you first extend h by dpol and then k by h, which is redundant. Together with the previous issue, this can be simplified to:

if dpol['dborn'] <= kwargs['age_limit']:
    k.append(dpol)

if h:
    k.extend(h)

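The resulting population counts are identical for the parameters in the question. (The simplified version also keeps the offspring of a parent that breeds after passing age_limit, which the original drops, but with these parameters no female breeds that late.)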
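One way to check this kind of claim is to run both branch structures from the same random seed and compare the daily counts. The sketch below is only an illustration: the helper names breed and run are hypothetical, and the standard-library random module stands in for np.random.choice so that the snippet is self-contained.

```python
import random


def breed(d, rng, egg_no, max_lay_egg, day_lay_egg):
    """Return a list of newborn dicts, or None if d does not breed today."""
    if d['lay_egg'] <= max_lay_egg and d['dborn'] > day_lay_egg and d['s'] == 1:
        d['lay_egg'] += 1
        # Each egg is female (1) or male (0) with equal probability.
        return [dict(s=rng.randint(0, 1), d=d['d'], lay_egg=0, dborn=0)
                for _ in range(egg_no)]
    return None


def run(simplified, seed, nday_limit=100, age_limit=45, egg_no=3,
        day_lay_egg=30, max_lay_egg=2):
    """Simulate and return the population size per day."""
    rng = random.Random(seed)
    npol = [dict(s=s, d=0, lay_egg=0, dborn=0) for s in (1, 1, 0)]
    sizes = []
    for _ in range(nday_limit):
        k = []
        for dpol in npol:
            dpol['d'] += 1
            dpol['dborn'] += 1
            h = breed(dpol, rng, egg_no, max_lay_egg, day_lay_egg)
            alive = dpol['dborn'] <= age_limit
            if simplified:
                # Simplified branches from the answer.
                if alive:
                    k.append(dpol)
                if h:
                    k.extend(h)
            else:
                # Branch structure from the question.
                if h is None and alive:
                    k.append(dpol)
                elif isinstance(h, list) and alive:
                    h.extend([dpol])
                    k.extend(h)
        sizes.append(len(k))
        npol = k
    return sizes


if __name__ == "__main__":
    same = all(run(False, seed) == run(True, seed) for seed in range(5))
    print("identical daily counts:", same)
```

The two variants store parents and offspring in a different order, but all newborns of a given day behave identically afterwards, so the daily totals do not depend on that order.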

Additionally, you're passing around a lot of **kwargs. This is a sign that your code should be a class instead, where the unchanging parameters are stored as attributes such as self.parameter. You could even use a dataclass here (https://docs.python.org/3/library/dataclasses.html).
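As a rough illustration of that idea, the unchanging settings could live in a small frozen dataclass instead of a kwargs dict. The class name SimParams and the method can_breed are made up for this sketch; the field names and default values are taken from the question.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimParams:
    """Simulation settings that never change during a run (hypothetical name)."""
    age_limit: int = 45      # life span in days
    egg_no: int = 3          # eggs per breeding session
    day_lay_egg: int = 30    # age after which a female can lay eggs
    max_lay_egg: int = 2     # upper bound used in the question's check
    nday_limit: int = 360    # number of simulated days

    def can_breed(self, d):
        """Same condition as in the question, reading settings from self."""
        return (d['lay_egg'] <= self.max_lay_egg
                and d['dborn'] > self.day_lay_egg
                and d['s'] == 1)


if __name__ == "__main__":
    params = SimParams()
    short_run = replace(params, nday_limit=100)  # copy with one field changed
    print(short_run)
    print(params.can_breed(dict(s=1, d=31, lay_egg=0, dborn=31)))
```

Because the dataclass is frozen, an accidental assignment such as params.age_limit = 10 raises an error instead of silently changing the run.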

Also, you mix the responsibilities of functions, which is unnecessary and makes your code more confusing. For instance:

def get_breeding(d,**kwargs):

    if d['lay_egg'] <= kwargs['max_lay_egg'] and d['dborn'] > kwargs['day_lay_egg'] and d['s'] == 1:
            nums = np.random.choice([0, 1], size=kwargs['egg_no'], p=[.5, .5]).tolist()
            npol=[dict(s=x,d=d['d'], lay_egg=0, dborn=0) for x in nums]
            d['lay_egg'] = d['lay_egg'] + 1
            return d,npol

    return d,None

This code has two responsibilities: checking whether the breeding conditions are met and generating new individuals if they are. It also returns two different things depending on the outcome.

This would be better done with two separate functions, one that simply checks the conditions and another that generates the new individuals, as follows:

def check_breeding(d, max_lay_egg, day_lay_egg):
    return d['lay_egg'] <= max_lay_egg and d['dborn'] > day_lay_egg and d['s'] == 1


def get_breeding(d, egg_no):
    nums = np.random.choice([0, 1], size=egg_no, p=[.5, .5]).tolist()
    npol=[dict(s=x, d=d['d'], lay_egg=0, dborn=0) for x in nums]
    return npol

Here d['lay_egg'] can be updated in place while iterating over the list, whenever the condition is met.
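Put together, one simulated day with the split functions might look like the sketch below. The function step_day and the rng argument are assumptions added for illustration, and the standard-library random module replaces np.random.choice to keep the snippet dependency-free.

```python
import random


def check_breeding(d, max_lay_egg, day_lay_egg):
    """True if d is a female that is old enough and may still lay eggs."""
    return d['lay_egg'] <= max_lay_egg and d['dborn'] > day_lay_egg and d['s'] == 1


def get_breeding(d, egg_no, rng):
    """Create egg_no newborns, each female (1) or male (0) with equal odds."""
    return [dict(s=rng.randint(0, 1), d=d['d'], lay_egg=0, dborn=0)
            for _ in range(egg_no)]


def step_day(npol, age_limit, egg_no, day_lay_egg, max_lay_egg, rng):
    """Advance the population by one day and return the new list."""
    k = []
    for dpol in npol:
        dpol['d'] += 1
        dpol['dborn'] += 1
        if check_breeding(dpol, max_lay_egg, day_lay_egg):
            k.extend(get_breeding(dpol, egg_no, rng))
            dpol['lay_egg'] += 1  # state update happens in the loop, in place
        if dpol['dborn'] <= age_limit:
            k.append(dpol)
    return k


if __name__ == "__main__":
    rng = random.Random(0)
    population = [dict(s=1, d=30, lay_egg=0, dborn=30)]
    population = step_day(population, 45, 3, 30, 2, rng)
    print(len(population))
```

Each function now does one thing: check_breeding answers a yes/no question, get_breeding only builds newborns, and the loop decides what to do with the answers.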

You could speed up your code even further this way if you edit the list as you iterate over it. This is not typically recommended, but it's perfectly fine if you know what you're doing: iterate by index, limit the index to the length the list had before the loop started, and decrement the index when an element is removed.

Example:

i = 0
maxiter = len(npol)
while i < maxiter:
    if check_breeding(npol[i], max_lay_egg, day_lay_egg):
        npol.extend(get_breeding(npol[i], egg_no))

    if npol[i]['dborn'] > age_limit:
        npol.pop(i)
        i -= 1
        maxiter -= 1

    i += 1  # advance to the next element (or re-check slot i after a removal)

This could significantly reduce processing time, since you're not building a new list and re-appending all elements on every iteration.
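One caveat: list.pop(i) has to shift every element after position i, so removing many organisms from the front of a long list is itself costly. If the order of the list does not matter (it does not for counting), a swap-and-pop removal avoids the shifting. The following is a sketch of that variation, not the approach used above; step_day_in_place and the make_newborns callback are hypothetical names, and newborns are collected separately so that the last list element is always one that still needs processing.

```python
def step_day_in_place(npol, age_limit, day_lay_egg, max_lay_egg, make_newborns):
    """Advance npol by one day in place; survivor order is not preserved."""
    newborns = []
    i = 0
    n = len(npol)
    while i < n:
        d = npol[i]
        d['dborn'] += 1
        if d['s'] == 1 and d['lay_egg'] <= max_lay_egg and d['dborn'] > day_lay_egg:
            newborns.extend(make_newborns(d))
            d['lay_egg'] += 1
        if d['dborn'] > age_limit:
            npol[i] = npol[n - 1]  # move the last unprocessed item into slot i
            npol.pop()             # removing the tail needs no shifting
            n -= 1
            continue               # slot i now holds an unprocessed item
        i += 1
    npol.extend(newborns)          # newborns are only added after the pass
    return npol


if __name__ == "__main__":
    def two_females(parent):
        # Deterministic stand-in for the random egg draw.
        return [dict(s=1, lay_egg=0, dborn=0) for _ in range(2)]

    pop = [dict(s=0, lay_egg=0, dborn=45),
           dict(s=1, lay_egg=0, dborn=30),
           dict(s=0, lay_egg=0, dborn=10)]
    step_day_in_place(pop, 45, 30, 2, two_females)
    print(sorted(d['dborn'] for d in pop))
```

Whether this is measurably faster depends on how many organisms die per day relative to the list length, so it is worth timing before adopting it.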

Finally, you could look into population growth equations and statistical methods, and you could even reduce this whole code to an iterative calculation, though that wouldn't be a simulation anymore.

Edit

I've fully implemented my suggested improvements and timed both versions in a Jupyter notebook using %%time. I kept the function definitions out of the timed cells so they wouldn't contribute to the time, and the results are telling. I also made females produce another female 100% of the time to remove randomness; otherwise it would be even faster. I compared the results from both to verify that they are identical (they are, but I removed the 'd' parameter because it is only set and never used elsewhere in the code).

Your implementation, with nday_limit=100 and day_lay_egg=15:
Wall time 23.5s

My implementation with the same parameters:
Wall time 18.9s

So you can see the difference is significant, and it grows even larger for larger nday_limit values.
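Outside a notebook, where %%time is not available, a comparable measurement can be taken with time.perf_counter. The helper below is a minimal sketch; time_call is a made-up name, and the absolute numbers will of course differ between machines.

```python
import time


def time_call(fn, *args, repeat=3, **kwargs):
    """Call fn several times and return (best_seconds, last_result)."""
    best = float('inf')
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


if __name__ == "__main__":
    seconds, total = time_call(sum, range(1_000_000))
    print(f"best of 3: {seconds:.4f}s, result={total}")
```

When timing a simulation that mutates its input list, wrap the call so each repetition builds a fresh initial population; otherwise later repetitions start from the already grown population.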

Full implementation of the edited code:

from dataclasses import dataclass
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


@dataclass
class Organism:
    sex: int  # 0: Male, 1: Female
    times_laid_eggs: int = 0
    age: int = 0


def check_breeding(d, max_lay_egg, day_lay_egg):
    return d.times_laid_eggs <= max_lay_egg and d.age > day_lay_egg and d.sex == 1


def get_breeding(egg_no): # Make sure to change probabilities back to 0.5 and 0.5 before using it
    nums = np.random.choice([0, 1], size=egg_no, p=[0.0, 1.0]).tolist()
    npol = [Organism(x) for x in nums]
    return npol


def simulate(organisms, age_limit, egg_no, day_lay_egg, max_lay_egg, nday_limit):
    npol = organisms
    nday = 0
    total_population_per_day = []

    while nday < nday_limit:
        i = 0
        maxiter = len(npol)
        while i < maxiter:
            npol[i].age += 1
            
            if check_breeding(npol[i], max_lay_egg, day_lay_egg):
                npol.extend(get_breeding(egg_no))
                npol[i].times_laid_eggs += 1

            if npol[i].age > age_limit:
                npol.pop(i)
                maxiter -= 1
                continue

            i += 1

        total_population_per_day.append(dict(nsize=len(npol), day=nday))
        nday += 1

    return total_population_per_day


if __name__ == "__main__":
    numsex = [1, 1, 0]  # 0: Male, 1: Female

    ipol = [Organism(x) for x in numsex]  # The initial population
    age_limit = 45  # Age limit for the species
    egg_no = 3  # Number of eggs
    day_lay_egg = 15  # Matured age for egg laying
    nday_limit = 100
    max_lay_egg = 2

    dpopulation = simulate(ipol, age_limit, egg_no, day_lay_egg, max_lay_egg, nday_limit)

    df = pd.DataFrame(dpopulation)
    sns.lineplot(x="day", y="nsize", data=df)
    plt.xticks(rotation=15)
    plt.title('Day vs population')
    plt.show()

Try structuring your code as a matrix like state[age][eggs_remaining] = count instead. It will have age_limit rows and max_lay_egg columns.

Males start in the 0 eggs_remaining column, and every time a female lays eggs she moves down one column (3->2->1->0 with your code above).

For each cycle, you just drop the last row, iterate over all the rows after day_lay_egg to handle egg laying, and insert a new first row with the number of newborn males and females.

If (as in your example) there is only a vanishingly small chance that a female would die of old age before laying all her eggs, you can collapse everything into a state_alive[age][gender] = count and a state_eggs[eggs_remaining] = count instead, but that shouldn't be necessary unless the age limit gets really high or you want to run thousands of simulations.
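A count-based version of this idea could look like the sketch below. It is one possible reading of the suggestion rather than a definitive implementation: simulate_counts and draw_females are hypothetical names, females are indexed by the number of times they have already laid (instead of eggs remaining), males are kept in a separate per-age list, and the question's `<=` check is mirrored, which is why there is one more column than max_lay_egg + 1.

```python
import random


def simulate_counts(nday_limit, age_limit, egg_no, day_lay_egg, max_lay_egg,
                    n_female=2, n_male=1, draw_females=None, seed=0):
    """Track counts per (age, lays_done) instead of individual organisms."""
    rng = random.Random(seed)
    if draw_females is None:
        # Each egg is female with probability 0.5.
        def draw_females(n_eggs):
            return sum(rng.random() < 0.5 for _ in range(n_eggs))

    n_lay = max_lay_egg + 2  # lays_done runs from 0 to max_lay_egg + 1
    females = [[0] * n_lay for _ in range(age_limit + 1)]  # females[age][lays_done]
    males = [0] * (age_limit + 1)                          # males[age]
    females[0][0] = n_female
    males[0] = n_male

    sizes = []
    for _ in range(nday_limit):
        # Age everyone by one day: the oldest row dies, a new age-0 row appears.
        females.pop()
        females.insert(0, [0] * n_lay)
        males.pop()
        males.insert(0, 0)

        eggs = 0
        for age in range(day_lay_egg + 1, age_limit + 1):
            row = females[age]
            # Go from high to low so a cohort is moved only once per day.
            for lays in range(max_lay_egg, -1, -1):
                eggs += row[lays] * egg_no
                row[lays + 1] += row[lays]
                row[lays] = 0

        new_females = draw_females(eggs)
        females[0][0] += new_females
        males[0] += eggs - new_females
        sizes.append(sum(map(sum, females)) + sum(males))
    return sizes


if __name__ == "__main__":
    sizes = simulate_counts(nday_limit=360, age_limit=45, egg_no=3,
                            day_lay_egg=30, max_lay_egg=2)
    print(sizes[-1])
```

The state has a fixed size of roughly age_limit rows, so the per-day work no longer depends on the population size except for the sex draw, which could be replaced by a single binomial draw such as numpy.random.binomial(eggs, 0.5).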

Using numpy array operations as much as possible instead of loops can improve performance. See the code below, tested in this notebook: https://www.kaggle.com/gfteafun/notebook03118c731b

Note that when comparing the times, the nsize scale matters.

%%time

# s: sex, d: day, lay_egg: Number of time the female lay an egg, dborn: The organism age
x = np.array([(x, 0, 0, 0) for x in numsex ] )
iparam = np.array([0, 1, 0, 1])

total_population_per_day = []
for nday in range(nday_limit):
    x = x + iparam
    c = np.all(x < np.array([2, nday_limit, max_lay_egg, age_limit]), axis=1) & np.all(x >= np.array([1, day_lay_egg, 0, day_lay_egg]), axis=1)
    total_population_per_day.append(dict(nsize=len(x[x[:,3]<age_limit, :]), day=nday))
    n = x[c, 2].shape[0]

    if n > 0:
        x[c, 2] = x[c, 2] + 1
        newborns = np.array([(x, nday, 0, 0) for x in np.random.choice([0, 1], size=egg_no, p=[.5, .5]) for i in range(n)])
        x = np.vstack((x, newborns))

df = pd.DataFrame(total_population_per_day)
sns.lineplot(x="day", y="nsize", data=df)
plt.xticks(rotation=15)
plt.title('Day vs population')
plt.show()
