
Taking too long to run a Python script

The script measures the relationship between students based on the books (codes) they have borrowed. I've built a tree of the different book types using the ete2 package. Now I'm trying to write a piece of code that takes the data from the tree and a CSV file and does some data analysis through the function relationship. The CSV file contains more than 50,000 rows. The problem is that the code takes too long to run (around 7 days), while it uses only 10 to 20% of my computer's CPU and memory.

Here is an example of the CSV file I've used:

ID Code    Count 
1    A1...   6
1    A2...   5
2    A....   4
2    D....   1
2    A1...   2
3    D....   5
3    D1...   3
3    D2...   5

Here is the code:

from __future__ import division  # must come before any other import
import math

import numpy as np
import pandas as pd
from ete2 import Tree


data= pd.read_csv('data.csv', names=['ID','Code', 'Count'])
codes_list= list (set(data['Code']))
total_codes= data.shape[0]
students_list= list (set(data['ID']))



####################################


# generate the tree
t = Tree (".....;", format =0)
for i in codes_list:
    if '....' in i:
        node = t.search_nodes(name = '.....')
        node[0].add_child(name= i)
for i in codes_list:
    if '...' in i and '....' not in i:
        if i[0]+'....' in codes_list:
            node = t.search_nodes(name = i[0]+'....')
            node[0].add_child(name= i)
        else:
            node = t.search_nodes(name = '.....')
            node[0].add_child(name= i)

# save the tree in a file (file_path must be defined beforehand)
t.write(outfile=file_path + "Codes_tree.nh", format=3)
print t.get_ascii(show_internal=True)

####################################

def relationship(code1,code2):

    code1_ancestors = t.search_nodes(name=code1)[0].get_ancestors()
    code2_ancestors = t.search_nodes(name=code2)[0].get_ancestors()
    common_ancestors = []
    for a1 in code1_ancestors:
        for a2 in code2_ancestors:
            if a1 == a2:
                common_ancestors.append(a1)
    IC_values = []
    for ca in common_ancestors:
        code_descendants = []
        for gd in ca.get_descendants():
            code_descendants.append(gd.name)
        code_descendants.append(ca.name)  # names, not node objects, index code_count
        frequency = 0
        # code_count is not defined in the post; presumably per-code counts indexed by Code
        for k in code_descendants:
            frequency = frequency + code_count.Count[k]

        IC = - (math.log (frequency / float (total_codes)))
        IC_values.append (IC)

    IC_max= max(IC_values)
    return IC_max

##################

relationship_matrix = pd.DataFrame(index=students_list, columns=students_list)
for student in students_list:
    p1 = list(data.Code[data.ID == student])
    for student1 in students_list:
        p2 = list(data.Code[data.ID == student1])
        student_l = []
        for l in p1:
            for m in p2:
                student_l.append(relationship(l, m))

        max_score = np.max(np.array(student_l).astype(float))
        relationship_matrix.loc[student, student1] = max_score

print relationship_matrix

There are some "optimisations" you can do. Here are a few examples I can quickly spot (assuming code1_ancestors, code2_ancestors, etc. are lists or something equivalent):

common_ancestors = []
for a1 in code1_ancestors:
    for a2 in code2_ancestors:
        if a1==a2:
            common_ancestors.append(a1)

can be made much faster with:

common_ancestors = set(code1_ancestors) & set(code2_ancestors)

Note also that your for loops can actually end up with duplicates of the common ancestors, which the set version avoids.
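To make the duplicate issue concrete, here is a minimal sketch using plain lists of strings as stand-ins for the ancestor lists. The values are hypothetical and the duplicate entry is there on purpose; note that a set also discards the original ordering.

```python
# Hypothetical stand-ins for the two ancestor lists (plain strings, not tree nodes).
code1_ancestors = ["root", "A", "A"]  # contains a duplicate on purpose
code2_ancestors = ["root", "A"]

# Nested-loop version from the question: one comparison per pair of elements.
loop_result = []
for a1 in code1_ancestors:
    for a2 in code2_ancestors:
        if a1 == a2:
            loop_result.append(a1)

# Set version: each common element appears exactly once.
set_result = set(code1_ancestors) & set(code2_ancestors)

print(loop_result)
print(sorted(set_result))
```

The loop version keeps the repeated entry, while the set intersection does not. The elements must be hashable for the set version to work.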

Or this:

code_descendants=[]
for gd in ca.get_descendants():
    code_descendants.append(gd.name)

can be written as a list comprehension:

code_descendants = [gd.name for gd in ca.get_descendants()]

Or this one:

frequency= 0
for k in code_descendants:
        frequency= frequency + code_count.Count[k]

can be turned into:

frequency = code_count.loc[code_descendants, "Count"].sum()
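As a rough check that the two forms agree, the sketch below builds a small code_count table from the sample rows in the question. How code_count is really built is not shown in the post, so the groupby step here is an assumption, as is the list of descendant codes.

```python
import pandas as pd

# Sample rows copied from the CSV excerpt in the question.
data = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3, 3, 3],
    "Code": ["A1...", "A2...", "A....", "D....", "A1...", "D....", "D1...", "D2..."],
    "Count": [6, 5, 4, 1, 2, 5, 3, 5],
})

# Assumption: code_count holds the total Count per code, indexed by Code.
code_count = data.groupby("Code")[["Count"]].sum()

# Hypothetical descendant list for the "A...." node, including the node itself.
code_descendants = ["A1...", "A2...", "A...."]

# Loop version from the question.
loop_frequency = 0
for k in code_descendants:
    loop_frequency = loop_frequency + code_count.Count[k]

# Vectorised version: select all rows at once, then sum.
vector_frequency = code_count.loc[code_descendants, "Count"].sum()

print(loop_frequency, vector_frequency)
```

Both variables end up with the same total; the vectorised form does the lookup and the sum in a single pandas call instead of one Python-level iteration per code.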

Basically, try to avoid doing things iteratively (i.e. with for loops) and work instead with operations applied to whole NumPy arrays, which are the underlying structure of pandas data frames.
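In the same spirit of avoiding repeated work, the student-by-student double loop in the question calls relationship for the same pair of codes many times. One possible approach, sketched below, is to cache the score per pair of codes. The relationship function here is a cheap hypothetical stand-in, not the tree-based one from the question, and the cache key assumes the score is symmetric in its two arguments.

```python
from itertools import product

calls = {"n": 0}  # counts how often the expensive function really runs


def relationship(code1, code2):
    """Hypothetical stand-in for the expensive tree-based score."""
    calls["n"] += 1
    return len(set(code1) & set(code2))


_cache = {}


def cached_relationship(code1, code2):
    # Assumption: relationship(a, b) == relationship(b, a), so order the key.
    key = (code1, code2) if code1 <= code2 else (code2, code1)
    if key not in _cache:
        _cache[key] = relationship(*key)
    return _cache[key]


# Codes per student, taken from the CSV excerpt in the question.
student_codes = {
    1: ["A1...", "A2..."],
    2: ["A....", "D....", "A1..."],
    3: ["D....", "D1...", "D2..."],
}

scores = {}
lookups = 0
for s1, s2 in product(student_codes, repeat=2):
    pair_scores = []
    for a in student_codes[s1]:
        for b in student_codes[s2]:
            lookups += 1
            pair_scores.append(cached_relationship(a, b))
    scores[(s1, s2)] = max(pair_scores)

print(lookups, calls["n"])
```

On this toy input the loops perform 64 lookups but the stand-in function runs only 21 times, once per distinct unordered pair of codes. With 50,000 rows and many repeated codes, the saving could be considerably larger, although that depends on the real data.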

I don't see a declaration of a Tree class, yet one is referenced in the first two lines of relationship().

You should be getting a "NameError: name 't' is not defined".
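One way such an error could arise, assuming the tree is built inside a function rather than at module level, is shown in the sketch below. A dictionary stands in for the ete2 Tree, and all function names are hypothetical.

```python
def build_tree():
    # t is local to this function and disappears when it returns.
    t = {"name": "....."}  # stand-in for the ete2 Tree object
    return t


def relationship_broken(code):
    # Looks up a module-level name t, which was never defined.
    return t["name"]


def relationship_fixed(code, tree):
    # Receives the tree explicitly instead of relying on a global.
    return tree["name"]


build_tree()  # result discarded, so nothing named t exists afterwards

try:
    relationship_broken("A1...")
    error_message = None
except NameError as exc:
    error_message = str(exc)

tree = build_tree()
root_name = relationship_fixed("A1...", tree)

print(error_message)
print(root_name)
```

Keeping the returned tree in a variable and passing it to the function as an argument avoids the dependence on a global name.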
