Python script takes too long to run

Question

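The script essentially measures the relationships between students based on the books (codes) they borrowed. To do that, I built a tree of the different book types using the ete2 package. Now I am trying to write code that takes data from the tree and a CSV file and performs some analysis through the relationship function. The CSV file contains more than 50,000 rows. The problem is that the code takes a very long time to run (about 7 days), while it uses only 10% to 20% of my computer's CPU and memory.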

Here is a sample of the CSV file I use:

ID Code    Count 
1    A1...   6
1    A2...   5
2    A....   4
2    D....   1
2    A1...   2
3    D....   5
3    D1...   3
3    D2...   5

Here is the code:

from __future__ import division  # __future__ imports must come first
from ete2 import Tree
import pandas as pd
import numpy as np
import math


data= pd.read_csv('data.csv', names=['ID','Code', 'Count'])
codes_list= list (set(data['Code']))
total_codes= data.shape[0]
students_list= list (set(data['ID']))



####################################


# generate the tree
t = Tree (".....;", format =0)
for i in codes_list:
    if '....' in i:
        node = t.search_nodes(name = '.....')
        node[0].add_child(name= i)
for i in codes_list:
    if '...' in i and '....' not in i:
        if i[0]+'....' in codes_list:
            node = t.search_nodes(name = i[0]+'....')
            node[0].add_child(name= i)
        else:
            node = t.search_nodes(name = '.....')
            node[0].add_child(name= i)

# save the tree in a file (file_path is not defined in this snippet)
t.write(outfile=file_path + "Codes_tree.nh", format=3)
print t.get_ascii(show_internal=True)

####################################

def relationship(code1,code2):

    code1_ancestors = t.search_nodes(name=code1)[0].get_ancestors()
    code2_ancestors = t.search_nodes(name=code2)[0].get_ancestors()
    common_ancestors = []
    for a1 in code1_ancestors:
        for a2 in code2_ancestors:
            if a1==a2:
                common_ancestors.append(a1)
    IC_values = []
    for ca in common_ancestors:
        code_descendants=[]
        for gd in ca.get_descendants():
            code_descendants.append(gd.name)
        code_descendants.append(ca.name)  # include the ancestor itself, by name
        frequency= 0
        for k in code_descendants:
                frequency= frequency + code_count.Count[k]

        IC = - (math.log (frequency / float (total_codes)))
        IC_values.append (IC)

    IC_max= max(IC_values)
    return IC_max

##################

relationship_matrix = pd.DataFrame(index=students_list, columns=students_list)
for student in students_list:
    p1 = list(data.Code[data.ID == student])
    for student1 in students_list:
        p2 = list(data.Code[data.ID == student1])
        student_l = []
        for l in p1:
            for m in p2:
                student_l.append(relationship(l, m))

        max_score = np.max(np.array(student_l).astype(float))
        relationship_matrix.loc[student, student1] = max_score

print relationship_matrix

Answer 1

There are a few "optimizations" you can make. Here are some examples I could spot quickly (assuming code1_ancestors, code2_ancestors, etc. are lists or list-like):

common_ancestors = []
for a1 in code1_ancestors:
    for a2 in code2_ancestors:
        if a1==a2:
            common_ancestors.append(a1)

can be done faster with:

set(code1_ancestors)&set(code2_ancestors)

Also note that your for loop can actually end up with duplicate entries for common ancestors.
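As a small illustration of both points, the sketch below uses plain strings as stand-ins for the ete2 node objects. The list contents are made up for the example, including the deliberately repeated entry. It compares the nested loops with the set intersection.

```python
# Hypothetical ancestor lists; plain strings stand in for ete2 nodes.
code1_ancestors = ["A....", ".....", "....."]  # repeated entry is deliberate
code2_ancestors = ["A....", "....."]

# Nested loops: every pair is compared, and repeats are kept.
common_loop = []
for a1 in code1_ancestors:
    for a2 in code2_ancestors:
        if a1 == a2:
            common_loop.append(a1)

# Set intersection: hash-based membership tests, no duplicates.
common_set = set(code1_ancestors) & set(code2_ancestors)
```

The result is a set, so it is unordered. That should not matter in relationship(), which only takes the maximum IC over the common ancestors.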

Or this:

code_descendants=[]
for gd in ca.get_descendants():
    code_descendants.append(gd.name)

can be improved with:

code_descendants = [gd.name for gd in ca.get_descendants()]

Or this as well:

frequency= 0
for k in code_descendants:
        frequency= frequency + code_count.Count[k]

can become:

frequency = code_count.loc[code_descendants, "Count"].sum()
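code_count is not defined in the question, so the sketch below assumes it is a DataFrame holding the total Count per Code, indexed by Code. It is built here with groupby from the sample rows shown in the question. Under that assumption, the vectorized lookup should give the same total as the loop.

```python
import pandas as pd

# Sample rows copied from the question's CSV excerpt.
data = pd.DataFrame({
    "ID":    [1, 1, 2, 2, 2, 3, 3, 3],
    "Code":  ["A1...", "A2...", "A....", "D....", "A1...", "D....", "D1...", "D2..."],
    "Count": [6, 5, 4, 1, 2, 5, 3, 5],
})

# Assumed shape of code_count: one row per code, indexed by the code string.
code_count = data.groupby("Code")[["Count"]].sum()

# Hypothetical list of names: the node "A...." plus its two children.
code_descendants = ["A1...", "A2...", "A...."]

# Loop version from the question.
frequency_loop = 0
for k in code_descendants:
    frequency_loop = frequency_loop + code_count.Count[k]

# Vectorized version: select all rows at once, then sum the column.
frequency_vec = code_count.loc[code_descendants, "Count"].sum()
```

Both versions raise a KeyError if a name in code_descendants is missing from the index, so every node name passed in needs a row in code_count.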

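Basically, try to avoid doing things iteratively, i.e. with for loops, and try to use operations that work on whole numpy arrays (the structure underlying pandas DataFrames).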

Answer 2

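I don't see where the Tree instance is declared, yet it is referenced in the first two lines of relationship().

You should be getting a "NameError: name 't' is not defined".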

2 answers

Answer 1
0 2016-02-26 19:20:38

Answer 2
0 2016-02-26 19:55:37
