Taking too long to run a Python script
The script basically measures the relationship between students based on the books they have borrowed (identified by code). I've built a tree for the different types of books using the ete2 package.
Now I'm trying to write a piece of code that takes the data from the tree and a csv file and does some data analysis through the function relationship. The csv file contains more than 50,000 rows.
The problem is that it takes too long to run the code (around 7 days), while it uses only 10 to 20% of my computer's CPU and memory.
Here is an example of the csv file I've used:
ID Code Count
1 A1... 6
1 A2... 5
2 A.... 4
2 D.... 1
2 A1... 2
3 D.... 5
3 D1... 3
3 D2... 5
Here is the code:
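from __future__ import division  # must come before every other import
import math

import numpy as np
import pandas as pd
from ete2 import Tree

data = pd.read_csv('data.csv', names=['ID', 'Code', 'Count'])
codes_list = list(set(data['Code']))
total_codes = data.shape[0]
students_list = list(set(data['ID']))
# Assumption: code_count is used further down but never defined in the post;
# here it is taken to be the total Count per Code.
code_count = data.groupby('Code')[['Count']].sum()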
####################################
# generate the tree
t = Tree(".....;", format=0)
for i in codes_list:
    if '....' in i:
        node = t.search_nodes(name='.....')
        node[0].add_child(name=i)
for i in codes_list:
    if '...' in i and '....' not in i:
        if i[0] + '....' in codes_list:
            node = t.search_nodes(name=i[0] + '....')
            node[0].add_child(name=i)
        else:
            node = t.search_nodes(name='.....')
            node[0].add_child(name=i)
# save the tree in a file (file_path, the output directory, is not defined in the post)
t.write(outfile=file_path + "Codes_tree.nh", format=3)
print(t.get_ascii(show_internal=True))
####################################
def relationship(code1, code2):
    code1_ancestors = t.search_nodes(name=code1)[0].get_ancestors()
    code2_ancestors = t.search_nodes(name=code2)[0].get_ancestors()
    common_ancestors = []
    for a1 in code1_ancestors:
        for a2 in code2_ancestors:
            if a1 == a2:
                common_ancestors.append(a1)
    IC_values = []
    for ca in common_ancestors:
        code_descendants = []
        for gd in ca.get_descendants():
            code_descendants.append(gd.name)
        code_descendants.append(ca.name)  # the node's name, not the node itself
        frequency = 0
        for k in code_descendants:
            frequency = frequency + code_count.Count[k]
        IC = -(math.log(frequency / float(total_codes)))
        IC_values.append(IC)
    IC_max = max(IC_values)
    return IC_max
##################
relationship_matrix = pd.DataFrame(index=students_list, columns=students_list)
for student in students_list:
    p1 = list(data.Code[data.ID == student])
    for student1 in students_list:
        p2 = list(data.Code[data.ID == student1])
        student_l = []
        for l in p1:
            for m in p2:
                student_l.append(relationship(l, m))
        max_score = np.max(np.array(student_l).astype(float))
        relationship_matrix.loc[student, student1] = max_score
print(relationship_matrix)
There are some "optimisations" you can do. Here are a few examples I can quickly spot (assuming code1_ancestors, code2_ancestors, etc. are lists or something equivalent):
common_ancestors = []
for a1 in code1_ancestors:
    for a2 in code2_ancestors:
        if a1 == a2:
            common_ancestors.append(a1)
can be made way faster by:
set(code1_ancestors)&set(code2_ancestors)
Note also that your for loops can actually end up with duplicates of the common ancestors.
Or this:
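To illustrate the point about the ancestor lists, here is a minimal sketch comparing the two approaches. It uses plain strings as stand-ins for the ancestor nodes (an assumption; with real tree node objects the set version additionally requires the nodes to be hashable), and the function names common_by_loops and common_by_sets are made up for this example:

```python
def common_by_loops(xs, ys):
    """Nested-loop version: compares every element of xs with every element of ys."""
    common = []
    for a in xs:
        for b in ys:
            if a == b:
                common.append(a)
    return common


def common_by_sets(xs, ys):
    """Set version: each list is hashed once, then intersected."""
    return set(xs) & set(ys)


# Hypothetical ancestor lists, written as code names instead of tree nodes.
code1_ancestors = ["A....", "....."]
code2_ancestors = ["D....", "....."]

loop_result = common_by_loops(code1_ancestors, code2_ancestors)
set_result = common_by_sets(code1_ancestors, code2_ancestors)
print(loop_result, set_result)
```

In this sketch the nested loops only produce duplicates when one of the input lists itself repeats an element; the set version never does, and it avoids comparing every element of one list with every element of the other.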
code_descendants = []
for gd in ca.get_descendants():
    code_descendants.append(gd.name)
can be improved by:
code_descendants = [gd.name for gd in ca.get_descendants()]
Or this also:
frequency = 0
for k in code_descendants:
    frequency = frequency + code_count.Count[k]
can be turned into:
frequency = code_count.loc[code_descendants, "Count"].sum()
Basically, try to avoid doing things iteratively, i.e. in for loops, and try to work with operations applied to whole numpy arrays (the underlying structure of pandas data frames).
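A related way to cut down on repeated work, sketched here under several assumptions: relationship() is called for every pair of codes of every pair of students, so the same code pairs are likely evaluated many times. The sketch below computes the frequency of each node's subtree once and memoises the pair results with functools.lru_cache from the standard library. This is a swapped-in technique, not something the question's code does. It is written for Python 3 and replaces the ete2 tree with a plain parent dictionary; PARENT, COUNT and TOTAL are hypothetical names with toy data, and TOTAL is the sum of the counts, which differs from the row count used as total_codes in the question.

```python
import math
from functools import lru_cache

# Hypothetical toy data: parent of each code (None marks the root) and the
# total borrow count per code. Neither structure appears in the original post.
PARENT = {
    ".....": None,
    "A....": ".....", "D....": ".....",
    "A1...": "A....", "A2...": "A....",
    "D1...": "D....", "D2...": "D....",
}
COUNT = {"A....": 4, "A1...": 8, "A2...": 5, "D....": 6, "D1...": 3, "D2...": 5}
TOTAL = sum(COUNT.values())  # assumption: normalise by the sum of counts


def ancestors(code):
    """Return the strict ancestors of a code, nearest first."""
    chain = []
    node = PARENT[code]
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def subtree_frequencies():
    """Add each code's count to itself and to all of its ancestors, once."""
    freq = {code: 0 for code in PARENT}
    for code, count in COUNT.items():
        freq[code] += count
        for anc in ancestors(code):
            freq[anc] += count
    return freq


FREQ = subtree_frequencies()
# Information content per node, computed a single time up front.
IC = {code: -math.log(f / TOTAL) for code, f in FREQ.items() if f > 0}


@lru_cache(maxsize=None)
def relationship(code1, code2):
    """Largest information content among the common ancestors of two codes."""
    common = set(ancestors(code1)) & set(ancestors(code2))
    return max(IC[c] for c in common)


print(relationship("A1...", "A2..."))
print(relationship("A1...", "A2..."))  # second call is served from the cache
```

Whether this helps in practice depends on how often code pairs repeat across students; relationship.cache_info() reports the hit count, which is one way to check.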
I don't see a declaration of a Tree class, yet one is referenced in the first two lines of relationship().
You should be getting a "NameError: name 't' is not defined".