
Python Code Speed Up

My code should compare two vectors saved as dictionaries (two pickle files) and save the result into a pickle file as well. This works, but very slowly: a single comparison result takes about 7:20 minutes. Because I have a lot of videos (exactly 2033), the program would run for about 10 days. That is too long. How can I speed up my code for Python 2.7?

import math
import csv
import pickle
from itertools import izip

global_ddc_file = 'E:/global_ddc.p'
io = 'E:/AV-Datensatz'
v_source = ''

def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], izip(v1, v2))) # izip('ABCD', 'xy') --> Ax By

def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    if (len1 * len2) != 0:   # the old <> operator is deprecated; use !=
        out = prod / (len1 * len2)
    else:
        out = 0
    return out

def findSource(v):
    # str.lstrip()/rstrip() remove a *set* of characters, not a prefix/suffix,
    # so cut the known URI prefix off by slicing instead
    prefix = "<http://av.tib.eu/resource/video"
    v_id = "/" + v[0][len(prefix):].rstrip(">")
    v_source = io + v_id
    v_file = v_source + '/vector.p'
    return [v_id, v_source, v_file]

def getVector(v, vectorCol):
    with open(v, 'rb') as f:
        try:
            vector_v = pickle.load(f)
        except (IOError, pickle.UnpicklingError):
            # a bare except here would leave vector_v undefined and crash below
            print 'file could not be loaded:', v
            return []
        # column vectorCol of each entry holds the tf-idf weight
        tf_idf = [vec[1][vectorCol] for vec in vector_v]
    return tf_idf

def compareVectors(v1, v2, vectorCol):
    v1_source = findSource(v1)
    v2_source = findSource(v2)
    V1 = getVector(v1_source[2], vectorCol)
    V2 = getVector(v2_source[2], vectorCol)
    sim = [v1_source[0], v2_source[0], cosine_measure(V1, V2)]
    return sim

#with open('videos_av_portal_cc_3.0_nur2bspStanford.csv', 'rb') as dataIn:
with open('videos_av_portal_cc_3.0_vollstaendig.csv', 'rb') as dataIn:
#with open('videos_av_portal_cc_3.0.csv', 'rb') as dataIn:
    # the with-statement closes the file itself; no try/finally needed
    reader = csv.reader(dataIn)
    v_source = [findSource(row) for row in reader]

# compare every video against every other one (O(n^2) comparisons)
for one in v_source:
    print one[1]
    compVec = []
    for another in v_source:
        if one != another:
            compVec.append(compareVectors(one, another, 3))
    # sort by cosine similarity, most similar first
    compVec_sort = sorted(compVec, key=lambda cosim: cosim[2], reverse=True)

    # save a comparison file for each video
    with open(one[1] + '/compare.p', 'wb') as f:
        pickle.dump(compVec_sort, f)
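Two things make the loop above slow: every vector pickle is re-read from disk once per pairing, and every pair is compared twice (once in each direction, since cosine similarity is symmetric). A minimal sketch of the fix, using toy in-memory vectors in place of the pickle files:

```python
import math

def cosine(v1, v2):
    # same cosine measure as in the question, guarding against zero vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    len1 = math.sqrt(sum(a * a for a in v1))
    len2 = math.sqrt(sum(b * b for b in v2))
    return dot / (len1 * len2) if len1 * len2 != 0 else 0

# load every vector ONCE up front (toy data standing in for the pickles)
vectors = {
    'v1': [1.0, 2.0, 3.0],
    'v2': [2.0, 4.0, 6.0],
    'v3': [0.0, 0.0, 0.0],
}

ids = sorted(vectors)
results = {}
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        sim = cosine(vectors[a], vectors[b])
        # cosine is symmetric, so one computation serves both directions
        results[(a, b)] = results[(b, a)] = sim

print(results[('v1', 'v2')])  # parallel vectors -> 1.0
```

This alone roughly halves the number of comparisons and removes all repeated pickle loads; the per-video result files can still be written afterwards by filtering `results` on the first id.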

Split the work into stages:

1. Load the dictionaries into vectors.
2. Compare two dictionaries using multiprocessing (see the multiprocessing examples in the docs).
3. Launch as many processes simultaneously as memory availability allows, end each process after 8 minutes, then update the result dictionary.
4. Relaunch the processes on the next set of data, following step 3, and continue until the whole dictionary has been processed.

This should reduce the total turnaround time. Let me know if you need code.
