How can I utilize vectorization on my Pandas script for efficiency?

This is a continuation of my previous post, where I was looking for a faster and more efficient alternative to a standard Python loop that performs some summing and multiplication on the elements of each row.

Basically, I have two input files. One is a list of all genotype combinations for a group of SNPs, for example, for 3 SNPs:

    AA   CC   TT
    AT   CC   TT
    TT   CC   TT
    AA   CG   TT
    AT   CG   TT
    TT   CG   TT
    AA   GG   TT
    AT   GG   TT
    TT   GG   TT
    AA   CC   TA
    AT   CC   TA
    TT   CC   TA
    AA   CG   TA
    AT   CG   TA
    TT   CG   TA
    AA   GG   TA
    AT   GG   TA
    TT   GG   TA
    AA   CC   AA
    AT   CC   AA
    TT   CC   AA
    AA   CG   AA
    AT   CG   AA
    TT   CG   AA
    AA   GG   AA
    AT   GG   AA
    TT   GG   AA

The second is a table containing some information for each SNP, notably the log(OR) for a disease and the frequency of the risk allele:

SNP1             A       T       1.25    0.223143551314     0.97273 
SNP2             C       G       1.07    0.0676586484738    0.3     
SNP3             T       A       1.08    0.0769610411361    0.1136  

Below is my main code, in which I want to calculate a 'score' and a 'frequency' for each 'profile'. The score is the sum of the log(OR)s for each risk allele present in the profile, while the frequency is the product of the per-SNP genotype frequencies, assuming Hardy-Weinberg equilibrium:

import pandas as pd

numbers = pd.read_csv(table2, sep="\t", header=None)

combinations = pd.read_csv(table1, sep=" ", header=None)

def score_freq(line):
    score=0
    freq=1
    for j in range(len(line)):
        if line[j][1] != numbers.values[j][1]:   # homozygous for ref
            score+=0
            freq*=(float(1-float(numbers.values[j][6]))*float(1-float(numbers.values[j][6])))
        elif line[j][0] != numbers.values[j][1] and line[j][1] == numbers.values[j][1]: # heterozygous
            score+=(float(numbers.values[j][5]))
            freq*=(2*(float(1-float(numbers.values[j][6]))*float(numbers.values[j][6])))
        elif line[j][0] == numbers.values[j][1]:   # homozygous for risk
            score+=2*(float(numbers.values[j][5]))
            freq*=(float(numbers.values[j][6])*float(numbers.values[j][6]))

        if freq < 1e-05:   # threshold to stop loop in interest of efficiency 
            break

    return pd.Series([score, freq])

combinations[['score', 'freq']] = combinations.apply(lambda row: score_freq(row), axis=1)
#combinations[['score', 'freq']] = score_freq(combinations.values) # vectorization?

print(combinations)

I was referring to this site, which goes over the fastest ways to loop over a Pandas dataframe. I have been able to use the Pandas apply method, but I am not sure how to apply vectorization over the Pandas series. Beyond that, please suggest any other ways I can make my script more efficient. Thanks!

I would suggest using the NumPy library to make your Pandas script more efficient. The idea behind NumPy is that vectorized operations replace explicit for loops, so large amounts of data can be processed very efficiently. When working with NumPy, you basically convert your data into NumPy arrays. You can find the extensive documentation here. To answer your question, you can perform mathematical operations on NumPy arrays like this:

import numpy as np

a = np.array([1, 2, 3, 4])
a + 1    # adds 1 to every element in the array

a * 2    # multiplies each element in the array by 2

which is far more efficient than using for loops in pure Python.
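Applied to the script in the question, the same idea might look like the sketch below. It is only an illustration under some assumptions: the arrays `risk`, `log_or` and `p` are hypothetical names, filled in by hand from the three-SNP table, and I am assuming the risk allele is the second allele listed for each SNP. The sketch counts risk alleles per genotype (0, 1 or 2), takes a dot product with the log(OR)s for the score, and looks up the Hardy-Weinberg genotype frequency for each count before multiplying across SNPs.

```python
import numpy as np
import pandas as pd

# A few of the genotype combinations from the question, one column per SNP.
combinations = pd.DataFrame(
    [["AA", "CC", "TT"], ["AT", "CG", "TA"], ["TT", "GG", "AA"]]
)

# Assumed per-SNP values (hypothetical names; the risk allele is taken to be
# the second allele listed in the table, which is an assumption).
risk = np.array(["T", "G", "A"])
log_or = np.array([0.223143551314, 0.0676586484738, 0.0769610411361])
p = np.array([0.97273, 0.3, 0.1136])  # risk allele frequencies

# Split each two-letter genotype into its two alleles, column by column.
first = combinations.apply(lambda col: col.str[0]).to_numpy(dtype=str)
second = combinations.apply(lambda col: col.str[1]).to_numpy(dtype=str)

# Number of risk alleles (0, 1 or 2) per profile and SNP; `risk` broadcasts
# across the rows.
n_risk = (first == risk).astype(int) + (second == risk).astype(int)

# Score: sum over SNPs of (risk allele count * log(OR)).
score = n_risk @ log_or

# Hardy-Weinberg genotype frequencies for 0, 1 and 2 risk alleles,
# one column per SNP, then picked out by the observed counts.
hwe = np.stack([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
per_snp_freq = hwe[n_risk, np.arange(len(p))]
freq = per_snp_freq.prod(axis=1)

result = combinations.assign(score=score, freq=freq)
print(result)
```

Note that this computes the full score and frequency for every profile, whereas the loop in the question stops early once the frequency drops below 1e-05. If that cut-off matters, one option is to filter afterwards, for example `result[result["freq"] >= 1e-05]`.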

Hope this helps.
