Is there a way to speed up this pandas function?

I know that vectorized functions are the preferred way to write code for speed, but I can't figure out a way to do what this function does without loops. The way I have written this function results in an extremely slow completion time. (Passing two dataframes with 100 columns and 2000 rows as arguments, this function takes 100 seconds to complete. I was hoping for something more like 1 second.)

def gen_fuzz_logic_signal(longp, shortp):
    # Input dataframes should have 0, -1, or 1 value
    flogic_signal = pd.DataFrame(index = longp.index, columns = longp.columns)
    for sym in longp.columns:
        print sym
        prev_enter = 0
        for inum in range(0, len(longp.index)):
            cur_val = np.nan
            if longp.ix[inum, sym] == 0  and prev_enter == +1:
                cur_val = 0.5
            if shortp.ix[inum, sym] == 0 and prev_enter == -1:
                cur_val = -0.5
            if longp.ix[inum, sym] == 1 and shortp.ix[inum, sym] == -1:
                if longp.ix[inum - 1, sym] != 1:
                    cur_val = 1
                    prev_enter = 1
                elif shortp.ix[inum - 1, sym] != -1:
                    cur_val = -1
                    prev_enter = -1
                else:
                    cur_val = prev_enter
            else:
                if longp.ix[inum, sym] == 1:
                    cur_val = 1
                    prev_enter = 1
                if shortp.ix[inum, sym] == -1:
                    cur_val = -1
                    prev_enter = -1
            flogic_signal.ix[inum, sym] = cur_val
    return flogic_signal

The inputs to the function are simply two dataframes with values of either 1, -1, or 0. I would really appreciate it if anyone had ideas for how to vectorize this or speed it up. I tried replacing the ".ix[inum, sym]" with "[sym][inum]" but that's even slower.

           GOOG longp GOOG shortp GOOG func result
2011-07-28          0          -1               -1
2011-07-29          0          -1               -1
2011-08-01          0          -1               -1
2011-08-02          0          -1               -1
2011-08-03          0          -1               -1
2011-08-04          0          -1               -1
2011-08-05          0          -1               -1
2011-08-08          0           0             -0.5
2011-08-09          0           0             -0.5
2011-08-10          0           0             -0.5
2011-08-11          0           0             -0.5
2011-08-12          1           0                1
2011-08-15          1           0                1
2011-08-16          1           0                1
2011-08-17          1           0                1
2011-08-18          1           0                1
2011-08-19          1           0                1
2011-08-22          1           0                1
2011-08-23          1           0                1
2011-08-24          1           0                1
2011-08-25          1           0                1
2011-08-26          1           0                1
2011-08-29          1           0                1
2011-08-30          1           0                1
2011-08-31          1           0                1
2011-09-01          1           0                1
2011-09-02          1           0                1
2011-09-06          1           0                1
2011-09-07          1           0                1
2011-09-08          1           0                1
2011-09-09          1           0                1
2011-09-12          1           0                1
2011-09-13          1           0                1
2011-09-14          1           0                1
2011-09-15          1           0                1
2011-09-16          1           0                1
2011-09-19          1           0                1
2011-09-20          1           0                1
2011-09-21          1           0                1
2011-09-22          1           0                1
2011-09-23          1           0                1
2011-09-26          1           0                1
2011-09-27          1           0                1
2011-09-28          1           0                1
2011-09-29          0           0              0.5
2011-09-30          0          -1               -1
2011-10-03          0          -1               -1
2011-10-04          0          -1               -1
2011-10-05          0          -1               -1
2011-10-06          0          -1               -1
2011-10-07          0          -1               -1
2011-10-10          0          -1               -1
2011-10-11          0          -1               -1
2011-10-12          0          -1               -1
2011-10-13          0          -1               -1
2011-10-14          0          -1               -1
2011-10-17          0          -1               -1
2011-10-18          0          -1               -1
2011-10-19          0          -1               -1
2011-10-20          0          -1               -1


           IBM longp IBM shortp IBM func result
2012-05-01         1         -1               1
2012-05-02         1         -1               1
2012-05-03         1         -1               1
2012-05-04         1         -1               1
2012-05-07         1         -1               1
2012-05-08         1          0               1
2012-05-09         1          0               1
2012-05-10         1          0               1
2012-05-11         1          0               1
2012-05-14         1          0               1
2012-05-15         1          0               1
2012-05-16         0         -1              -1
2012-05-17         0         -1              -1
2012-05-18         0         -1              -1
2012-05-21         0         -1              -1
2012-05-22         0         -1              -1
2012-05-23         0         -1              -1
2012-05-24         0         -1              -1
2012-05-25         0         -1              -1
2012-05-29         0         -1              -1
2012-05-30         0         -1              -1
2012-05-31         0         -1              -1
2012-06-01         0         -1              -1
2012-06-04         0         -1              -1
2012-06-05         0         -1              -1
2012-06-06         0         -1              -1
2012-06-07         0         -1              -1
2012-06-08         1         -1               1
2012-06-11         1         -1               1
2012-06-12         1         -1               1
2012-06-13         1         -1               1
2012-06-14         1         -1               1
2012-06-15         1         -1               1
2012-06-18         1         -1               1
2012-06-19         1         -1               1
2012-06-20         1         -1               1
2012-06-21         1          0               1
2012-06-22         1          0               1
2012-06-25         1          0               1
2012-06-26         1          0               1
2012-06-27         1          0               1
2012-06-28         1          0               1
2012-06-29         1          0               1

EDIT:

I just reran some old code that used similar looping through a pandas DataFrame to set values. It used to take maybe 5 seconds, and now I see it's taking maybe 100x that. I'm wondering if this issue is due to something that changed in the more recent version of pandas. That's the only variable I can think of that's changed. See the code below. It takes 73 seconds to run on my computer using pandas 0.11. That seems very slow for a pretty basic function, even one that operates elementwise. If anyone has a chance, I'd be curious how long the code below takes on your computer and your version of pandas.

import time
import numpy as np
import pandas as pd
def timef(func, *args):
    start= time.clock()
    for i in range(2):
        func(*args)
    end= time.clock()
    time_complete = (end-start)/float(2)
    print time_complete

def tfunc(num_row, num_col):
    df = pd.DataFrame(index = np.arange(1,num_row), columns = np.arange(1,num_col))
    for col in df.columns:
        for inum in range(1, len(df.index)):
            df.ix[inum, col] = 0 #np.nan
    return df

timef(tfunc, 1000, 1000)  # This takes 73 seconds on a Core i5 M460 2.53 GHz Windows 7 laptop.
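For anyone rerunning this benchmark on a current setup, here is a minimal sketch rather than the original code: `time.clock` was removed in Python 3.8 and `.ix` was removed in pandas 1.0, so it uses `time.perf_counter` and `.iat` instead. The function names `fill_elementwise` and `fill_via_array` are made up for illustration, and the second one is only an assumed point of comparison (filling a plain NumPy array and wrapping it in a DataFrame once).

```python
import time

import numpy as np
import pandas as pd


def fill_elementwise(num_row, num_col):
    """Set every cell through the DataFrame scalar indexer, one at a time."""
    df = pd.DataFrame(np.nan, index=np.arange(num_row), columns=np.arange(num_col))
    for j in range(num_col):
        for i in range(num_row):
            df.iat[i, j] = 0.0  # positional scalar setter (stands in for the old .ix)
    return df


def fill_via_array(num_row, num_col):
    """Fill a plain NumPy array in the same loop order, then build the DataFrame once."""
    values = np.full((num_row, num_col), np.nan)
    for j in range(num_col):
        for i in range(num_row):
            values[i, j] = 0.0
    return pd.DataFrame(values, index=np.arange(num_row), columns=np.arange(num_col))


def timef(func, *args, repeats=2):
    """Return the mean wall-clock seconds per call of func(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    return (time.perf_counter() - start) / repeats
```

Calling `timef(fill_elementwise, 200, 50)` and `timef(fill_via_array, 200, 50)` gives the two timings to compare. The absolute numbers depend on the machine and the pandas version, so none are quoted here.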

EDIT 2 7-9-13 1:23pm:

I found a temporary solution! I changed the code to the version below. Essentially, I converted each column to an ndarray and assembled the new column in a Python list before inserting it back into a column of the new pandas DataFrame. Processing 50 columns of about 2000 rows with the old version above took 101 seconds. The version below takes only 0.19 seconds! Fast enough for me for now. I'm not sure why .ix is so slow. As I said above, I believe elementwise operations were much faster in earlier versions of pandas.

def gen_fuzz_logic_signal3(longp, shortp):
    # Input dataframes should have 0, -1, or 1 value
    flogic_signal = pd.DataFrame(index = longp.index, columns = longp.columns)
    for sym in longp.columns:
        coll = longp[sym].values
        cols = shortp[sym].values
        prev_enter = 0
        newcol = [None] * len(coll)
        for inum in range(1, len(coll)):
            cur_val = np.nan
            if coll[inum] == 0  and prev_enter == +1:
                cur_val = 0.5
            if cols[inum] == 0 and prev_enter == -1:
                cur_val = -0.5
            if coll[inum] == 1 and cols[inum] == -1:
                if coll[inum -1] != 1:
                    cur_val = 1
                    prev_enter = 1
                elif cols[inum-1] != -1:
                    cur_val = -1
                    prev_enter = -1
                else:
                    cur_val = prev_enter
            else:
                if coll[inum] == 1:
                    cur_val = 1
                    prev_enter = 1
                if cols[inum] == -1:
                    cur_val = -1
                    prev_enter = -1
            newcol[inum] = cur_val
        flogic_signal[sym] = newcol
    return flogic_signal
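The same idea can be taken one step further by keeping everything in NumPy until the very end. This is only a sketch under some assumptions: `fuzz_column` and `gen_fuzz_logic_signal_np` are hypothetical names, both inputs are assumed to have identical shape and column order, and row 0 is left as NaN (the loop in gen_fuzz_logic_signal3 above also starts at row 1, leaving None there). The per-row logic is copied from the function above, not vectorized.

```python
import numpy as np
import pandas as pd


def fuzz_column(coll, cols):
    """Run the state machine from gen_fuzz_logic_signal3 over one column.

    coll and cols are 1-D arrays holding the long and short signals.
    """
    out = np.full(len(coll), np.nan)
    prev_enter = 0
    for i in range(1, len(coll)):  # row 0 is skipped and stays NaN
        cur_val = np.nan
        if coll[i] == 0 and prev_enter == 1:
            cur_val = 0.5
        if cols[i] == 0 and prev_enter == -1:
            cur_val = -0.5
        if coll[i] == 1 and cols[i] == -1:
            # Both signals active: the one that just switched on wins.
            if coll[i - 1] != 1:
                cur_val = prev_enter = 1
            elif cols[i - 1] != -1:
                cur_val = prev_enter = -1
            else:
                cur_val = prev_enter
        else:
            if coll[i] == 1:
                cur_val = prev_enter = 1
            if cols[i] == -1:
                cur_val = prev_enter = -1
        out[i] = cur_val
    return out


def gen_fuzz_logic_signal_np(longp, shortp):
    """Apply fuzz_column to every column and build the DataFrame once."""
    long_arr = longp.to_numpy()
    short_arr = shortp.to_numpy()  # assumed: same shape and column order as longp
    out = np.empty(long_arr.shape, dtype=float)
    for j in range(long_arr.shape[1]):
        out[:, j] = fuzz_column(long_arr[:, j], short_arr[:, j])
    return pd.DataFrame(out, index=longp.index, columns=longp.columns)
```

Writing into one preallocated float array avoids both the per-cell DataFrame indexing and the object-dtype columns that an empty `pd.DataFrame(index=..., columns=...)` starts with. The inner loop is still pure Python, so any further speedup would have to come from replacing the loop itself.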

I believe the .ix implementation did change in 0.11 ( http://pandas.pydata.org/pandas-docs/stable/whatsnew.html ). Not sure if it's related.

One quick speedup I got on 0.10.1 came from changing tfunc as below to cache the column/series being updated:

def tfunc(num_row, num_col):
   df = pd.DataFrame(index = np.arange(1,num_row), columns = np.arange(1,num_col))
   for col in df.columns:
       sdf = df[col]
       for inum in range(1, len(df.index)):
           sdf.ix[inum] = 0 #np.nan
   return df

It went from ~80 to ~9 seconds on my machine.
