How to optimize nested-loop code that processes pandas DataFrames

Question

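I'm new to optimization and need help improving the run time of this code. It does what I need, but it takes forever. Any suggestions for making it run faster?

Here is the code: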

def probabilistic_word_weighting(df, lookup):

    # instantiate new place holder for class weights for each text sequence in the df
    class_probabilities = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for index, row in lookup.iterrows():
        if row.word in df.words.split():
            class_proba_ = row.class_proba.strip('][').split(', ')
            class_proba_ = [float(i) for i in class_proba_]
            class_probabilities = [a + b for a, b in zip(class_probabilities, class_proba_)]

    return class_probabilities

The two input DataFrames look like this:

df

index                                    words
1                               i  havent  been  back 
2                                            but  its 
3                   they  used  to  get  more  closer 
4                                             no  way 
5       when  we  have  some  type  of  a  thing  for
6                and  she  had  gone  to  the  doctor 
7                                                suze 
8        the  only  time  the  parents  can  call  is
9               i  didnt  want  to  go  on  a  cruise 
10                            people  come  aint  got

lookup

index    word                               class_proba
6231    been    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
8965    havent  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
3270    derive  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
7817    a       [0.0, 0.0, 7.451379, 6.552, 0.0, 0.0, 0.0, 0.0]
3452    hello   [0.0, 0.0, 0.0, 0.0, 0.000155327, 0.0, 0.0, 0.0]
5112    they    [0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, 0.0]
1012    time    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
7468    some    [0.000193199, 0.0, 0.0, 0.000212947, 0.0, 0.0, 0.0, 0.0]
6428    people  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
5537    scuba   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]

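What it basically does is iterate through every row in lookup, each of which contains a word and its associated class weights. If the word is found in a text sequence in df.words, the class_proba for that lookup.word is added to the class_probabilities variable assigned to that sequence. In effect, every row of df is checked for each iteration over the lookup rows.

How can this be done faster?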

Answer 1

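IIUC, you are calling this function through df.apply, but you can do it like this instead. The idea is not to redo the parsing of a lookup row every time a matching word is found, but to do it once and reshape df so that vectorized operations can be used.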
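For reference, this is roughly the row-wise pattern I am assuming you have. The toy frames, the three-class weights and the n_classes parameter are mine and not from the question; the loop body is otherwise the same as yours.

```python
import pandas as pd

# Toy inputs modelled on the question's samples (3 classes instead of 8).
df = pd.DataFrame({"words": ["i  havent  been  back", "but  its", "no  way"]})
lookup = pd.DataFrame({
    "word": ["been", "havent", "a"],
    "class_proba": ["[0.0, 0.0, 1.5]", "[0.0, 0.0, 1.5]", "[0.0, 2.0, 0.0]"],
})

def probabilistic_word_weighting(row, lookup, n_classes=3):
    # n_classes is an illustrative parameter; the question hard-codes 8 zeros.
    totals = [0.0] * n_classes
    for _, lk in lookup.iterrows():
        if lk.word in row.words.split():
            # The stringified list is re-parsed on every match.
            proba = [float(x) for x in lk.class_proba.strip("][").split(", ")]
            totals = [a + b for a, b in zip(totals, proba)]
    return totals

# Assumed call: a full Python-level pass over lookup for every row of df.
baseline = df.apply(probabilistic_word_weighting, axis=1, args=(lookup,))
print(baseline.tolist())
```

The cost grows with len(df) * len(lookup), and all of it runs in the Python interpreter, which is why the steps below move the work into pandas operations.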

1: Reshape the words column of df with str.split, stack and to_frame to get a new row for each word:

s_df = df['words'].str.split(expand=True).stack().to_frame(name='split_word')
print (s_df.head(8))
    split_word
0 0          i
  1     havent
  2       been
  3       back
1 0        but
  1        its
2 0       they
  1       used
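If your pandas version has Series.explode (0.25 or later, as far as I know), the same long format can probably be built without the intermediate wide frame. This is only a sketch on toy data; the name s_alt is mine.

```python
import pandas as pd

df = pd.DataFrame({"words": ["i  havent  been  back", "but  its"]})

# str.split() with no arguments splits on runs of whitespace and yields lists;
# explode() then emits one row per list element and repeats the original index.
s_alt = df["words"].str.split().explode().rename("split_word").to_frame()
print(s_alt)
```

Here the index has a single level that repeats the original row label, so a later groupby(level=0) still groups by original row.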

2: Reshape lookup by calling set_index on the word column, then str.strip, str.split and astype, to get a DataFrame with the words as index and one class_proba value per column:

split_lookup = lookup.set_index('word')['class_proba'].str.strip('][')\
                     .str.split(', ', expand=True).astype(float)
print (split_lookup.head())
          0    1         2      3         4    5    6         7
word                                                           
been    0.0  0.0  0.000000  0.000  0.000000  0.0  0.0  5.278995
havent  0.0  0.0  0.000000  0.000  0.000000  0.0  0.0  5.278995
derive  0.0  0.0  0.000000  0.000  0.000000  0.0  0.0  5.278995
a       0.0  0.0  7.451379  6.552  0.000000  0.0  0.0  0.000000
hello   0.0  0.0  0.000000  0.000  0.000155  0.0  0.0  0.000000
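The strip/split chain assumes every string looks exactly like "[a, b, c]" with ", " as the separator. If the formatting in your real data is less regular, parsing each string with ast.literal_eval may be more forgiving. A sketch under that assumption, with toy data and a name (split_lookup_alt) of my own:

```python
import ast

import pandas as pd

lookup = pd.DataFrame({
    "word": ["been", "a"],
    # The second string deliberately uses irregular spacing.
    "class_proba": ["[0.0, 0.0, 5.27899487]", "[0.0,7.451379,  6.552]"],
})

# literal_eval turns each string into a real Python list of floats.
parsed = lookup["class_proba"].map(ast.literal_eval)

# One column per class, indexed by word, like split_lookup above.
split_lookup_alt = pd.DataFrame(parsed.tolist(), index=lookup["word"], dtype=float)
print(split_lookup_alt)
```

This runs Python code once per lookup row, so it is likely slower than the string methods, but it only happens once rather than once per row of df.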

3: merge the two, drop the column that is no longer needed, then groupby on level=0 (the original index of df) and sum:

df_proba = s_df.merge(split_lookup, how='left',
                      left_on='split_word', right_index=True)\
               .drop('split_word', axis=1)\
               .groupby(level=0).sum()
print (df_proba.head())
          0    1         2         3    4    5    6         7
0  0.000000  0.0  0.000000  0.000000  0.0  0.0  0.0  10.55799
1  0.000000  0.0  0.000000  0.000000  0.0  0.0  0.0   0.00000
2  0.000000  0.0  0.000323  0.000000  0.0  0.0  0.0   0.00000
3  0.000000  0.0  0.000000  0.000000  0.0  0.0  0.0   0.00000
4  0.000193  0.0  7.451379  6.552213  0.0  0.0  0.0   0.00000
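One difference from your loop that is worth checking: the loop tests membership (row.word in ...), so a word that appears twice in a sequence contributes once, while the merge adds its weights once per occurrence. If the loop's behaviour is what you want, dropping repeated words per row before the merge should restore it. A sketch with toy data; split_lookup here is a two-class stand-in and sum_weights is a helper name of mine:

```python
import pandas as pd

df = pd.DataFrame({"words": ["the  only  time  the  parents", "no  way"]})
# Toy stand-in for split_lookup from step 2: two classes, one known word.
split_lookup = pd.DataFrame({0: [1.0], 1: [0.5]},
                            index=pd.Index(["the"], name="word"))

s_df = df["words"].str.split(expand=True).stack().to_frame(name="split_word")

def sum_weights(words_frame):
    # Step 3 unchanged: merge, drop the key column, sum per original row.
    return (words_frame.merge(split_lookup, how="left",
                              left_on="split_word", right_index=True)
                       .drop(columns="split_word")
                       .groupby(level=0).sum())

# Keep only the first occurrence of each word within an original row.
row_id = s_df.index.get_level_values(0)
is_repeat = s_df.assign(row=row_id).duplicated(["row", "split_word"])
s_unique = s_df[~is_repeat]

per_occurrence = sum_weights(s_df)       # "the" counted twice in row 0
per_unique_word = sum_weights(s_unique)  # "the" counted once, like the loop
print(per_occurrence)
print(per_unique_word)
```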

4: Finally, convert to a list of lists with to_numpy and tolist and assign it back to the original df:

df['class_proba'] = df_proba.to_numpy().tolist()
print (df.head())
                                           words  \
0                          i  havent  been  back   
1                                       but  its   
2              they  used  to  get  more  closer   
3                                        no  way   
4  when  we  have  some  type  of  a  thing  for   

                                         class_proba  
0   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.55798974]  
1           [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
2  [0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, ...  
3           [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
4  [0.000193199, 0.0, 7.451379, 6.552212946999999...
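Putting the four steps together, a wrapper could look something like the sketch below. The function name add_class_proba and the toy data are mine. The reindex call is also my addition, not part of the steps above: it guards against rows that produce no words at all (an empty string, say), which could otherwise be missing from the grouped result and make the final assignment fail on length.

```python
import pandas as pd

def add_class_proba(df, lookup):
    """Hypothetical wrapper around steps 1-4; returns a copy of df."""
    # Step 1: one row per word, indexed by (original row, word position).
    s_df = df["words"].str.split(expand=True).stack().to_frame(name="split_word")
    # Step 2: parse the stringified lists into one float column per class.
    split_lookup = (lookup.set_index("word")["class_proba"]
                          .str.strip("][")
                          .str.split(", ", expand=True)
                          .astype(float))
    # Step 3: attach weights to each word, then sum per original row.
    df_proba = (s_df.merge(split_lookup, how="left",
                           left_on="split_word", right_index=True)
                    .drop(columns="split_word")
                    .groupby(level=0).sum())
    # Added guard (assumes df.index is unique): rows without words get zeros.
    df_proba = df_proba.reindex(df.index, fill_value=0.0)
    # Step 4: one list of class weights per row.
    weights = pd.Series(df_proba.to_numpy().tolist(), index=df.index)
    return df.assign(class_proba=weights)

# Toy data with 3 classes; the last row is empty on purpose.
df = pd.DataFrame({"words": ["i  havent  been  back", "but  its", "a  thing", ""]})
lookup = pd.DataFrame({
    "word": ["been", "havent", "a"],
    "class_proba": ["[0.0, 0.0, 1.5]", "[0.0, 0.0, 1.5]", "[0.0, 2.0, 0.25]"],
})

result = add_class_proba(df, lookup)
print(result)
```

Returning a copy through assign rather than mutating df is just a design preference; assigning the column in place as in step 4 works as well.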

Solution 1
3 Accepted 2020-04-29 01:13:40
