Is there a faster way to train an IsolationForest on each row of a pandas DataFrame?

Question

I have a pandas DataFrame that contains the transaction counts of 2000 terminals over 30 days (the columns are the days of the month). The DataFrame looks like this:

trx.head()
    TerminalID 8881 8882    8883    8884    8885    8886    ... 
0   11546   0.0 0.0 0.0 0.0 0.0 0.0 ... 
1   200002  0.0 0.0 0.0 0.0 0.0 0.0 ... 
2   200512  1.0 0.0 0.0 1.0 1.0 0.0 ...
3   202630  3.0 1.0 1.0 0.0 1.0 1.0 ...
4   207000  2.0 4.0 1.0 6.0 3.0 7.0 ...

I want to use IsolationForest for anomaly detection on each row of my data.

First I convert each row to a new DataFrame and fit the model on it, one row at a time, and add each result to a list:

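import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# outliers_fraction is assumed to be defined earlier (its value is not shown)

def find_anomaly(trx1, outliers_fraction):
    # scale the (day, count) columns of one terminal
    scaler = StandardScaler()
    np_scaled = scaler.fit_transform(trx1)
    data = pd.DataFrame(np_scaled)
    # train isolation forest
    model = IsolationForest(contamination=outliers_fraction)
    model.fit(data)
    # predict() returns 1 for inliers and -1 for outliers
    trx1['anomaly'] = pd.Series(model.predict(data))
    return trx1

# This for loop is slow
list_terminal_trx = []
for i in range(len(trx)):  # every row, including the last one
    # turn row i (without TerminalID) into a two-column (day, count) frame
    trx1 = trx.iloc[i, 1:].reset_index()
    trx1.columns = ['day', 'count']
    trx1['day'] = trx1['day'].astype(float)
    list_terminal_trx.append(find_anomaly(trx1, outliers_fraction))
    print('Learning for record', i)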

The code above works, but it is slow. Is there a better way?

Edit 1: Thanks to @AT_asks's advice I set n_jobs=-1, and it is faster now. But is there any alternative to my for loop?

Edit 2: With some modifications I used apply(), as @AT_asks suggested, but saw no real performance difference: the for version takes 3:29:00 and the apply version takes 3:25:28.

Edit 3: Using iterrows() instead of the for loop gives the same result: 3min 16s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Answer 1

You might get some improvement if you add this parameter:

model =  IsolationForest(contamination=outliers_fraction, n_jobs=-1)
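
Note that n_jobs=-1 parallelises the trees inside a single forest. With only 30 samples per terminal each fit is small, so another option is to keep every forest single-threaded and spread the 2000 terminals across worker processes instead. The sketch below assumes joblib is available (it is installed alongside scikit-learn). The names fit_one_terminal, fit_all_terminals, and the synthetic demo frame are hypothetical and not part of the original code, and the scaler is left out on the assumption that per-feature standardisation does not change how an isolation forest splits the data. Whether this is actually faster on a given machine would need to be measured.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.ensemble import IsolationForest


def fit_one_terminal(days, counts, contamination):
    """Fit one small forest on a terminal's (day, count) pairs; return -1/1 labels."""
    X = np.column_stack([days, counts])
    # n_jobs=1: the parallelism lives in the outer loop, not inside each forest
    model = IsolationForest(contamination=contamination, n_jobs=1, random_state=0)
    return model.fit_predict(X)


def fit_all_terminals(trx, contamination, n_jobs=-1):
    """Run fit_one_terminal for every row of trx, spread across worker processes."""
    days = trx.columns[1:].astype(float).to_numpy()
    counts = trx.iloc[:, 1:].to_numpy(dtype=float)
    labels = Parallel(n_jobs=n_jobs)(
        delayed(fit_one_terminal)(days, row, contamination) for row in counts
    )
    # one row of labels per terminal, one column per day
    return pd.DataFrame(
        np.vstack(labels), index=trx["TerminalID"], columns=trx.columns[1:]
    )


# Synthetic stand-in for trx: 8 terminals, 30 day columns (made-up data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.poisson(3, size=(8, 30)).astype(float), columns=range(8881, 8911)
)
demo.insert(0, "TerminalID", range(1000, 1008))

labels = fit_all_terminals(demo, contamination=0.1, n_jobs=2)
print(labels.shape)
```

The result is a single DataFrame of labels (1 for inliers, -1 for outliers) rather than a list of per-terminal frames, which also avoids building 2000 small DataFrames.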

Also, we could try this:

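import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Do not create an instance every time
scaler = StandardScaler()

def find_anomaly(row, outliers_fraction):
    # apply(axis=1) passes one row as a Series: drop TerminalID and
    # reshape the rest into a two-column (day, count) frame
    trx1 = row.iloc[1:].reset_index()
    trx1.columns = ['day', 'count']
    trx1 = trx1.astype(float)
    np_scaled = scaler.fit_transform(trx1)
    data = pd.DataFrame(np_scaled)
    # train isolation forest
    model = IsolationForest(contamination=outliers_fraction, n_jobs=-1)
    model.fit(data)
    trx1['anomaly'] = pd.Series(model.predict(data))
    return trx1

# not loop but apply
list_terminal_trx = trx.apply(lambda x: find_anomaly(x, outliers_fraction), axis=1).values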
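
apply(axis=1) still calls find_anomaly once per row in Python, much like the for loop, which would be consistent with the near-identical timings reported in the question. A rough way to check where the time goes is to time the row reshaping and the model fit separately. This is only a sketch on synthetic data: reshape, timed, and the Poisson-distributed row are made up for illustration.

```python
import time

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# One synthetic terminal: 30 daily counts indexed by day number (made-up data)
rng = np.random.default_rng(0)
row = pd.Series(rng.poisson(3, size=30).astype(float), index=range(8881, 8911))


def reshape(row):
    """Turn one row into the two-column (day, count) frame used in the question."""
    trx1 = row.reset_index()
    trx1.columns = ["day", "count"]
    return trx1.astype(float)


def fit_predict(frame):
    """Fit a fresh forest on one terminal and return its -1/1 labels."""
    return IsolationForest(contamination=0.1).fit_predict(frame)


def timed(func, arg, repeats=5):
    """Return (average seconds per call, result of the last call)."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = func(arg)
    return (time.perf_counter() - start) / repeats, result


reshape_s, trx1 = timed(reshape, row)
fit_s, labels = timed(fit_predict, trx1)
print(f"reshape: {reshape_s * 1e3:.2f} ms, fit + predict: {fit_s * 1e3:.2f} ms")
```

If the fit dominates, as seems likely, swapping the loop construct (for, apply, iterrows) will not help much, and the gains have to come from the fits themselves.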

Solution 1
1 2019-05-23 10:43:03