I have a Pandas DataFrame that contains the transaction counts of 2000 terminals over 30 days (the columns are the days of the month). The DataFrame looks like this:
trx.head()
TerminalID 8881 8882 8883 8884 8885 8886 ...
0 11546 0.0 0.0 0.0 0.0 0.0 0.0 ...
1 200002 0.0 0.0 0.0 0.0 0.0 0.0 ...
2 200512 1.0 0.0 0.0 1.0 1.0 0.0 ...
3 202630 3.0 1.0 1.0 0.0 1.0 1.0 ...
4 207000 2.0 4.0 1.0 6.0 3.0 7.0 ...
I want to use IsolationForest for anomaly detection on each row of my data.
First I convert each row to a new DataFrame and fit a model on it, one row at a time, appending each result to a list:
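import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def find_anomaly(trx1, outliers_fraction):
    # standardise the day and count columns
    scaler = StandardScaler()
    np_scaled = scaler.fit_transform(trx1)
    data = pd.DataFrame(np_scaled)
    # train isolation forest
    model = IsolationForest(contamination=outliers_fraction)
    model.fit(data)
    # predict() returns 1 for inliers and -1 for outliers
    trx1['anomaly'] = pd.Series(model.predict(data))
    return trx1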
# This loop is slow
list_terminal_trx = []
for i in range(len(trx)):
    # turn one terminal's row into a two-column (day, count) frame
    trx1 = trx.iloc[i, 1:].reset_index()
    trx1.columns = ['day', 'count']
    trx1['day'] = trx1['day'].astype(float)
    list_terminal_trx.append(find_anomaly(trx1, outliers_fraction))
    print('Learning for record', i)
The code above works, but it is slow. Is there a better way?
Edit 1: Thanks to @AT_asks's advice, I set n_jobs=-1 and it is now faster. But is there an alternative to my for loop?
Edit 2: With some modifications I used apply(), as @AT_asks suggested, but saw no real performance difference: the for version takes 3:29:00 and the apply version takes 3:25:28.
Edit 3: Using iterrows() instead of the for loop gives the same result: 3min 16s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
You might get some improvement if you add this parameter:
model = IsolationForest(contamination=outliers_fraction, n_jobs=-1)
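Each terminal has only 30 rows, so it may be worth timing a single fit with and without n_jobs before relying on it. A rough sketch, assuming synthetic Poisson counts in place of a real terminal row and an arbitrary contamination of 0.1; time_fit is a helper name made up for this example:

```python
import time

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


def time_fit(data, n_jobs, contamination=0.1):
    """Fit one IsolationForest and return (elapsed seconds, predictions)."""
    model = IsolationForest(contamination=contamination, n_jobs=n_jobs,
                            random_state=0)
    start = time.perf_counter()
    model.fit(data)
    labels = model.predict(data)
    return time.perf_counter() - start, labels


# Synthetic stand-in for one terminal: 30 days of transaction counts (assumed data).
rng = np.random.default_rng(0)
one_terminal = pd.DataFrame({
    "day": np.arange(1, 31, dtype=float),
    "count": rng.poisson(3, size=30).astype(float),
})
scaled = StandardScaler().fit_transform(one_terminal)

serial_time, serial_labels = time_fit(scaled, n_jobs=None)
parallel_time, parallel_labels = time_fit(scaled, n_jobs=-1)
print(f"n_jobs=None: {serial_time:.3f}s, n_jobs=-1: {parallel_time:.3f}s")
```

The timings depend on the machine and on the scikit-learn version, so treat the printed numbers as a local measurement only.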
Also, we could try this.
# Do not create a new scaler instance on every call
scaler = StandardScaler()

def find_anomaly(trx1, outliers_fraction):
    np_scaled = scaler.fit_transform(trx1)
    data = pd.DataFrame(np_scaled)
    # train isolation forest
    model = IsolationForest(contamination=outliers_fraction, n_jobs=-1)
    model.fit(data)
    trx1['anomaly'] = pd.Series(model.predict(data))
    return trx1
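# use apply instead of an explicit loop; to_day_count (a helper name introduced here)
# reshapes one row into the two-column (day, count) frame that find_anomaly expects
def to_day_count(row):
    return row.iloc[1:].rename_axis('day').reset_index(name='count').astype(float)
list_terminal_trx = trx.apply(lambda x: find_anomaly(to_day_count(x), outliers_fraction), axis=1).values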
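Since the loop, apply() and iterrows() timings reported in the question are nearly identical, the time is probably spent fitting the models rather than iterating. Each terminal gets its own independent model, so another option is to spread the terminals across worker processes instead of parallelising inside each small forest. A minimal sketch, assuming joblib is available (scikit-learn depends on it); fit_one_terminal, the synthetic trx frame and outliers_fraction = 0.05 are illustrative assumptions, not taken from the question:

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


def fit_one_terminal(days, counts, outliers_fraction):
    """Fit a separate IsolationForest on one terminal's (day, count) pairs."""
    frame = pd.DataFrame({"day": days, "count": counts})
    scaled = StandardScaler().fit_transform(frame)
    model = IsolationForest(contamination=outliers_fraction, random_state=0)
    # fit_predict returns 1 for inliers and -1 for outliers
    frame["anomaly"] = model.fit_predict(scaled)
    return frame


# Synthetic stand-in for trx: 8 terminals x 30 days (the real frame has 2000 rows).
rng = np.random.default_rng(0)
day_labels = [str(8881 + d) for d in range(30)]
trx = pd.DataFrame(rng.poisson(2, size=(8, 30)).astype(float), columns=day_labels)
trx.insert(0, "TerminalID", np.arange(200000, 200008))

outliers_fraction = 0.05  # assumed value; the question does not state it
days = trx.columns[1:].astype(float).to_numpy()
counts = trx.iloc[:, 1:].to_numpy()

# One task per terminal; each worker process fits its own small forest.
list_terminal_trx = Parallel(n_jobs=2)(
    delayed(fit_one_terminal)(days, row, outliers_fraction) for row in counts
)
```

On the real data you could try n_jobs=-1 in Parallel while leaving the forest's own n_jobs at its default, so the two levels of parallelism do not compete for cores. Whether this beats n_jobs=-1 inside IsolationForest would have to be measured.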