
Neural Activation Functions - Difference between Logistic / Tanh / etc

I'm writing some basic neural network methods - specifically the activation functions - and have hit the limits of my rubbish knowledge of math. I understand the respective ranges, (-1, 1), (0, 1), etc., but the varying descriptions and implementations have me confused.

Specifically: sigmoid, logistic, bipolar sigmoid, tanh, etc.

Does sigmoid simply describe the shape of the function irrespective of range? If so, then is tanh a 'sigmoid function'?

I have seen 'bipolar sigmoid' compared against 'tanh' in a paper; however, I have seen both functions implemented (in various libraries) with the same code:

((2 / (1 + Exp(-2 * n))) - 1). Are they exactly the same thing?

Likewise, I have seen logistic and sigmoid activations implemented with the same code:

(1 / (1 + Exp(-1 * n))). Are these also equivalent?

Lastly, does it even matter that much in practice? I see on Wikipedia a plot of very similar sigmoid functions - could any of these be used? Some look like they may be considerably faster to compute than others.

Logistic function: e^x / (e^x + e^c)

Special ("standard") case of the logistic function: 1/(1 + e -x )

Bipolar sigmoid: never heard of it.

Tanh: (e^x - e^(-x)) / (e^x + e^(-x))

Sigmoid usually refers to the shape (and limits), so yes, tanh is a sigmoid function. But in some contexts it refers specifically to the standard logistic function, so you have to be careful. And yes, you could use any sigmoid function and probably do just fine.

((2 / (1 + Exp(-2 * x))) - 1) is equivalent to tanh(x).
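Here's a quick numerical check of that equivalence (a minimal sketch in Python/NumPy; the function name is just for illustration):

    import numpy as np

    def bipolar_sigmoid(n):
        # (2 / (1 + e^(-2n))) - 1, the expression quoted in the question
        return (2.0 / (1.0 + np.exp(-2.0 * n))) - 1.0

    x = np.linspace(-5, 5, 11)
    print(np.allclose(bipolar_sigmoid(x), np.tanh(x)))  # True: same function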

Generally the most important differences are:

  a. smooth, continuously differentiable (like tanh and logistic) vs. step or truncated
  b. competitive vs. transfer
  c. sigmoid vs. radial
  d. symmetric (-1, +1) vs. asymmetric (0, 1)

Generally the differentiable requirement is needed for hidden layers, and tanh is often recommended as being more balanced. The 0 for tanh is at the fastest point (highest gradient or gain) and not a trap, while for the logistic, 0 is the lowest point and a trap for anything pushing deeper into negative territory. Radial (basis) functions are about distance from a typical prototype and are good for convex, circular regions around a neuron, while the sigmoid functions are about separating linearly and are good for half spaces - so many of them are required for a good approximation to a convex region, with circular/spherical regions being worst for sigmoids and best for radials.
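For intuition, here is a minimal sketch (Python/NumPy; the Gaussian width and prototype below are illustrative assumptions, not from the original answer) of the sigmoid-type activations next to a radial basis activation:

    import numpy as np

    def logistic(z):
        # asymmetric sigmoid, range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        # symmetric sigmoid, range (-1, +1)
        return np.tanh(z)

    def gaussian_rbf(x, prototype, width=1.0):
        # radial basis activation: responds to distance from a prototype vector,
        # so it carves out a roughly spherical region rather than a half space
        return np.exp(-np.sum((x - prototype) ** 2) / (2.0 * width ** 2))

    print(logistic(0.0), tanh(0.0))                                   # 0.5, 0.0
    print(gaussian_rbf(np.array([1.0, 1.0]), np.array([1.0, 1.0])))   # 1.0 at the prototype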

Generally, the recommendation is tanh on the intermediate layers for +/- balance, and to suit the output layer to the task (boolean/dichotomous class decision with a threshold, logistic, or competitive outputs, e.g. softmax, a self-normalizing multiclass generalization of the logistic; regression tasks can even be linear). The output layer doesn't need to be continuously differentiable.

The input layer should be normalized in some way, either to [0, 1] or, better still, standardized or normalized with demeaning to [-1, +1]. If you include a dummy input of 1 and then normalize so that ||x||_p = 1, you are dividing by a sum or length, and this magnitude information is retained in the dummy bias input rather than being lost. If you normalize over examples, this technically interferes with your test data if you look at them, or they may be out of range if you don't. But with L2 (||x||_2) normalization such variations or errors should approach the normal distribution if they are effects of natural variation or error, so with high probability they won't exceed the original range (roughly 2 standard deviations) by more than a small factor (viz. such over-range values are regarded as outliers and not significant).
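A minimal sketch of the dummy-bias trick described above (Python/NumPy; the function name is mine, and this assumes per-instance L2 normalization):

    import numpy as np

    def normalize_with_bias(x):
        # Append a dummy input of 1, then L2-normalize the whole vector.
        # The direction of x is preserved, and its original magnitude is
        # recoverable from the scaled bias component instead of being lost.
        x_aug = np.append(x, 1.0)
        return x_aug / np.linalg.norm(x_aug)

    x = np.array([3.0, 4.0])
    print(normalize_with_bias(x))  # [0.588..., 0.784..., 0.196...]; last entry encodes the scale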

So I recommend unbiased instance normalization or biased pattern standardization (or both) on the input layer (possibly with data reduction via SVD), tanh on the hidden layers, and a threshold, logistic, or competitive function on the output for classification, but linear with unnormalized targets, or perhaps logsig with normalized targets, for regression.
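Putting that recommendation together, a minimal forward-pass sketch for a classifier (Python/NumPy; the layer sizes and weight initialization are placeholder assumptions, not part of the original answer) might look like:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z):
        # competitive, self-normalizing multiclass output
        e = np.exp(z - z.max())
        return e / e.sum()

    # placeholder shapes: 3 (normalized) inputs -> 5 hidden units -> 2 classes
    W1 = rng.normal(scale=0.1, size=(5, 3))
    b1 = np.zeros(5)
    W2 = rng.normal(scale=0.1, size=(2, 5))
    b2 = np.zeros(2)

    def forward(x):
        h = np.tanh(W1 @ x + b1)      # tanh on the hidden layer (+/- balanced)
        return softmax(W2 @ h + b2)   # competitive output for classification

    print(forward(np.array([0.5, -0.2, 0.1])))  # two class probabilities summing to 1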

The word is (and I've tested it) that in some cases it might be better to use tanh than the logistic, since:

  1. An output near Y = 0 from the logistic, multiplied by a weight w, yields a value near 0, which doesn't have much effect on the upper layers it feeds into (although absence also has an effect); however, a value near Y = -1 from tanh, multiplied by a weight w, can yield a large number with more numeric effect.
  2. The derivative of tanh, 1 - y^2, yields values greater than that of the logistic, y(1 - y) = y - y^2. For example, when z = 0, the logistic function yields y = 0.5 and y' = 0.25, while for tanh y = 0 but y' = 1 (you can see this in general just by looking at the graphs); see the quick check after this list. Meaning that a tanh layer might learn faster than a logistic layer because of the magnitude of the gradient.
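A quick check of those derivative values at z = 0 (Python/NumPy; the function names are just for illustration):

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_grad(z):
        y = logistic(z)
        return y * (1.0 - y)      # y - y^2

    def tanh_grad(z):
        y = np.tanh(z)
        return 1.0 - y ** 2       # 1 - y^2

    z = 0.0
    print(logistic(z), logistic_grad(z))  # 0.5, 0.25
    print(np.tanh(z), tanh_grad(z))       # 0.0, 1.0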

Bipolar sigmoid = (1-e^(-x))/(1 + e^(-x))

A detailed explanation can be found here.
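Note that the form above, (1 - e^(-x)) / (1 + e^(-x)), equals tanh(x/2), whereas the expression from the question, (2 / (1 + e^(-2x))) - 1, equals tanh(x). A quick numerical check (Python/NumPy; the function name is mine):

    import numpy as np

    def bipolar_sigmoid_v2(x):
        # the form given just above
        return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))

    x = np.linspace(-5, 5, 11)
    print(np.allclose(bipolar_sigmoid_v2(x), np.tanh(x / 2.0)))  # True
    print(np.allclose(bipolar_sigmoid_v2(x), np.tanh(x)))        # False: different steepness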
