The sigmoid activation function is one of the most widely used mathematical functions in machine learning, especially in neural networks. Its key property is that it maps any real-valued input to a number between 0 and 1, which makes it well suited to binary classification tasks, where inputs must be assigned to one of two groups. Like all activation functions, sigmoid has pros and cons that make it a better fit for some problems than others.
This article explains the sigmoid activation function: its mathematical definition, properties, advantages, disadvantages, and applications in machine learning. It also covers the role of sigmoid in backpropagation and gradient-based optimization.
What is a Sigmoid Activation Function?
The sigmoid activation function is a mathematical function that is widely employed in neural networks, notably for binary classification problems. It converts any real-valued input into a value between 0 and 1, making it appropriate for describing probabilities.
A mathematical definition of the sigmoid function is:
S(x) = 1 / (1 + e^(-x))
Where:
- S(x) is the output of the sigmoid function.
- x is the input to the function.
- e is the base of the natural logarithm.
It is also called the logistic sigmoid function because it is a logistic function, and its graph follows the characteristic S-shaped curve. As the input x grows large in magnitude (positive or negative), the output approaches its limits of 1 and 0, respectively.
As an example:
- If x = 0, then S(0) = 1 / (1 + e^0) = 1/2 = 0.5.
- As x approaches positive infinity, S(x) approaches 1.
- As x approaches negative infinity, S(x) approaches 0.
The sigmoid function is excellent for binary classification because of its bounded output, which predicts probabilities from 0 to 1.
Properties of the Sigmoid Function
Important aspects of the sigmoid function affect its use in machine learning:
- Range of Output: For all real inputs, the sigmoid function outputs a value between 0 and 1. This is convenient for representing probabilities and making binary decisions; logistic regression and binary classifiers can use the sigmoid output to express the probability of the positive class.
- Smooth and Differentiable: The sigmoid function is smooth and differentiable everywhere. This fact is significant for optimization algorithms like gradient descent, which update model parameters using the activation function gradient.
- Monotonicity: The sigmoid function is monotonically increasing, so the output S(x) increases as the input x increases. This maintains a consistent and predictable input-output relationship.
- Symmetry Around 0: The sigmoid function is symmetric about the point (0, 0.5), satisfying S(-x) = 1 - S(x). It returns values between 0 and 0.5 for negative inputs and between 0.5 and 1 for positive inputs.
- Output Saturation: For very large positive or negative inputs, the sigmoid function saturates, producing outputs close to 1 or 0. This saturation effect can cause vanishing gradients, as illustrated in the sketch below and discussed later.
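To make the saturation property concrete, here is a minimal sketch in plain Python (standard library only; the sigmoid helper is defined just for this illustration) that evaluates the function at inputs of increasing magnitude:
import math

def sigmoid(x):
    # Logistic sigmoid: S(x) = 1 / (1 + e^(-x))
    return 1 / (1 + math.exp(-x))

# Outputs flatten toward 0 or 1 as |x| grows, which is the saturation effect.
for x in [-10, -2, 0, 2, 10]:
    print(f"S({x:+d}) = {sigmoid(x):.6f}")
# S(-10) is roughly 0.000045 and S(10) roughly 0.999955, so the curve is
# essentially flat at the extremes.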
Advantages of Sigmoid Activation Function
Despite its drawbacks, the sigmoid function has many benefits:
- Intuitive Interpretation: Because the sigmoid function's output is confined between 0 and 1, it can be read directly as a probability. This is useful in classification problems that involve estimating the likelihood of an event.
- Easy and Effective: The sigmoid function is easy to use and computationally efficient for small and large machine learning models. Its simplicity made it popular in early neural networks.
- Well-Behaved Derivative: The sigmoid function derivative is easy to calculate:
S'(x) = S(x)(1 - S(x))
This simple closed form makes backpropagation efficient, which matters for gradient-descent training of neural networks.
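As a quick sanity check, the following sketch (the names sigmoid and sigmoid_derivative are illustrative) compares the closed-form derivative S(x)(1 - S(x)) with a central finite-difference estimate:
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    # Closed form: S'(x) = S(x) * (1 - S(x))
    s = sigmoid(x)
    return s * (1 - s)

h = 1e-6  # step size for a central finite-difference check
for x in [-2.0, 0.0, 1.5]:
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(f"x = {x:+.1f}  closed form = {sigmoid_derivative(x):.6f}  numeric = {numeric:.6f}")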
Disadvantages of Sigmoid Activation Function
The sigmoid function offers advantages but also has limitations that must be considered when designing machine learning models:

- Vanishing Gradient Problem: The vanishing gradient is the major issue with the sigmoid function. When x is large in magnitude (positive or negative), the gradient of the sigmoid becomes very small, resulting in tiny weight updates during backpropagation and slow training. Specifically, as |x| grows, S'(x) approaches 0, which hinders the training of deep networks (see the sketch after this list).
- Output Not Zero-Centered: The sigmoid function always returns positive values between 0 and 1. Because the outputs are not zero-centered, the weight gradients in a layer tend to all share the same sign during training, making learning less efficient; slow convergence and longer training times can result.
- Computationally Expensive for Large Networks: The sigmoid function is cheap for small models but can become costly in deep networks, because the exponential and its gradient must be evaluated for every activation during training.
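The sketch below is a simplified, illustrative calculation (not a full backpropagation) of the vanishing gradient issue: the derivative never exceeds 0.25, so a stack of sigmoid layers multiplies several small factors together and the product shrinks rapidly.
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# The derivative peaks at 0.25 (at x = 0) and decays quickly for large |x|.
for x in [0, 2, 5, 10]:
    print(f"S'({x}) = {sigmoid_derivative(x):.6f}")

# A crude picture of depth: each sigmoid layer contributes a factor of at most
# 0.25 to the gradient (ignoring weights), so the product shrinks geometrically.
bound = 0.25 ** 10
print(f"upper bound on the activation-gradient product after 10 layers: {bound:.2e}")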
Neural Network Sigmoid Activation Function
For binary classification, neural networks use the sigmoid activation function in hidden and output layers. In the output layer, the sigmoid gives the probability of the positive class for binary classification problems. A neural network's neurons accept inputs, weight them, apply an activation function (such as sigmoid), and send the result to the next layer or the output.
Sigmoid in Binary Classification: In binary classification, the neural network's output layer uses the sigmoid function. Given a binary target value (0 or 1), the sigmoid converts the weighted sum of inputs into a value between 0 and 1 representing the probability of class membership.
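As a simplified illustration, a single output neuron for binary classification computes a weighted sum and passes it through the sigmoid to obtain a class probability. The weights, bias, and feature values below are invented for the example:
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical trained parameters and one input example (values are made up).
weights = [0.8, -1.2, 0.5]
bias = 0.1
features = [1.0, 0.3, 2.0]

# Weighted sum of inputs (pre-activation), then sigmoid to get a probability.
z = sum(w * f for w, f in zip(weights, features)) + bias
probability = sigmoid(z)
predicted_class = 1 if probability >= 0.5 else 0

print(f"z = {z:.2f}, P(class = 1) = {probability:.3f}, predicted class = {predicted_class}")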
Sigmoid in Multi-layer Perceptrons: Hidden layers of multi-layer perceptrons (MLPs) can use the sigmoid function. The vanishing gradient problem makes sigmoid activation less prevalent in newer deep neural networks, while ReLU (Rectified Linear Unit) is preferred due to its improved performance.
Sigmoid and Backpropagation
In the backpropagation algorithm, the derivative of the activation function is used to adjust the network weights during training. The derivative of the sigmoid function is:
S'(x) = S(x)(1 - S(x))
Backpropagation propagates errors backwards across the network and computes gradients using the chain rule. Gradient descent or other optimization algorithms adjust the weights using gradients.
The gradient of the sigmoid function is very small when a neuron's output is close to 0 or 1, resulting in small weight changes. Deep networks, where gradients can shrink dramatically as they pass back through many layers, are especially affected by this issue.
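The following sketch shows this chain rule in miniature for a single-input sigmoid neuron trained with squared error (squared error is used here only to keep the algebra short; binary classifiers more commonly use cross-entropy loss, and all values are illustrative):
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# One training example for a single-input sigmoid neuron (illustrative values).
x, target = 1.5, 1.0
w, b = 0.2, 0.0
learning_rate = 0.1

# Forward pass.
z = w * x + b
y = sigmoid(z)
loss = 0.5 * (y - target) ** 2

# Backward pass via the chain rule:
# dL/dw = (y - target) * S'(z) * x, with S'(z) = y * (1 - y).
delta = (y - target) * y * (1 - y)
dL_dw = delta * x
dL_db = delta

# Gradient-descent update.
w -= learning_rate * dL_dw
b -= learning_rate * dL_db

print(f"loss = {loss:.4f}, dL/dw = {dL_dw:.4f}, updated w = {w:.4f}")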
Alternatives to Sigmoid
To solve the vanishing gradient problem and enhance training efficiency, various activation functions have been devised to overcome the sigmoid’s constraints. These alternatives include:
ReLU (Rectified Linear Unit)
The default activation function for most deep neural networks is ReLU. It addresses the vanishing gradient problem by passing positive inputs unmodified and zeroing negative inputs. ReLU speeds up deep network training due to its computational efficiency.
Leaky ReLU and Parametric ReLU
Leaky ReLU and Parametric ReLU provide a small, non-zero gradient for negative inputs, preventing the “dying ReLU” problem, where neurons stop learning.
Tanh (Hyperbolic Tangent)
Similar to sigmoid, Tanh produces values between -1 and 1, making it zero-centered. It is widely used in hidden layers as an alternative to sigmoid because its zero-centered output range addresses some of sigmoid's drawbacks.
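For comparison, here are minimal implementations of these alternatives alongside sigmoid; the 0.01 slope used for Leaky ReLU is a common but arbitrary choice, shown only for illustration:
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def relu(x):
    # Passes positive inputs through unchanged and zeroes out negatives.
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # A small non-zero slope for negative inputs avoids "dead" neurons.
    return x if x > 0 else slope * x

def tanh(x):
    # Zero-centered, with outputs between -1 and 1.
    return math.tanh(x)

for x in [-2.0, 0.0, 2.0]:
    print(f"x = {x:+.1f}  sigmoid = {sigmoid(x):.3f}  relu = {relu(x):.3f}  "
          f"leaky_relu = {leaky_relu(x):.3f}  tanh = {tanh(x):.3f}")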
Applications of Sigmoid in Machine Learning
The sigmoid function is popular for binary classification and logistic regression in machine learning. Common uses include:
- Binary classification: Predicting if an input is spam or not, or whether a client would buy a product.
- Logistic Regression: The sigmoid function maps predictions to probabilities in logistic regression, which predicts the outcome of a categorical dependent variable (a minimal example follows this list).
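As a minimal sketch of this use case, the example below assumes scikit-learn is available and fits a logistic regression model to a made-up toy dataset:
# A minimal, illustrative logistic regression example; assumes scikit-learn is
# installed, and the toy dataset below is made up.
from sklearn.linear_model import LogisticRegression

# Toy dataset: one feature per example and binary labels.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] for each example; internally,
# the positive-class probability is the sigmoid of a linear score.
for features, probs in zip(X, model.predict_proba(X)):
    print(f"x = {features[0]:.1f} -> P(y = 1) = {probs[1]:.3f}")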
Sigmoid Activation Function Python
import math

# Sigmoid function implementation
def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Test the function with some examples
inputs = [0, 2, -3, 5]
outputs = [sigmoid(x) for x in inputs]

# Print the results
for x, output in zip(inputs, outputs):
    print(f"sigmoid({x}) = {output}")
Output:
sigmoid(0) = 0.5
sigmoid(2) = 0.8807970779778823
sigmoid(-3) = 0.04742587317756678
sigmoid(5) = 0.9933071490757153
In the code:
- The sigmoid function is defined as 1 / (1 + e^(-x)).
- We then run the function with a few different input values (0, 2, -3, and 5).
- The results demonstrate how the sigmoid function maps inputs to values between 0 and 1.
This is the fundamental implementation, using math.exp() for the exponential. NumPy can also be used to perform more efficient vectorized computations, particularly when working with arrays or matrices, as sketched below.
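For reference, a vectorized version might look like the following sketch, assuming NumPy is installed:
import numpy as np

def sigmoid(x):
    # np.exp works elementwise, so x can be a scalar, a vector, or a matrix.
    return 1 / (1 + np.exp(-x))

inputs = np.array([0, 2, -3, 5])
print(sigmoid(inputs))
# Matches the scalar results above: [0.5, 0.8808..., 0.0474..., 0.9933...]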
Conclusion
The sigmoid activation function underpins machine learning and neural networks. With its smooth, differentiable nature and ability to produce probabilities, it is ideal for binary classification applications and gradient-based neural network training.
The sigmoid function's disadvantages, including vanishing gradients and non-zero-centered outputs, can hinder deep network training. Although ReLU and Tanh are now preferred for deep learning, sigmoid is still used in many applications and holds an important place in the history of machine learning.
Understanding the advantages and limitations of activation functions such as the sigmoid is critical for developing successful and efficient machine learning models for a wide range of applications as technology advances.