Layer Normalization

A technique used in neural networks that normalizes the inputs across the features within a layer, improving training stability and model performance, particularly in recurrent and transformer models.

Layer normalization works by normalizing the summed inputs to the neurons of a layer (the pre-activations) so that they have zero mean and unit variance across the feature dimension; learned gain and bias parameters then rescale and shift the normalized values so the layer retains its expressive power. This differs from batch normalization, which normalizes each feature across the batch dimension. Because its statistics are computed per example rather than per batch, layer normalization is especially useful when batch sizes are small or variable, as in recurrent neural networks (RNNs) and transformers, where it stabilizes training by reducing internal covariate shift. The technique also makes the network less sensitive to the scale of the weights and inputs, which can speed up convergence and improve generalization.
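
The computation can be sketched in a few lines of NumPy. This is a minimal illustration under the description above, not the implementation of any particular framework; the function name and the small epsilon constant (used to avoid division by zero) are conventional choices, not something defined in this article.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample of x across its feature dimension.

    x:     array of shape (batch, features)
    gamma: learned gain, shape (features,)
    beta:  learned bias, shape (features,)
    """
    mean = x.mean(axis=-1, keepdims=True)    # per-sample mean over features
    var = x.var(axis=-1, keepdims=True)      # per-sample variance over features
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per sample
    return gamma * x_hat + beta              # learned rescale and shift

# Example: normalize a batch of two feature vectors
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -5.0, 5.0]])
gamma = np.ones(4)
beta = np.zeros(4)
print(layer_norm(x, gamma, beta))
```

Note that the mean and variance are taken over the feature axis of each individual example, which is why the result does not depend on the batch size.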

Layer normalization was first introduced in 2016 by Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey Hinton in their paper "Layer Normalization." It gained popularity with the rise of transformer-based models, where it became a standard component due to its effectiveness in stabilizing training in complex architectures like BERT and GPT.

The primary contributors to the development of layer normalization are Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, who proposed the method in their 2016 paper. Their work built on earlier normalization techniques in deep learning, most notably batch normalization, extending normalization beyond convolutional and batch-processing networks to more diverse architectures.

Explainer

Bringing harmony to neural networks, one layer at a time

Layer Normalization adjusts the scale of activations within each layer of a neural network: varied neuron values are transformed to have zero mean and unit variance, a key technique that helps stabilize and accelerate neural network training, especially in transformers and deep networks.

Key insight: normalization acts on the layer as a group, not on individual neurons. When one neuron's activation becomes very different from the others, it shifts the layer's mean and variance, so every neuron's normalized value changes. The outlier simply ends up the most "standard deviations away from the mean" in the normalized space, while the relative pattern across the layer is preserved.
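
To make the "standard deviations away from the mean" framing concrete, here is a small hand-checkable example in the same style as the sketch above; the activation values are invented purely for illustration.

```python
import numpy as np

# A single layer's activations, with one outlier neuron (9.0)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()            # 5.0
std = x.std()              # 2.0 (population standard deviation)
x_hat = (x - mean) / std   # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

print(x_hat)  # the outlier sits 2 standard deviations above the mean
```

The outlier is not clipped or treated specially; because it pulls the layer's mean and standard deviation with it, every normalized value in the layer shifts together.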