Mixture of Experts
This document describes the Mixture of Experts (MoE) components used in the DMS model. MoE is a technique that improves model capacity and efficiency by routing different inputs to specialized sub-networks (experts).
Position-wise Feed-forward Layer
The standard position-wise feed-forward network serves as the baseline for comparison with MoE layers. It processes each position in the sequence independently through a two-layer network that expands the hidden size and then projects back to the model dimension. This is the traditional sub-layer used in transformer models.
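The sketch below shows the general shape of such a layer. It is illustrative only; the class name, the GELU activation, and the dimensions are assumptions, not the DMS implementation.

```python
import torch
import torch.nn as nn


class FeedForwardSketch(nn.Module):
    """Minimal position-wise FFN: expand to D * ff_mult, then project back to D."""

    def __init__(self, hid_dim: int, ff_mult: int, dropout: float) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hid_dim, hid_dim * ff_mult),  # expansion
            nn.GELU(),                              # activation (assumed choice)
            nn.Dropout(dropout),
            nn.Linear(hid_dim * ff_mult, hid_dim),  # projection back to model dim
        )

    def forward(self, x_BLD: torch.Tensor) -> torch.Tensor:
        # Applied independently at every position: (B, L, D) -> (B, L, D).
        return self.net(x_BLD)
```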
Noisy Top-k Router
The router is the decision-making component of the MoE system: it decides which experts should process each token. Key features:
- Uses learned routing weights to match tokens with relevant experts
- Adds learned noise to encourage exploration and prevent expert collapse
- Selects top-k experts per token to enable specialization while maintaining redundancy
- Produces sparse routing probabilities to enable efficient computation
The noise mechanism is particularly important (see the routing sketch after this list), as it:
- Prevents tokens from always taking the same path
- Helps balance load across experts
- Improves training stability
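The sketch below illustrates the routing recipe described above: compute routing logits, add learned input-dependent noise, keep only the top-k logits per token, and softmax over the survivors so the probabilities are sparse. It is a minimal illustration under assumed names and details, not the DMS implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyTopkRouterSketch(nn.Module):
    def __init__(self, hid_dim: int, n_experts: int, top_k: int) -> None:
        super().__init__()
        self.top_k = top_k
        self.route_linear = nn.Linear(hid_dim, n_experts)  # routing logits
        self.noise_linear = nn.Linear(hid_dim, n_experts)  # per-token noise scale

    def forward(self, x_BLD: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        logits_BLE = self.route_linear(x_BLD)
        # Learned, input-dependent noise encourages exploration and load balance.
        noise_BLE = torch.randn_like(logits_BLE) * F.softplus(self.noise_linear(x_BLD))
        noisy_BLE = logits_BLE + noise_BLE
        # Keep only the top-k logits per token; the rest become -inf.
        topk_vals_BLK, topk_idx_BLK = noisy_BLE.topk(self.top_k, dim=-1)
        sparse_BLE = torch.full_like(noisy_BLE, float("-inf"))
        sparse_BLE.scatter_(-1, topk_idx_BLK, topk_vals_BLK)
        # Softmax over the surviving logits yields sparse routing probabilities.
        return F.softmax(sparse_BLE, dim=-1), topk_idx_BLK
```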
Expert Network
Each expert is a feed-forward network that becomes tuned to handle specific types of tokens or patterns. The architecture mirrors the standard feed-forward layer, but each expert can learn a different specialization. For example:
- Some experts might focus on syntax
- Others on specific vocabulary domains
- Others on particular transformation patterns
Sparse MoE Layer
This is where everything comes together into an efficient, scalable system:
- Token Routing: The router examines each token and decides which experts should process it
- Load Balancing:
    - Uses capacity factors to prevent expert overload
    - Ensures even utilization of experts
    - Handles cases where too many tokens want the same expert
- Parallel Processing:
    - Tokens are grouped by assigned expert
    - Each expert processes its assigned group
    - Results are combined based on routing weights
This sparse computation pattern makes an MoE layer far cheaper than densely running every token through every expert; a simplified dispatch-and-combine sketch is shown below.
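The following sketch illustrates the dispatch-and-combine step under assumed shapes and helper names; it is not the library code, and the capacity computation is a simplified guess at how a capacity factor is typically applied.

```python
import torch
import torch.nn as nn


def sparse_moe_combine(
    x_BLD: torch.Tensor,
    probs_BLE: torch.Tensor,      # sparse routing probabilities from the router
    topk_idx_BLK: torch.Tensor,   # top-k expert indices per token
    experts: nn.ModuleList,       # one feed-forward expert per index
    capacity_factor: float = 1.25,
) -> torch.Tensor:
    B, L, D = x_BLD.shape
    n_experts, top_k = probs_BLE.shape[-1], topk_idx_BLK.shape[-1]
    flat_x = x_BLD.reshape(-1, D)                  # (B*L, D)
    flat_probs = probs_BLE.reshape(-1, n_experts)  # (B*L, E)
    flat_idx = topk_idx_BLK.reshape(-1, top_k)     # (B*L, K)
    out = torch.zeros_like(flat_x)
    # Each expert may process at most `capacity` tokens; overflow is dropped.
    capacity = int(capacity_factor * flat_x.shape[0] * top_k / n_experts)
    for e, expert in enumerate(experts):
        token_ids = (flat_idx == e).any(dim=-1).nonzero(as_tuple=True)[0]
        token_ids = token_ids[:capacity]
        if token_ids.numel() == 0:
            continue
        expert_out = expert(flat_x[token_ids])            # (S, D)
        weights = flat_probs[token_ids, e].unsqueeze(-1)  # routing weights
        out[token_ids] += weights * expert_out            # weighted combine
    return out.reshape(B, L, D)
```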
Intuition Behind MoE
Think of MoE like a team of specialists:
- Instead of every token passing through the same general-purpose network, tokens are routed to the experts best suited to process them
- Each expert becomes specialized in handling certain types of patterns
- The router learns to match tokens with the right experts
This specialization allows the model to:
- Handle a wider range of patterns effectively
- Scale capacity without scaling computation for every token
- Develop focused expertise in different aspects of the task
Source Code
`directmultistep.model.components.moe`
Expert
Bases: Module
A single expert in the MoE layer.
Applies a two-layer feedforward network to the input.
Shape suffixes
- B: batch size
- L: sequence length
- D: model dimension
- F: feed-forward subnetwork hidden size
Source code in src/directmultistep/model/components/moe.py
__init__(hid_dim, ff_mult, ff_activation, dropout)
Initializes the Expert.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`hid_dim` | `int` | The hidden dimension size (D). | required |
`ff_mult` | `int` | The feed-forward expansion factor. | required |
`ff_activation` | `str` | The activation function type. | required |
`dropout` | `float` | The dropout rate. | required |
Source code in src/directmultistep/model/components/moe.py
forward(x_BLD)
Forward pass of the Expert.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`x_BLD` | `Tensor` | The input tensor of shape (B, L, D). | required |
Returns:
Type | Description |
---|---|
`Tensor` | The output tensor of shape (B, L, D). |
Source code in src/directmultistep/model/components/moe.py
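A usage sketch based on the documented signature; the dimension values and the "gelu" activation string are assumptions.

```python
import torch
from directmultistep.model.components.moe import Expert

# "gelu" is an assumed value for ff_activation; dimensions are illustrative.
expert = Expert(hid_dim=256, ff_mult=4, ff_activation="gelu", dropout=0.1)
x_BLD = torch.randn(2, 16, 256)  # (B, L, D)
y_BLD = expert(x_BLD)            # expected shape: (2, 16, 256)
```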
NoisyTopkRouter
Bases: Module
Noisy top-k router for MoE.
Routes inputs to the top-k experts based on noisy logits.
Shape suffixes
- B: batch size
- L: sequence length
- D: model dimension
- E: number of experts
- K: top_k
Source code in src/directmultistep/model/components/moe.py
__init__(hid_dim, n_experts, top_k)
Initializes the NoisyTopkRouter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`hid_dim` | `int` | The hidden dimension size (D). | required |
`n_experts` | `int` | The number of experts (E). | required |
`top_k` | `int` | The number of top experts to route to (K). | required |
Source code in src/directmultistep/model/components/moe.py
forward(x_BLD)
Forward pass of the NoisyTopkRouter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`x_BLD` | `Tensor` | The input tensor of shape (B, L, D). | required |
Returns:
Type | Description |
---|---|
`tuple[Tensor, Tensor]` | A tuple containing the router output tensor of shape (B, L, E) and the indices of the top-k experts of shape (B, L, K). |
Source code in src/directmultistep/model/components/moe.py
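A usage sketch based on the documented signature; the dimension values are illustrative.

```python
import torch
from directmultistep.model.components.moe import NoisyTopkRouter

router = NoisyTopkRouter(hid_dim=256, n_experts=8, top_k=2)
x_BLD = torch.randn(2, 16, 256)          # (B, L, D)
probs_BLE, topk_idx_BLK = router(x_BLD)  # expected shapes: (2, 16, 8) and (2, 16, 2)
```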
PositionwiseFeedforwardLayer
Bases: Module
Positionwise feedforward layer.
Applies a two-layer feedforward network to the input.
Shape suffixes
- B: batch size
- L: sequence length
- D: model dimension
- F: feed-forward subnetwork hidden size
Source code in src/directmultistep/model/components/moe.py
__init__(hid_dim, ff_mult, ff_activation, dropout)
Initializes the PositionwiseFeedforwardLayer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`hid_dim` | `int` | The hidden dimension size (D). | required |
`ff_mult` | `int` | The feed-forward expansion factor. | required |
`ff_activation` | `Module` | The activation function. | required |
`dropout` | `float` | The dropout rate. | required |
Source code in src/directmultistep/model/components/moe.py
forward(x_BLD)
Forward pass of the PositionwiseFeedforwardLayer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`x_BLD` | `Tensor` | The input tensor of shape (B, L, D). | required |
Returns:
Type | Description |
---|---|
`Tensor` | The output tensor of shape (B, L, D). |
Source code in src/directmultistep/model/components/moe.py
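A usage sketch based on the documented signature; note that here `ff_activation` is a Module instance, and `nn.GELU()` is an assumed choice.

```python
import torch
import torch.nn as nn
from directmultistep.model.components.moe import PositionwiseFeedforwardLayer

ffn = PositionwiseFeedforwardLayer(hid_dim=256, ff_mult=4, ff_activation=nn.GELU(), dropout=0.1)
x_BLD = torch.randn(2, 16, 256)  # (B, L, D)
y_BLD = ffn(x_BLD)               # expected shape: (2, 16, 256)
```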
SparseMoE
Bases: Module
Sparse Mixture of Experts layer.
Routes inputs to a subset of experts and combines their outputs.
Shape suffixes
- B: batch size
- L: sequence length
- D: model dimension
- E: number of experts
- K: top_k
- S: number of selected tokens for an expert
Source code in src/directmultistep/model/components/moe.py
__init__(hid_dim, n_experts, top_k, ff_mult, ff_activation, dropout, capacity_factor)
Initializes the SparseMoE layer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`hid_dim` | `int` | The hidden dimension size (D). | required |
`n_experts` | `int` | The number of experts (E). | required |
`top_k` | `int` | The number of top experts to route to (K). | required |
`ff_mult` | `int` | The feed-forward expansion factor. | required |
`ff_activation` | `str` | The activation function type. | required |
`dropout` | `float` | The dropout rate. | required |
`capacity_factor` | `float` | The capacity factor for each expert. | required |
Source code in src/directmultistep/model/components/moe.py
forward(x_BLD)
Forward pass of the SparseMoE layer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`x_BLD` | `Tensor` | The input tensor of shape (B, L, D). | required |
Returns:
Type | Description |
---|---|
`Tensor` | The output tensor of shape (B, L, D). |
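A usage sketch based on the documented signature; all argument values, including the "gelu" activation string, are assumptions.

```python
import torch
from directmultistep.model.components.moe import SparseMoE

moe = SparseMoE(
    hid_dim=256,
    n_experts=8,
    top_k=2,
    ff_mult=4,
    ff_activation="gelu",   # assumed activation name
    dropout=0.1,
    capacity_factor=1.25,
)
x_BLD = torch.randn(2, 16, 256)  # (B, L, D)
y_BLD = moe(x_BLD)               # expected shape: (2, 16, 256)
```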