# Decoder
This document describes the decoder components used in the DMS model.
## Base Decoder Layer
The basic building block of the decoder, which processes target sequences. It consists of three sub-blocks, chained as in the sketch after this list.
### Components
#### Self-Attention Block
- Multi-head self-attention mechanism
- Causal masking to prevent looking ahead
- Layer normalization
- Residual connection
#### Cross-Attention Block
- Multi-head attention over encoder outputs
- Allows decoder to focus on relevant input parts
- Layer normalization
- Residual connection
#### Feed-Forward Block
- Two-layer feed-forward network
- Configurable activation function (ReLU or GELU)
- Layer normalization
- Residual connection
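To make the ordering concrete, here is a minimal PyTorch-style sketch of how these three blocks are typically chained in a decoder layer. It is illustrative only: the module names, post-norm placement, and the use of `nn.MultiheadAttention` are assumptions, not the actual DMS implementation (see `DecoderLayer` below).

```python
import torch
import torch.nn as nn

class SketchDecoderLayer(nn.Module):
    """Illustrative only: shows the self-attn -> cross-attn -> FFN ordering."""

    def __init__(self, hid_dim: int, n_heads: int, dropout: float, ff_mult: int) -> None:
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hid_dim, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hid_dim, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(hid_dim, ff_mult * hid_dim),
            nn.GELU(),  # or nn.ReLU(), per the configurable activation above
            nn.Linear(ff_mult * hid_dim, hid_dim),
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(hid_dim) for _ in range(3))

    def forward(self, trg: torch.Tensor, enc_src: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        # causal_mask: bool of shape (L, L); True marks pairs that must NOT attend,
        # e.g. torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1).
        attn, _ = self.self_attn(trg, trg, trg, attn_mask=causal_mask)
        trg = self.norm1(trg + attn)                       # residual + layer norm
        cross, _ = self.cross_attn(trg, enc_src, enc_src)  # attend over encoder outputs
        trg = self.norm2(trg + cross)
        return self.norm3(trg + self.ff(trg))              # feed-forward block
```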
## Source Code

`directmultistep.model.components.decoder`
### Decoder

Bases: `Module`

The decoder module.
Shape suffixes convention:

- `B`: batch size
- `C`: the length of the input on which conditioning is done (in our case `input_max_length`)
- `L`: sequence length for the decoder (in our case `output_max_length`)
- `D`: model dimension (sometimes called `d_model` or `embedding_dim`)
- `V`: vocabulary size
Source code in `src/directmultistep/model/components/decoder.py`
#### `__init__(vocab_dim, hid_dim, context_window, n_layers, n_heads, dropout, attn_bias, ff_mult, ff_activation)`

Initializes the Decoder.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `vocab_dim` | `int` | The vocabulary size. | required |
| `hid_dim` | `int` | The hidden dimension size. | required |
| `context_window` | `int` | The context window size. | required |
| `n_layers` | `int` | The number of decoder layers. | required |
| `n_heads` | `int` | The number of attention heads. | required |
| `dropout` | `float` | The dropout rate. | required |
| `attn_bias` | `bool` | Whether to use bias in the attention layers. | required |
| `ff_mult` | `int` | The feed-forward expansion factor. | required |
| `ff_activation` | `str` | The activation function type. | required |
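As a quick orientation, a constructor call might look like the following. The hyperparameter values are illustrative, and the exact string accepted for `ff_activation` (the components list above mentions ReLU and GELU) is an assumption:

```python
from directmultistep.model.components.decoder import Decoder

decoder = Decoder(
    vocab_dim=256,        # tokenizer vocabulary size (illustrative)
    hid_dim=512,          # model dimension D
    context_window=1024,  # maximum sequence length
    n_layers=6,
    n_heads=8,
    dropout=0.1,
    attn_bias=False,
    ff_mult=4,            # FFN inner dim = ff_mult * hid_dim
    ff_activation="gelu", # assumed spelling of the activation name
)
```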
#### `forward(trg_BL, enc_src_BCD, src_mask_B11C, trg_mask_B1LL=None)`

Forward pass of the Decoder.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `trg_BL` | `Tensor` | The target sequence tensor of shape (B, L). | required |
| `enc_src_BCD` | `Tensor` | The encoder output tensor of shape (B, C, D). | required |
| `src_mask_B11C` | `Tensor` | The source mask tensor of shape (B, 1, 1, C). | required |
| `trg_mask_B1LL` | `Tensor \| None` | The target mask tensor of shape (B, 1, L, L). | `None` |
Returns:

| Type | Description |
| --- | --- |
| `Tensor` | The output tensor of shape (B, L, V). |
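Reusing the `decoder` instance from the constructor example above, a forward call with the documented shapes might look like this. The boolean mask convention (True = may attend) is an assumption; check the attention implementation before relying on it:

```python
import torch

B, C, L, D, V = 2, 32, 48, 512, 256  # illustrative sizes; D must equal hid_dim

trg_BL = torch.randint(0, V, (B, L))                      # target token ids
enc_src_BCD = torch.randn(B, C, D)                        # encoder outputs
src_mask_B11C = torch.ones(B, 1, 1, C, dtype=torch.bool)  # all source positions visible
# Causal target mask: position i may attend to positions <= i.
trg_mask_B1LL = torch.tril(torch.ones(L, L, dtype=torch.bool)).expand(B, 1, L, L)

logits_BLV = decoder(trg_BL, enc_src_BCD, src_mask_B11C, trg_mask_B1LL)
assert logits_BLV.shape == (B, L, V)
```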
### DecoderLayer

Bases: `Module`

A single layer of the decoder.
Shape suffixes convention:

- `B`: batch size
- `C`: the length of the input on which conditioning is done (in our case `input_max_length`)
- `L`: sequence length for the decoder (in our case `output_max_length`)
- `D`: model dimension (sometimes called `d_model` or `embedding_dim`)
#### `__init__(hid_dim, n_heads, dropout, attn_bias, ff_mult, ff_activation)`

Initializes the DecoderLayer.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `hid_dim` | `int` | The hidden dimension size. | required |
| `n_heads` | `int` | The number of attention heads. | required |
| `dropout` | `float` | The dropout rate. | required |
| `attn_bias` | `bool` | Whether to use bias in the attention layers. | required |
| `ff_mult` | `int` | The feed-forward expansion factor. | required |
| `ff_activation` | `str` | The activation function type. | required |
#### `forward(trg_BLD, enc_src_BCD, src_mask_B11C, trg_mask_B1LL)`

Forward pass of the DecoderLayer.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `trg_BLD` | `Tensor` | The target sequence tensor of shape (B, L, D). | required |
| `enc_src_BCD` | `Tensor` | The encoder output tensor of shape (B, C, D). | required |
| `src_mask_B11C` | `Tensor` | The source mask tensor of shape (B, 1, 1, C). | required |
| `trg_mask_B1LL` | `Tensor` | The target mask tensor of shape (B, 1, L, L). | required |
Returns:

| Type | Description |
| --- | --- |
| `Tensor` | The output tensor of shape (B, L, D). |
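Note the contrast with `Decoder.forward`: a `DecoderLayer` consumes already-embedded targets of shape (B, L, D) and returns (B, L, D), with no vocabulary projection. A hypothetical standalone call, under the same mask assumptions as above:

```python
import torch
from directmultistep.model.components.decoder import DecoderLayer

layer = DecoderLayer(hid_dim=512, n_heads=8, dropout=0.1,
                     attn_bias=False, ff_mult=4, ff_activation="gelu")

B, C, L, D = 2, 32, 48, 512
out_BLD = layer(
    torch.randn(B, L, D),                      # trg_BLD: embedded targets, not token ids
    torch.randn(B, C, D),                      # enc_src_BCD: encoder outputs
    torch.ones(B, 1, 1, C, dtype=torch.bool),  # src_mask_B11C
    torch.tril(torch.ones(L, L, dtype=torch.bool)).expand(B, 1, L, L),  # trg_mask_B1LL
)
assert out_BLD.shape == (B, L, D)
```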
### MoEDecoder

Bases: `Decoder`

The decoder module with Mixture of Experts in the feed-forward layers.
#### `__init__(vocab_dim, hid_dim, context_window, n_layers, n_heads, dropout, attn_bias, ff_mult, ff_activation, n_experts, top_k, capacity_factor)`

Initializes the MoEDecoder.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `vocab_dim` | `int` | The vocabulary size. | required |
| `hid_dim` | `int` | The hidden dimension size. | required |
| `context_window` | `int` | The context window size. | required |
| `n_layers` | `int` | The number of decoder layers. | required |
| `n_heads` | `int` | The number of attention heads. | required |
| `dropout` | `float` | The dropout rate. | required |
| `attn_bias` | `bool` | Whether to use bias in the attention layers. | required |
| `ff_mult` | `int` | The feed-forward expansion factor. | required |
| `ff_activation` | `str` | The activation function type. | required |
| `n_experts` | `int` | The number of experts in the MoE layer. | required |
| `top_k` | `int` | The number of experts each token is routed to in the MoE layer. | required |
| `capacity_factor` | `float` | The capacity factor for the MoE layer. | required |
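Construction mirrors the base `Decoder`, with three extra MoE arguments. The values below are illustrative:

```python
from directmultistep.model.components.decoder import MoEDecoder

moe_decoder = MoEDecoder(
    vocab_dim=256,
    hid_dim=512,
    context_window=1024,
    n_layers=6,
    n_heads=8,
    dropout=0.1,
    attn_bias=False,
    ff_mult=4,
    ff_activation="gelu",
    n_experts=8,           # experts per feed-forward layer
    top_k=2,               # experts activated per token
    capacity_factor=1.25,  # slack for uneven token-to-expert assignment
)
```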
### MoEDecoderLayer

Bases: `DecoderLayer`

A single layer of the decoder with Mixture of Experts in the feed-forward layer.
#### `__init__(hid_dim, n_heads, dropout, attn_bias, ff_mult, ff_activation, n_experts, top_k, capacity_factor)`

Initializes the MoEDecoderLayer.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `hid_dim` | `int` | The hidden dimension size. | required |
| `n_heads` | `int` | The number of attention heads. | required |
| `dropout` | `float` | The dropout rate. | required |
| `attn_bias` | `bool` | Whether to use bias in the attention layers. | required |
| `ff_mult` | `int` | The feed-forward expansion factor. | required |
| `ff_activation` | `str` | The activation function type. | required |
| `n_experts` | `int` | The number of experts in the MoE layer. | required |
| `top_k` | `int` | The number of experts each token is routed to in the MoE layer. | required |
| `capacity_factor` | `float` | The capacity factor for the MoE layer. | required |
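To clarify how `n_experts`, `top_k`, and `capacity_factor` interact, here is a generic top-k routing sketch. This is the standard MoE recipe, not the library's actual router; the helper name and capacity formula are illustrative:

```python
import torch
import torch.nn.functional as F

def route_top_k(x_ND: torch.Tensor, gate: torch.nn.Linear,
                top_k: int, capacity_factor: float) -> tuple[torch.Tensor, torch.Tensor, int]:
    """Generic top-k gating over N tokens and E experts (illustrative)."""
    logits_NE = gate(x_ND)                                  # (N, n_experts) router scores
    scores_NK, experts_NK = logits_NE.topk(top_k, dim=-1)   # best top_k experts per token
    weights_NK = F.softmax(scores_NK, dim=-1)               # renormalized mixture weights
    n_tokens, n_experts = logits_NE.shape
    # Each expert processes at most `capacity` tokens; overflow is dropped or re-routed.
    capacity = int(capacity_factor * n_tokens * top_k / n_experts)
    return weights_NK, experts_NK, capacity

# Example: 48 tokens, 8 experts, 2 active per token.
x_ND = torch.randn(48, 512)
gate = torch.nn.Linear(512, 8)
weights, experts, capacity = route_top_k(x_ND, gate, top_k=2, capacity_factor=1.25)
```

A higher `capacity_factor` tolerates more imbalance in the router's assignments at the cost of extra compute; values slightly above 1.0 are a common starting point.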