Torch Dataset for Routes
This module provides a custom PyTorch Dataset class for handling reaction routes. It includes functionalities for tokenizing SMILES strings, reaction paths, and context information, as well as preparing data for training and generation.
Example Use
tokenize_path_string is the most important function. It tokenizes a reaction path string: a regular expression splits the string into tokens, and start-of-sequence (<SOS>) and end-of-sequence (<EOS>) tokens can optionally be added.
from directmultistep.utils.dataset import tokenize_path_string
path_string = "{'smiles':'CC','children':[{'smiles':'CC(=O)O'}]}"
tokens = tokenize_path_string(path_string)
print(tokens)
Notes on Path Start
In the RoutesDataset class, the get_generation_with_sm and get_generation_no_sm methods return an initial path tensor. This tensor is created from a path_start string, a partial path string from which the model starts generating. The path_start is "{'smiles': 'product_smiles', 'children': [{'smiles':". The model generates the rest of the path string from this starting point.
This design is important because a trained model always generates this path_start at the beginning of the sequence. Providing it as the initial input avoids wasting decoding steps on this fixed prefix and lets generation focus on the rest of the reaction path.
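For illustration, such a prefix can be built and tokenized with tokenize_path_string; passing add_eos=False keeps the sequence open so the model can continue from it. This is a minimal sketch of the idea, not the exact tensor construction RoutesDataset performs internally:

from directmultistep.utils.dataset import tokenize_path_string

product = "CC(=O)O"  # example product SMILES
path_start = "{'smiles': '" + product + "', 'children': [{'smiles':"

# No <EOS> token is appended, since generation continues from this prefix.
start_tokens = tokenize_path_string(path_start, add_eos=False)
print(start_tokens)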
The prepare_input_tensors function in directmultistep.generate allows a custom path_start string to be supplied. This is useful when you want to start generation from a specific point in the reaction path instead of the default starting point. By modifying the path_start argument, you can control the initial state of the generation and explore different reaction pathways with user-defined intermediates, as in the sketch below.
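A hedged usage sketch, based on the prepare_input_tensors signature documented at the bottom of this page; the target, length limits, and custom prefix are illustrative values, and rds is assumed to be an already-constructed RoutesProcessing object:

from directmultistep.generate import prepare_input_tensors

# Continue generation from a user-chosen intermediate rather than the default prefix.
custom_start = "{'smiles': 'CC(=O)Oc1ccccc1C(=O)O', 'children': [{'smiles': 'CC(=O)O'"
encoder_inp, steps_tens, path_tens = prepare_input_tensors(
    target="CC(=O)Oc1ccccc1C(=O)O",  # aspirin, as an example target
    n_steps=2,
    starting_material=None,
    rds=rds,  # a RoutesProcessing instance, assumed to exist
    product_max_length=145,  # illustrative limits, not library defaults
    sm_max_length=135,
    path_start=custom_start,
)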
Source Code
directmultistep.utils.dataset
RoutesDataset
Bases: RoutesProcessing
Dataset for multi-step reaction routes.
Source code in src/directmultistep/utils/dataset.py
__getitem__(index)
Retrieves an item from the dataset.
Parameters:

Name | Type | Description | Default
---|---|---|---
index | int | The index of the item to retrieve. | required
Returns:

Type | Description
---|---
tuple[Tensor, ...] | A tuple of tensors representing the input and output data.
Source code in src/directmultistep/utils/dataset.py
__init__(metadata_path, products, path_strings, n_steps_list, starting_materials=None, mode='training', name_idx=None)
Initializes the RoutesDataset.
Parameters:

Name | Type | Description | Default
---|---|---|---
metadata_path | Path | Path to the metadata file (YAML). | required
products | list[str] | List of product SMILES strings. | required
path_strings | list[str] | List of reaction path strings. | required
n_steps_list | list[int] | List of integers representing the number of steps in each path. | required
starting_materials | list[str] \| None | List of starting material SMILES strings. | None
mode | str | Either "training" or "generation". | 'training'
name_idx | dict[str, list[int]] \| None | A dictionary mapping names to lists of indices. | None
Source code in src/directmultistep/utils/dataset.py
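A minimal construction sketch based on the parameters above; the metadata path and route data are placeholders, not shipped fixtures:

from pathlib import Path
from directmultistep.utils.dataset import RoutesDataset

ds = RoutesDataset(
    metadata_path=Path("data/metadata.yaml"),  # hypothetical path
    products=["CC"],
    path_strings=["{'smiles':'CC','children':[{'smiles':'CC(=O)O'}]}"],
    n_steps_list=[1],
    starting_materials=None,
    mode="training",
)
print(len(ds))  # 1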
__len__()
Returns the number of routes in the dataset.
__repr__()
Returns a string representation of the dataset.
get_generation_no_sm(index)
Retrieves a generation item without starting materials.
Parameters:

Name | Type | Description | Default
---|---|---|---
index | int | The index of the item to retrieve. | required

Returns:

Type | Description
---|---
tuple[Tensor, ...] | A tuple of tensors: encoder input, step length, and initial path tensor.
Source code in src/directmultistep/utils/dataset.py
get_generation_with_sm(index)
Retrieves a generation item with starting materials.
Parameters:

Name | Type | Description | Default
---|---|---|---
index | int | The index of the item to retrieve. | required

Returns:

Type | Description
---|---
tuple[Tensor, ...] | A tuple of tensors: encoder input, step length, and initial path tensor.
Source code in src/directmultistep/utils/dataset.py
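A retrieval sketch, assuming a dataset constructed with mode="generation" and a starting_materials list (gen_ds is hypothetical, and index 0 is arbitrary):

# gen_ds: a RoutesDataset built with mode="generation" and starting_materials.
encoder_inp, steps, path_tens = gen_ds.get_generation_with_sm(0)
# path_tens already encodes the path_start prefix discussed in "Notes on Path Start".
print(encoder_inp.shape, steps.shape, path_tens.shape)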
get_training_no_sm(index)
Retrieves a training item without starting materials.
Parameters:

Name | Type | Description | Default
---|---|---|---
index | int | The index of the item to retrieve. | required

Returns:

Type | Description
---|---
tuple[Tensor, ...] | A tuple of tensors: encoder input, decoder input, and step length.
Source code in src/directmultistep/utils/dataset.py
get_training_with_sm(index)
Retrieves a training item with starting materials.
Parameters:

Name | Type | Description | Default
---|---|---|---
index | int | The index of the item to retrieve. | required

Returns:

Type | Description
---|---
tuple[Tensor, ...] | A tuple of tensors: encoder input, decoder input, and step length.
Source code in src/directmultistep/utils/dataset.py
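Because the training getters return plain tensor tuples, the dataset drops into a standard PyTorch DataLoader. A sketch reusing the ds constructed above (training mode, no starting materials, so indexing presumably dispatches to get_training_no_sm):

from torch.utils.data import DataLoader

loader = DataLoader(ds, batch_size=32, shuffle=True)
for encoder_inp, decoder_inp, steps in loader:
    # encoder_inp / decoder_inp hold token indices; steps holds route lengths.
    print(encoder_inp.shape, decoder_inp.shape, steps.shape)
    break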
tokenize_smile(smile)
Tokenizes a SMILES string by character.
Parameters:

Name | Type | Description | Default
---|---|---|---
smile | str | The SMILES string to tokenize. | required

Returns:

Type | Description
---|---
list[str] | A list of tokens, including start and end of sequence tokens.
Source code in src/directmultistep/utils/dataset.py
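For example (the exact <SOS>/<EOS> token strings follow the convention mentioned above and are an assumption here):

from directmultistep.utils.dataset import tokenize_smile

print(tokenize_smile("CC(=O)O"))
# Roughly: ['<SOS>', 'C', 'C', '(', '=', 'O', ')', 'O', '<EOS>']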
tokenize_smile_atom(smile, has_atom_types, mask=False)
Tokenizes a SMILES string, considering atom types of up to two characters.
Parameters:

Name | Type | Description | Default
---|---|---|---
smile | str | The SMILES string to tokenize. | required
has_atom_types | list[str] | A list of atom types to consider (e.g., ["Cl", "Br"]). | required
mask | bool | If True, replaces all atom tokens with "J". | False

Returns:

Type | Description
---|---
list[str] | A list of tokens, including start and end of sequence tokens.
Source code in src/directmultistep/utils/dataset.py
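A usage sketch: two-character atoms listed in has_atom_types stay single tokens, and mask=True collapses atom tokens to "J":

from directmultistep.utils.dataset import tokenize_smile_atom

# 'Cl' stays one token because it appears in has_atom_types.
print(tokenize_smile_atom("CCl", has_atom_types=["Cl", "Br"]))

# With mask=True, atom tokens are replaced with "J".
print(tokenize_smile_atom("CCl", has_atom_types=["Cl", "Br"], mask=True))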
tokenize_context(context_list)
Tokenizes a list of context strings.
Parameters:

Name | Type | Description | Default
---|---|---|---
context_list | list[str] | A list of context strings to tokenize. | required

Returns:

Type | Description
---|---
list[str] | A list of tokens, including context start, separator, and end tokens.
Source code in src/directmultistep/utils/dataset.py
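A usage sketch; the context strings below are invented for illustration, and the exact start, separator, and end token strings are whatever the library defines:

from directmultistep.utils.dataset import tokenize_context

tokens = tokenize_context(["context A", "context B"])  # hypothetical context strings
print(tokens)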
tokenize_path_string(path_string, add_sos=True, add_eos=True)
Tokenizes a path string based on a regular expression.
Parameters:

Name | Type | Description | Default
---|---|---|---
path_string | str | The path string to tokenize. | required
add_sos | bool | If True, adds a start of sequence token. | True
add_eos | bool | If True, adds an end of sequence token. | True

Returns:

Type | Description
---|---
list[str] | A list of tokens.
Source code in src/directmultistep/utils/dataset.py
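The add_sos and add_eos flags make this tokenizer usable for both complete routes and open-ended prefixes, as in the path_start discussion above:

from directmultistep.utils.dataset import tokenize_path_string

path_string = "{'smiles':'CC','children':[{'smiles':'CC(=O)O'}]}"

full = tokenize_path_string(path_string)                   # with <SOS> and <EOS>
prefix = tokenize_path_string(path_string, add_eos=False)  # open-ended, for generation prefixes
print(len(full), len(prefix))  # prefix should be one token shorter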
directmultistep.generate
prepare_input_tensors(target, n_steps, starting_material, rds, product_max_length, sm_max_length, use_fp16, path_start)
Prepare input tensors for the model.

Args:
    target: SMILES string of the target molecule.
    n_steps: Number of synthesis steps.
    starting_material: SMILES string of the starting material, if any.
    rds: RoutesProcessing object for tokenization.
    product_max_length: Maximum length of the product SMILES sequence.
    sm_max_length: Maximum length of the starting material SMILES sequence.
    use_fp16: Whether to use half precision (FP16) for tensors.
    path_start: Initial path string to start generation from.

Returns:
    A tuple containing:
    - encoder_inp: Input tensor for the encoder.
    - steps_tens: Tensor of the number of steps, or None if not provided.
    - path_tens: Initial path tensor for the decoder.