Multistep Route Pre-processing
This module provides useful data structure classes and helper functions for preprocessing multistep routes for training and testing DirectMultiStep models.
Example Use
The most frequently used data structure is FilteredDict, a dictionary format for multistep routes used in DirectMultiStep models. Several useful functions are available, such as canonicalize_smiles, max_tree_depth, find_leaves, stringify_dict, and generate_permutations, among others. For example:
import ast
from directmultistep.utils.pre_process import stringify_dict

path_string = "{'smiles':'CNCc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1','children':[{'smiles':'O=Cc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1','children':[{'smiles':'O=Cc1c[nH]c(-c2ccccc2F)c1'},{'smiles':'O=S(=O)(Cl)c1cccnc1'}]},{'smiles':'CN'}]}"
# Parse the string into a FilteredDict; ast.literal_eval only accepts literals, unlike eval.
route = ast.literal_eval(path_string)
# This should print True: stringifying the parsed FilteredDict reproduces the original string.
print(stringify_dict(route) == path_string)
Source Code
directmultistep.utils.pre_process
PaRoutesDict = dict[str, str | bool | list['PaRoutesDict']]
module-attribute
FilteredDict
Bases: TypedDict
A dictionary format for multistep routes, used in DirectMultiStep models.
This dictionary is designed to represent a node in a synthetic route tree.
It contains the SMILES string of a molecule and a list of its child nodes.
To get its string format, use stringify_dict.
Attributes:
Name | Type | Description |
---|---|---|
smiles | str | SMILES string of the molecule. |
children | list[FilteredDict] | List of child nodes, each a FilteredDict. |
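To make the shape of a route node concrete, the following is a minimal sketch of a comparable TypedDict. The class name RouteNode is hypothetical, and total=False is an assumption made so that leaf nodes can omit the children key, as they do in the example route above; the library's actual definition may differ.

```python
from typing import TypedDict


class RouteNode(TypedDict, total=False):
    """Hypothetical stand-in for FilteredDict (not the library's definition)."""

    smiles: str
    children: list["RouteNode"]


# The example route from this page: leaf nodes simply omit 'children'.
route: RouteNode = {
    "smiles": "CNCc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1",
    "children": [
        {
            "smiles": "O=Cc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1",
            "children": [
                {"smiles": "O=Cc1c[nH]c(-c2ccccc2F)c1"},
                {"smiles": "O=S(=O)(Cl)c1cccnc1"},
            ],
        },
        {"smiles": "CN"},
    ],
}
```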
Source code in src/directmultistep/utils/pre_process.py
filter_mol_nodes(node)
Filters a PaRoutes dictionary to keep only 'smiles' and 'children' keys.
This function removes extra information like 'metadata', 'rsmi', and 'reaction_hash', keeping only the 'smiles' and 'children' keys. It also canonicalizes the SMILES string using RDKit.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
node | PaRoutesDict | A dictionary representing a node in a PaRoutes data structure. | required |
Returns:
Type | Description |
---|---|
FilteredDict | A FilteredDict containing the canonicalized SMILES and filtered children. |
Raises:
Type | Description |
---|---|
ValueError | If the 'type' of the node is not 'mol' or if 'children' is not a list. |
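The filtering described above could look roughly like the sketch below. It is not the library's implementation. It assumes that every child of a molecule node is a reaction node whose own children are the reactant molecules, and it takes a hypothetical canonicalize callable (defaulting to the identity) in place of the RDKit canonicalization so that the example runs without RDKit. The toy input is illustrative only.

```python
from typing import Any, Callable


def sketch_filter_mol_nodes(
    node: dict[str, Any],
    canonicalize: Callable[[str], str] = lambda s: s,  # hypothetical hook
) -> dict[str, Any]:
    """Illustrative only: keep 'smiles' and 'children', drop everything else."""
    if node.get("type") != "mol":
        raise ValueError(f"Expected a 'mol' node, got {node.get('type')!r}")
    filtered: dict[str, Any] = {"smiles": canonicalize(node["smiles"])}
    if "children" in node:
        reactions = node["children"]
        if not isinstance(reactions, list):
            raise ValueError("'children' must be a list")
        # ASSUMPTION: molecule -> reaction -> molecule nesting; the reaction
        # layer (and its metadata) is dropped, its reactants are kept.
        reactants = [
            sketch_filter_mol_nodes(mol, canonicalize)
            for reaction in reactions
            for mol in reaction.get("children", [])
        ]
        if reactants:
            filtered["children"] = reactants
    return filtered


# Toy input with an assumed key layout.
raw = {
    "type": "mol",
    "smiles": "CCOC(C)=O",
    "in_stock": False,
    "children": [
        {
            "type": "reaction",
            "metadata": {},
            "children": [
                {"type": "mol", "smiles": "CC(=O)O", "in_stock": True},
                {"type": "mol", "smiles": "CCO", "in_stock": True},
            ],
        }
    ],
}
filtered = sketch_filter_mol_nodes(raw)
```

In this sketch, leaf nodes come out without a children key, which matches the example route shown under Example Use.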
max_tree_depth(node)
Calculates the maximum depth of a synthetic route tree.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
node | FilteredDict | A FilteredDict representing a node in the route tree. | required |
Returns:
Type | Description |
---|---|
int | The maximum depth of the tree. Returns 0 for a leaf node. |
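As an illustration of the depth convention stated above (a leaf has depth 0), here is a small recursive sketch. The function name sketch_max_tree_depth is hypothetical and the code is not taken from the library.

```python
from typing import Any


def sketch_max_tree_depth(node: dict[str, Any]) -> int:
    """Illustrative only: 0 for a leaf, else 1 + the deepest child."""
    children = node.get("children", [])
    if not children:
        return 0
    return 1 + max(sketch_max_tree_depth(child) for child in children)


route = {
    "smiles": "CNCc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1",
    "children": [
        {
            "smiles": "O=Cc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1",
            "children": [
                {"smiles": "O=Cc1c[nH]c(-c2ccccc2F)c1"},
                {"smiles": "O=S(=O)(Cl)c1cccnc1"},
            ],
        },
        {"smiles": "CN"},
    ],
}
depth = sketch_max_tree_depth(route)  # 2 under this convention
```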
find_leaves(node)
Finds the SMILES strings of all leaf nodes (starting materials) in a route tree.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
node | FilteredDict | A FilteredDict representing a node in the route tree. | required |
Returns:
Type | Description |
---|---|
list[str] | A list of SMILES strings representing the starting materials. |
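Collecting starting materials amounts to a tree traversal that keeps only nodes without children. The sketch below is one way to do it; the name sketch_find_leaves is hypothetical, and the depth-first, left-to-right ordering is an assumption rather than documented behavior.

```python
from typing import Any


def sketch_find_leaves(node: dict[str, Any]) -> list[str]:
    """Illustrative only: SMILES of all nodes that have no children."""
    children = node.get("children", [])
    if not children:
        return [node["smiles"]]
    leaves: list[str] = []
    for child in children:
        leaves.extend(sketch_find_leaves(child))
    return leaves


route = {
    "smiles": "CNCc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1",
    "children": [
        {
            "smiles": "O=Cc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1",
            "children": [
                {"smiles": "O=Cc1c[nH]c(-c2ccccc2F)c1"},
                {"smiles": "O=S(=O)(Cl)c1cccnc1"},
            ],
        },
        {"smiles": "CN"},
    ],
}
starting_materials = sketch_find_leaves(route)
```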
canonicalize_smiles(smiles)
Canonicalizes a SMILES string using RDKit.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | The SMILES string to canonicalize. | required |
Returns:
Type | Description |
---|---|
str | The canonicalized SMILES string. |
Raises:
Type | Description |
---|---|
ValueError | If the SMILES string cannot be parsed by RDKit. |
stringify_dict(data)
Converts a FilteredDict to a string, removing spaces.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | FilteredDict | The FilteredDict to convert. | required |
Returns:
Type | Description |
---|---|
str | A string representation of the FilteredDict with no spaces. |
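One way to obtain such a space-free string is to take Python's built-in dictionary representation and strip the spaces, which works because SMILES strings themselves contain no spaces. The sketch below assumes this approach; the library may do it differently, and sketch_stringify is a hypothetical name.

```python
import ast
from typing import Any


def sketch_stringify(data: dict[str, Any]) -> str:
    """Illustrative only: dict repr with all spaces removed."""
    return str(data).replace(" ", "")


route = {
    "smiles": "CCOC(C)=O",
    "children": [{"smiles": "CC(=O)O"}, {"smiles": "CCO"}],
}
route_string = sketch_stringify(route)
# "{'smiles':'CCOC(C)=O','children':[{'smiles':'CC(=O)O'},{'smiles':'CCO'}]}"

# The string is still a valid Python literal, so it can be parsed back.
round_tripped = ast.literal_eval(route_string)
```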
generate_permutations(data, max_perm=None)
Generates permutations of a synthetic route by permuting the order of children.
This function generates all possible permutations of a synthetic route by rearranging the order of child nodes at each level of the tree. It can optionally limit the number of permutations generated.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | FilteredDict | A FilteredDict representing the synthetic route. | required |
max_perm | int \| None | An optional integer to limit the number of permutations generated. | None |
Returns:
Type | Description |
---|---|
list[str] | A list of stringified FilteredDicts representing the permuted routes. |
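The idea of permuting children at every level can be sketched with itertools: for each node, combine every ordering of its children with every ordering of each child's subtree. The code below is a hedged illustration, not the library's implementation. The helper names are hypothetical, the output order is an assumption, and the sketch does not deduplicate routes whose children are identical.

```python
from itertools import islice, permutations, product
from typing import Any, Iterator, Optional


def _iter_orderings(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every reordering of children at every level below `node`."""
    children = node.get("children", [])
    if not children:
        yield {"smiles": node["smiles"]}
        return
    # All orderings of each child's own subtree, computed once per child.
    child_variants = [list(_iter_orderings(child)) for child in children]
    for order in permutations(range(len(children))):
        for combo in product(*(child_variants[i] for i in order)):
            yield {"smiles": node["smiles"], "children": list(combo)}


def sketch_generate_permutations(
    data: dict[str, Any], max_perm: Optional[int] = None
) -> list[str]:
    """Illustrative only: stringified orderings, optionally capped at max_perm."""
    variants = islice(_iter_orderings(data), max_perm)
    return [str(v).replace(" ", "") for v in variants]


route = {
    "smiles": "CNCc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1",
    "children": [
        {
            "smiles": "O=Cc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1",
            "children": [
                {"smiles": "O=Cc1c[nH]c(-c2ccccc2F)c1"},
                {"smiles": "O=S(=O)(Cl)c1cccnc1"},
            ],
        },
        {"smiles": "CN"},
    ],
}
all_perms = sketch_generate_permutations(route)
capped = sketch_generate_permutations(route, max_perm=3)
```

For this route the sketch yields four strings: two orderings at the root multiplied by two orderings of the intermediate's reactants.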
is_convergent(route)
Determines if a synthesis route is convergent (non-linear).
A route is linear if for every transformation, at most one reactant has children (i.e., all other reactants are leaf nodes). A route is convergent if there exists at least one transformation where two or more reactants have children.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
route | FilteredDict | The synthesis route to analyze. | required |
Returns:
Type | Description |
---|---|
bool | True if the route is convergent (non-linear), False if it's linear. |
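The definition above translates directly into a recursive check: a route is convergent as soon as any node has two or more children that themselves have children. The following sketch implements that reading; sketch_is_convergent is a hypothetical name, and the second route is a toy example whose chemistry is only illustrative.

```python
from typing import Any


def sketch_is_convergent(route: dict[str, Any]) -> bool:
    """Illustrative only: True if some step has >= 2 non-leaf reactants."""
    children = route.get("children", [])
    non_leaf = sum(1 for child in children if child.get("children"))
    if non_leaf >= 2:
        return True
    return any(sketch_is_convergent(child) for child in children)


# Linear: at every step, at most one reactant has children of its own.
linear_route = {
    "smiles": "CNCc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1",
    "children": [
        {
            "smiles": "O=Cc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1",
            "children": [
                {"smiles": "O=Cc1c[nH]c(-c2ccccc2F)c1"},
                {"smiles": "O=S(=O)(Cl)c1cccnc1"},
            ],
        },
        {"smiles": "CN"},
    ],
}

# Convergent (toy example): both reactants of the final step are intermediates.
convergent_route = {
    "smiles": "CCOC(C)=O",
    "children": [
        {"smiles": "CC(=O)O", "children": [{"smiles": "CC=O"}]},
        {"smiles": "CCO", "children": [{"smiles": "C=C"}]},
    ],
}

linear_result = sketch_is_convergent(linear_route)
convergent_result = sketch_is_convergent(convergent_route)
```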