Multistep Route Post-processing
This module provides data structure classes and helper functions for post-processing beam search results and multistep routes generated by DirectMultiStep models.
Example Use
The most useful functions are canonicalize_path_dict, canonicalize_path_string, and the functions whose names start with find_.
```python
import ast

from directmultistep.utils.pre_process import stringify_dict
from directmultistep.utils.post_process import canonicalize_path_dict, canonicalize_path_string

path_string = "{'smiles':'CNCc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1','children':[{'smiles':'O=Cc1cc(-c2ccccc2F)n(S(=O)(=O)c2cccnc2)c1','children':[{'smiles':'O=Cc1c[nH]c(-c2ccccc2F)c1'},{'smiles':'O=S(=O)(Cl)c1cccnc1'}]},{'smiles':'CN'}]}"
# Parse the string into a nested dict; ast.literal_eval accepts only Python literals, unlike eval.
cano_path_dict = canonicalize_path_dict(ast.literal_eval(path_string))
cano_path_string = stringify_dict(cano_path_dict)
# Canonicalizing via the dict and via the string should give the same result.
print(cano_path_string == canonicalize_path_string(path_string))
```
Source Code
directmultistep.utils.post_process
count_unsolved_targets(beam_results_NS2)
Counts the number of unsolved targets in a list of beam results.
An unsolved target is a target whose list of paths is empty. This differs from the usual definition, under which a target counts as solved only when it has a route whose starting materials (SMs) are all in a given stock compound set.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
beam_results_NS2 | BeamResultType \| PathsProcessedType | A list of beam results, where each beam result is a list of paths. | required |
Returns:
Type | Description |
---|---|
int | The number of unsolved targets. |
Source code in src/directmultistep/utils/post_process.py
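The definition above can be illustrated with a minimal standalone sketch. It does not call the library; count_empty_results is a hypothetical stand-in that only mirrors the documented rule (a target is unsolved when its list of paths is empty), and the toy data is invented for illustration.

```python
from typing import Sequence


def count_empty_results(beam_results: Sequence[Sequence[object]]) -> int:
    """Hypothetical stand-in: count targets whose list of paths is empty."""
    return sum(1 for paths in beam_results if len(paths) == 0)


# Three targets; the second one has no surviving paths.
toy_results = [
    [("{'smiles':'CCO'}", -0.1)],
    [],
    [("{'smiles':'CN'}", -0.3), ("{'smiles':'CO'}", -0.9)],
]
n_unsolved = count_empty_results(toy_results)
print(n_unsolved)
```

Because the check looks only at emptiness, a target with any remaining path counts as solved here, whether or not its starting materials are in stock.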
find_valid_paths(beam_results_NS2)
Finds valid paths from beam search results.
This function processes beam search results, extracts the path string, canonicalizes the SMILES strings of the reactants, and returns a list of valid paths with canonicalized SMILES.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
beam_results_NS2 | BeamResultType | A list of beam results, where each beam result is a list of (path_string, score) tuples. | required |
Returns:
Type | Description |
---|---|
PathsProcessedType | A list of valid paths, where each path is a tuple of (canonicalized_path_string, list_of_canonicalized_reactant_SMILES). |
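As a rough illustration of the input and output shapes, the sketch below turns (path_string, score) tuples into (path_string, reactant_list) tuples. It rests on several assumptions that the text above does not confirm: a path counts as valid here simply when it parses as a dict literal, the reactants are taken to be the leaf nodes of the tree, and SMILES canonicalization is skipped entirely. The names leaf_smiles and parse_beam_result are hypothetical.

```python
import ast


def leaf_smiles(node: dict) -> list[str]:
    """Collect SMILES of nodes without children (assumed to be the reactants)."""
    children = node.get("children", [])
    if not children:
        return [node["smiles"]]
    leaves: list[str] = []
    for child in children:
        leaves.extend(leaf_smiles(child))
    return leaves


def parse_beam_result(beam_result: list[tuple[str, float]]) -> list[tuple[str, list[str]]]:
    """Keep only path strings that parse; SMILES canonicalization is omitted."""
    valid = []
    for path_string, _score in beam_result:
        try:
            tree = ast.literal_eval(path_string)
        except (ValueError, SyntaxError):
            continue  # truncated or malformed model output
        if not isinstance(tree, dict) or "smiles" not in tree:
            continue
        valid.append((path_string, leaf_smiles(tree)))
    return valid


toy_beam = [
    ("{'smiles':'CCO','children':[{'smiles':'CC=O'},{'smiles':'[H][H]'}]}", -0.2),
    ("{'smiles':'CCO','children':[{'smil", -1.5),  # truncated string
]
parsed = parse_beam_result(toy_beam)
print(parsed)
```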
find_matching_paths(paths_NS2n, correct_paths, ignore_ids=None)
Finds matching paths between predicted paths and correct paths.
This function compares predicted paths with a list of correct paths and returns the rank at which the correct path was found. It also checks for matches after considering all permutations of the predicted path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paths_NS2n | PathsProcessedType | A list of predicted paths, where each path is a list of (path_string, list_of_reactant_SMILES) tuples. | required |
correct_paths | list[str] | A list of correct path strings. | required |
ignore_ids | set[int] \| None | A set of indices to ignore during matching. | None |
Returns:
Type | Description |
---|---|
tuple[MatchList, MatchList] | A tuple of two lists: match_accuracy_N, the ranks at which the correct path was found (None if not found), and perm_match_accuracy_N, the ranks at which the correct path was found after considering permutations (None if not found). |
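The two kinds of match can be sketched for a single target as follows. This is not the library's implementation: instead of enumerating permutations it compares an order-insensitive key built by sorting children, which is a substitute technique with the same effect on this toy data. Ranks are assumed to be 1-based, ignore_ids is left out, and order_free_key and first_match_rank are hypothetical names.

```python
import ast
from typing import Optional


def order_free_key(node: dict) -> tuple:
    """Hashable key that ignores the order of children at every level."""
    child_keys = sorted(order_free_key(c) for c in node.get("children", []))
    return (node["smiles"], tuple(child_keys))


def first_match_rank(
    predicted: list[tuple[str, list[str]]], correct_path: str
) -> tuple[Optional[int], Optional[int]]:
    """Return (exact_rank, order_insensitive_rank); None when there is no match."""
    exact: Optional[int] = None
    perm: Optional[int] = None
    correct_key = order_free_key(ast.literal_eval(correct_path))
    for rank, (path_string, _reactants) in enumerate(predicted, start=1):
        if exact is None and path_string == correct_path:
            exact = rank
        if perm is None and order_free_key(ast.literal_eval(path_string)) == correct_key:
            perm = rank
    return exact, perm


correct = "{'smiles':'CCO','children':[{'smiles':'CC=O'},{'smiles':'[H][H]'}]}"
swapped = "{'smiles':'CCO','children':[{'smiles':'[H][H]'},{'smiles':'CC=O'}]}"
predictions = [(swapped, ["[H][H]", "CC=O"]), (correct, ["CC=O", "[H][H]"])]
ranks = first_match_rank(predictions, correct)
print(ranks)
```

The swapped prediction matches only under the order-insensitive comparison, so the permutation rank can be better than the exact rank but never worse.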
find_top_n_accuracy(match_accuracy, n_vals, dec_digs=1)
Calculates top-n accuracy for a list of match ranks.
This function calculates the fraction of paths that were found within the top-n ranks for a given list of n values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
match_accuracy | MatchList | A list of ranks at which the correct path was found (None if not found). | required |
n_vals | list[int] | A list of n values for which to calculate top-n accuracy. | required |
dec_digs | int | The number of decimal digits to round the accuracy to. | 1 |
Returns:
Type | Description |
---|---|
dict[str, str] | A dictionary mapping "Top n" to the corresponding accuracy fraction (as a string). |
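The calculation can be sketched in a few lines. The version below is an illustration under stated assumptions, not the library code: ranks are taken to be 1-based, the result is formatted as a percentage string, and the key format "Top n" follows the description above. The name top_n_accuracy is hypothetical.

```python
from typing import Optional


def top_n_accuracy(
    match_ranks: list[Optional[int]], n_vals: list[int], dec_digs: int = 1
) -> dict[str, str]:
    """Share of targets whose correct path was ranked within the top n."""
    total = len(match_ranks)
    result: dict[str, str] = {}
    for n in n_vals:
        hits = sum(1 for rank in match_ranks if rank is not None and rank <= n)
        pct = 100 * hits / total if total else 0.0  # percentage is an assumption
        result[f"Top {n}"] = f"{pct:.{dec_digs}f}"
    return result


# Four targets: found at ranks 1, 3 and 2; one not found at all.
accuracy = top_n_accuracy([1, 3, None, 2], n_vals=[1, 3])
print(accuracy)
```

Targets that were never matched (None) stay in the denominator, so they lower the accuracy for every n.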
remove_repetitions_within_beam_result(paths_NS2n)
Removes duplicate paths within each beam result.
This function iterates through each beam result and removes duplicate paths based on their stringified representation after considering all permutations.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paths_NS2n | PathsProcessedType | A list of beam results, where each beam result is a list of (path_string, list_of_reactant_SMILES) tuples. | required |
Returns:
Type | Description |
---|---|
PathsProcessedType | A list of beam results with duplicate paths removed. |
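One way to picture permutation-aware deduplication is sketched below for a single beam result. It is an assumption-laden illustration rather than the library's code: every reordering of children is enumerated with itertools, each variant is stringified with repr, and a path is dropped when any of its variants has been seen before. The names all_child_orderings and drop_permuted_duplicates are hypothetical.

```python
import ast
from itertools import permutations, product
from typing import Iterator


def all_child_orderings(node: dict) -> Iterator[dict]:
    """Yield every tree obtained by reordering children at every level."""
    children = node.get("children", [])
    if not children:
        yield {"smiles": node["smiles"]}
        return
    variants = [list(all_child_orderings(child)) for child in children]
    for combo in product(*variants):
        for ordering in permutations(combo):
            yield {"smiles": node["smiles"], "children": list(ordering)}


def drop_permuted_duplicates(
    beam_result: list[tuple[str, list[str]]],
) -> list[tuple[str, list[str]]]:
    """Keep the first occurrence of each path, treating reorderings as equal."""
    seen: set[str] = set()
    kept = []
    for path_string, reactants in beam_result:
        forms = {repr(tree) for tree in all_child_orderings(ast.literal_eval(path_string))}
        if forms & seen:
            continue  # some reordering of this path was already kept
        seen |= forms
        kept.append((path_string, reactants))
    return kept


a = "{'smiles':'CCO','children':[{'smiles':'CC=O'},{'smiles':'[H][H]'}]}"
a_swapped = "{'smiles':'CCO','children':[{'smiles':'[H][H]'},{'smiles':'CC=O'}]}"
b = "{'smiles':'CCO','children':[{'smiles':'C=C'},{'smiles':'O'}]}"
deduped = drop_permuted_duplicates([(a, []), (a_swapped, []), (b, [])])
print(len(deduped))
```

Enumerating permutations grows factorially with the number of children per node, which is tolerable for the small trees typical of synthesis routes; sorting children into a fixed order would be a cheaper alternative.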
find_paths_with_commercial_sm(paths_NS2n, commercial_stock)
Finds paths that use only commercially available starting materials.
This function filters a list of paths, keeping only those where all reactants are present in the provided commercial stock.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paths_NS2n | PathsProcessedType | A list of beam results, where each beam result is a list of (path_string, list_of_reactant_SMILES) tuples. | required |
commercial_stock | set[str] | A set of SMILES strings representing commercially available starting materials. | required |
Returns:
Type | Description |
---|---|
PathsProcessedType | A list of beam results containing only paths with commercial starting materials. |
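The filtering rule amounts to a set-membership test on every reactant. The sketch below illustrates it with a hypothetical keep_purchasable function and toy data; it assumes the reactant SMILES and the stock use the same canonical form, so that plain string comparison is meaningful.

```python
def keep_purchasable(
    beam_results: list[list[tuple[str, list[str]]]], stock: set[str]
) -> list[list[tuple[str, list[str]]]]:
    """Keep only paths whose reactants are all present in the stock set."""
    return [
        [(path, reactants) for path, reactants in beam_result
         if all(smiles in stock for smiles in reactants)]
        for beam_result in beam_results
    ]


toy_stock = {"CC=O", "[H][H]", "CN"}
toy_paths = [
    [("path_a", ["CC=O", "[H][H]"]), ("path_b", ["CC=O", "C=C"])],
    [("path_c", ["C#N"])],
]
purchasable = keep_purchasable(toy_paths, toy_stock)
print(purchasable)
```

The outer list keeps one entry per target even when every path is filtered out, so an empty inner list marks a target with no purchasable route.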
find_paths_with_correct_product_and_reactants(paths_NS2n, true_products, true_reacs=None)
Finds paths that have the correct product and, optionally, the correct reactants.
This function filters a list of paths, keeping only those where the product SMILES matches the corresponding true product SMILES, and optionally, where at least one of the reactants matches the corresponding true reactant SMILES.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paths_NS2n | PathsProcessedType | A list of beam results, where each beam result is a list of (path_string, list_of_reactant_SMILES) tuples. | required |
true_products | list[str] | A list of SMILES strings representing the correct products. | required |
true_reacs | list[str] \| None | An optional list of SMILES strings representing the correct reactants. | None |
Returns:
Type | Description |
---|---|
PathsProcessedType | A list of beam results containing only paths with the correct product and reactants (if provided). |
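A minimal sketch of this filter is given below, assuming that the product of a path is the SMILES at the root of its tree and that all SMILES are already canonical, so string equality is sufficient. The function name keep_correct and the toy data are hypothetical.

```python
import ast
from typing import Optional


def keep_correct(
    beam_results: list[list[tuple[str, list[str]]]],
    true_products: list[str],
    true_reacs: Optional[list[str]] = None,
) -> list[list[tuple[str, list[str]]]]:
    """Keep paths whose root matches the target and, optionally, whose
    reactants include the expected reactant for that target."""
    filtered = []
    for i, beam_result in enumerate(beam_results):
        kept = []
        for path_string, reactants in beam_result:
            root = ast.literal_eval(path_string)["smiles"]  # assumed to be the product
            if root != true_products[i]:
                continue
            if true_reacs is not None and true_reacs[i] not in reactants:
                continue
            kept.append((path_string, reactants))
        filtered.append(kept)
    return filtered


good = "{'smiles':'CCO','children':[{'smiles':'CC=O'},{'smiles':'[H][H]'}]}"
wrong_product = "{'smiles':'CCN','children':[{'smiles':'CC=O'},{'smiles':'N'}]}"
toy = [[(good, ["CC=O", "[H][H]"]), (wrong_product, ["CC=O", "N"])]]
product_only = keep_correct(toy, ["CCO"])
with_reactant = keep_correct(toy, ["CCO"], ["C=C"])
print(product_only, with_reactant)
```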
canonicalize_path_dict(path_dict)
Canonicalizes a FilteredDict representing a path.
This function recursively canonicalizes the SMILES strings in a FilteredDict and its children.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_dict | FilteredDict | A FilteredDict representing a path. | required |
Returns:
Type | Description |
---|---|
FilteredDict | A FilteredDict with canonicalized SMILES strings. |
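The recursive traversal can be sketched independently of any chemistry toolkit. In the illustration below the canonicalizer is passed in as a callable; toy_canonicalize is a lookup-table placeholder, not a real SMILES canonicalizer, and map_smiles is a hypothetical name. How the library itself canonicalizes SMILES is not stated here and is not assumed.

```python
from typing import Callable


def map_smiles(node: dict, canonicalize: Callable[[str], str]) -> dict:
    """Apply `canonicalize` to the SMILES of a node and, recursively, its children."""
    out: dict = {"smiles": canonicalize(node["smiles"])}
    if "children" in node:
        out["children"] = [map_smiles(child, canonicalize) for child in node["children"]]
    return out


# Placeholder canonicalizer: maps two alternative spellings of ethanol to "CCO".
TOY_CANONICAL = {"OCC": "CCO", "C(C)O": "CCO"}


def toy_canonicalize(smiles: str) -> str:
    return TOY_CANONICAL.get(smiles, smiles)


tree = {"smiles": "OCC", "children": [{"smiles": "CC=O"}, {"smiles": "[H][H]"}]}
canonical_tree = map_smiles(tree, toy_canonicalize)
print(canonical_tree)
```

The function builds a new dict rather than modifying its input, which keeps the original path available for comparison.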
canonicalize_path_string(path_string)
Canonicalizes a path string.
This function converts a path string to a FilteredDict, canonicalizes it, and then converts it back to a string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_string | str | A string representing a path. | required |
Returns:
Type | Description |
---|---|
str | A canonicalized string representation of the path. |
process_paths(paths_NS2n, true_products, true_reacs=None, commercial_stock=None)
Processes a list of paths by canonicalizing, removing repetitions, and filtering.
This function performs a series of processing steps on a list of paths, including canonicalization, removal of repetitions, filtering by commercial availability, and filtering by correct product and reactants.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paths_NS2n | PathsProcessedType | A list of beam results, where each beam result is a list of (path_string, list_of_reactant_SMILES) tuples. | required |
true_products | list[str] | A list of SMILES strings representing the correct products. | required |
true_reacs | list[str] \| None | An optional list of SMILES strings representing the correct reactants. | None |
commercial_stock | set[str] \| None | An optional set of SMILES strings representing commercially available starting materials. | None |
Returns:
Type | Description |
---|---|
tuple[PathsProcessedType, dict[str, int]] | A tuple containing (1) a list of beam results containing only the correct paths and (2) a dictionary containing the number of solved targets at each stage of processing. |
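The idea of recording how many targets remain solved after each stage can be sketched with a generic stage runner. Everything in this example is hypothetical: the stage names, the two toy stages, and the convention that a target is solved when its list of paths is non-empty are illustrative choices, not the keys or steps used by the library.

```python
from typing import Callable

Paths = list[list[tuple[str, list[str]]]]


def run_stages(
    beam_results: Paths, stages: list[tuple[str, Callable[[Paths], Paths]]]
) -> tuple[Paths, dict[str, int]]:
    """Apply each stage in order and record how many targets still have paths."""
    solved: dict[str, int] = {}
    current = beam_results
    for name, stage in stages:
        current = stage(current)
        solved[name] = sum(1 for paths in current if paths)
    return current, solved


def dedupe(results: Paths) -> Paths:
    """Toy stage: drop repeated path strings within each beam result."""
    out = []
    for paths in results:
        seen: set[str] = set()
        out.append([(p, r) for p, r in paths if not (p in seen or seen.add(p))])
    return out


def in_stock(results: Paths) -> Paths:
    """Toy stage: keep paths whose reactants are all in a fixed toy stock."""
    stock = {"x"}
    return [[(p, r) for p, r in paths if all(s in stock for s in r)] for paths in results]


toy = [[("A", ["x"]), ("A", ["x"])], [("B", ["y"])]]
final_paths, solved_counts = run_stages(toy, [("after_dedup", dedupe), ("after_stock", in_stock)])
print(final_paths, solved_counts)
```

Comparing consecutive counts shows which stage eliminates the most targets.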
process_path_single(paths_NS2n, true_products, true_reacs=None, commercial_stock=None)
Processes a list of paths by canonicalizing, removing repetitions, and filtering.
This function performs a series of processing steps on a list of paths, including canonicalization, removal of repetitions, filtering by commercial availability, and filtering by correct product and reactants.
This function is similar to process_paths but does not return the solvability dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paths_NS2n | PathsProcessedType | A list of beam results, where each beam result is a list of (path_string, list_of_reactant_SMILES) tuples. | required |
true_products | list[str] | A list of SMILES strings representing the correct products. | required |
true_reacs | list[str] \| None | An optional list of SMILES strings representing the correct reactants. | None |
commercial_stock | set[str] \| None | An optional set of SMILES strings representing commercially available starting materials. | None |
Returns:
Type | Description |
---|---|
PathsProcessedType | A list of beam results containing only the correct paths. |
process_paths_post(paths_NS2n, true_products, true_reacs, commercial_stock)
Processes a list of paths by removing repetitions, filtering, and canonicalizing.
This function performs a series of processing steps on a list of paths, including removal of repetitions, filtering by commercial availability, filtering by correct product and reactants, and canonicalization.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paths_NS2n | PathsProcessedType | A list of beam results, where each beam result is a list of (path_string, list_of_reactant_SMILES) tuples. | required |
true_products | list[str] | A list of SMILES strings representing the correct products. | required |
true_reacs | list[str] | A list of SMILES strings representing the correct reactants. | required |
commercial_stock | set[str] | A set of SMILES strings representing commercially available starting materials. | required |
Returns:
Type | Description |
---|---|
PathsProcessedType | A list of beam results containing only the correct paths, canonicalized. |
calculate_top_k_counts_by_step_length(match_accuracy, n_steps_list, k_vals)
Calculate accuracy statistics grouped by number of steps.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
match_accuracy | list[int \| None] | List of ranks at which each path was found (None if not found). | required |
n_steps_list | list[int] | List of the number of steps for each path. | required |
k_vals | list[int] | List of k values for which to calculate top-k accuracy. | required |
Returns:
Type | Description |
---|---|
dict[int, dict[str, int]] | Dictionary mapping step count to accuracy statistics. |
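The grouping can be illustrated with a short sketch. The inner key names ("Total" and "Top k") and the 1-based rank convention are assumptions chosen for this example, since the description above does not list the exact statistics; top_k_counts_by_steps is a hypothetical name.

```python
from typing import Optional


def top_k_counts_by_steps(
    match_ranks: list[Optional[int]], n_steps_list: list[int], k_vals: list[int]
) -> dict[int, dict[str, int]]:
    """Count, per route length, how many targets were matched within each top k."""
    stats: dict[int, dict[str, int]] = {}
    for rank, n_steps in zip(match_ranks, n_steps_list):
        # Create the bucket for this step count on first use.
        bucket = stats.setdefault(n_steps, {"Total": 0, **{f"Top {k}": 0 for k in k_vals}})
        bucket["Total"] += 1
        if rank is None:
            continue  # correct path never found for this target
        for k in k_vals:
            if rank <= k:
                bucket[f"Top {k}"] += 1
    return stats


step_stats = top_k_counts_by_steps(
    match_ranks=[1, None, 4, 2], n_steps_list=[2, 2, 3, 3], k_vals=[1, 3]
)
print(step_stats)
```

Returning raw counts rather than fractions lets the caller combine groups or compute accuracies with whatever denominator is appropriate.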