This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Md Masud Rana, Department of Mathematics, University of Kentucky;
(2) Duc Duy Nguyen, Department of Mathematics, University of Kentucky & ducnguyen@uky.edu.
Conclusion, Data and Software Availability, Competing interests, Acknowledgments & References
Graph theory provides a mathematical framework that is widely applied in the study of biomolecules such as proteins, DNA, and RNA. For a biomolecule, a graph G(V, E) is a collection of nodes V and edges E that can represent the connectivity and relationships between different atoms or residues within the molecule.
A refinement to this representation is graph coloring, a technique that assigns unique labels to different atom types within the biomolecule. This enriched, colored graph encodes diverse atomic interactions, paving the way for a collective and coarse-grained description of the dataset. In this representation, atoms with assigned labels are organized into subgraphs, and the colored edges between them represent atom-specific interactions.
The advantage of using subgraphs lies in their ability to focus on specific regions or components of the biomolecule. By isolating relevant subsets of atoms, subgraphs allow us to identify localized patterns, interactions, or clusters that might not be evident in the global graph representation. This targeted approach provides a more nuanced understanding of the structural and functional properties of biomolecules.
To extract atom-level interaction information, we consider specific atom types based on their names in the PDB structure such as carbon alpha (CA), carbon beta (CB), carbon delta-1 (CD1), etc. These atom names serve as identifiers for specific positions within a protein’s three-dimensional structure. They help define the individual atoms that constitute
amino acids, the building blocks of proteins, and provide crucial information about their spatial orientation and chemical properties. We consider a total of 37 distinct atom names that are frequently found in protein structures within the PDB database. These atom types are represented by the set A.
To simplify the notation, we assume the set A is sorted in alphanumeric order,
A = {C, CA, CB, · · · , N, ND1, ND2, · · · , O, OD1, · · · , SD, SG},
serves as a measure of the combined strength of interaction between chosen pairs of atom types, providing valuable insights into the molecular structure and properties.
In the context of studying protein-protein interactions (PPIs) and predicting the effects of mutations on these interactions, it is important to focus on the relevant regions where the interactions occur.
While protein-protein complexes can consist of a large number of atoms, the interactions between proteins primarily take place at specific regions known as interfaces. To streamline computational costs and concentrate on pertinent information, it is common practice to consider only the protein atoms near the binding sites.
The binding site, in this context, refers to the region within a certain cutoff distance c from the chain where the mutation occurred. By defining the binding site in this way, we can narrow our focus to the specific area where the interaction and subsequent effects of the mutation are most pronounced.
Furthermore, when analyzing the effects of mutations, it is crucial to incorporate geometric graph information from the mutation sites and their neighboring regions. The mutation site is defined as the region within a cutoff distance c from the mutated residue, allowing us to capture the structural changes resulting from the mutation.
To construct a site-specific multiscale weighted colored geometric subgraph (MWCGS) representation for a PPI, both the wild-type and mutant-type proteins are considered. This leads to four sets of features for each PPI, corresponding to the two sites and the two types of proteins involved. Each set consists of 37 × 37 = 1369 MWCGS features, representing the interactions between the atom types involved in the PPI.
These features encompass diverse chemical and biological properties, such as the presence of specific interatomic interactions involving oxygen and nitrogen atoms, the hydrophobic nature of certain regions, and the ability of atoms to undergo polarization, among other relevant molecular characteristics.
By utilizing these site-specific MWCGS features, we can uncover valuable insights into the effects of mutations and the underlying molecular interactions, revealing significant information and characteristics embedded within the PPI system.
Accurately predicting the changes in binding affinity induced by mutations in protein-protein complexes poses a significant challenge due to the complex nature of these systems. The interactions between proteins are highly intricate, and the effects of mutations can be subtle and context-dependent.
Machine learning techniques offer a promising approach to tackle this problem by leveraging the power of data-driven models to capture complex patterns and relationships.
Machine learning algorithms can aid in predicting mutation-induced binding affinity changes by learning from a set of training examples that consist of protein-protein complexes with known experimental binding affinities.
These algorithms can analyze the features extracted from the complexes, such as geometric graph information, to identify relevant patterns and associations between the features and the binding affinities. By learning from these patterns, the algorithms can generalize and make predictions on unseen protein-protein complexes.
There are several machine learning algorithms that can be used in combination with geometric graph features to predict binding affinity changes. These algorithms include random forests [49], support vector machines (SVM) [50], neural networks [51], and gradient boosting trees (GBT) [52]. Each algorithm has its strengths and weaknesses, and their performance can vary depending on the specific problem and dataset.
Among these algorithms, gradient boosting trees (GBT) have gained significant popularity in recent years [39]. GBT is an ensemble method that builds a sequence of weak learners, typically decision trees, to correct the errors made by the previous learners.
By combining these weak learners, GBT can effectively model complex relationships and improve prediction accuracy. One advantage of GBT is its robustness against overfitting, which is especially beneficial when dealing with a moderate number of features.
Additionally, GBT models can provide interpretability, allowing us to gain insights into the factors contributing to the binding affinity changes.
The implementation of the GBT algorithm in this study utilized the scikit-learn package (v 0.24.1). To optimize the performance of the GBT model for ensemble methods, specific hyperparameters were fine-tuned.
The number of estimators was set to 40000, indicating the number of weak learners in the ensemble, while the learning rate was set to 0.001, determining the contribution of each weak learner to the final prediction.
Given the large number of features involved in the prediction task, an efficient training process was achieved by limiting the maximum number of features considered to the square root of the descriptor length. This approach helped expedite the training process without compromising the overall performance of the GBT model.
To ensure reliable performance evaluation, fifty runs were performed for each feature set, employing different random seeds. By averaging the results obtained from these runs, a more robust and representative performance measure was obtained.
Despite the complexity of the prediction task and the involvement of numerous features, the selected parameter settings and multiple runs yielded satisfactory performance results.
The GBT approach was chosen for its ability to effectively handle overfitting, exhibit good performance with moderately sized datasets, and provide interpretable models.
These characteristics make GBT a suitable and reliable choice for this study, enabling accurate predictions of mutation-induced binding affinity changes in protein-protein complexes using the provided geometric graph features.
This paper is available on Arxiv under CC 4.0 license.