The RDI is using a variety of computational modelling techniques to explore the relation between variations throughout the protease and reverse transcriptase genes and drug susceptibility. The main techniques that are currently being employed are artificial neural networks, random forests and support vector machines. These models are trained using large amounts of data to predict the virological response to combination antiretroviral therapy.
The models are trained using data from large numbers of treatment change episodes (TCEs) from the RDI database. The following input data are provided and the models are trained to predict a single output variable, the follow-up viral load:
- Baseline viral load
- Baseline genotype
- Baseline CD4 count
- Treatment history information
- Drugs in new regimen
- Time to follow-up
Once trained, the models are tested using the input variables from an independent test dataset. The models’ predictions of virological response for these test cases are compared with the actual virological responses in terms of the correlation and the mean absolute difference between them.
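The two comparison measures named above can be sketched as follows. This is a minimal illustration only: the function names and the example values are hypothetical, and the assumption that viral loads are expressed in log10 copies/ml is ours rather than stated in the text.

```python
import math

def pearson_correlation(predicted, actual):
    """Pearson correlation between predicted and actual values."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)

def mean_absolute_difference(predicted, actual):
    """Average size of the gap between each prediction and the observed value."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Hypothetical follow-up viral loads for four test cases
# (assumed here to be in log10 copies/ml).
predicted = [2.0, 3.0, 4.0, 5.0]
actual = [2.5, 2.5, 4.5, 4.5]

r = pearson_correlation(predicted, actual)
mad = mean_absolute_difference(predicted, actual)
print(f"correlation = {r:.3f}, mean absolute difference = {mad:.3f}")
```

A high correlation with a large mean absolute difference would indicate that the model ranks cases well but is systematically off in scale, which is why both measures are reported together.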
More details of computational modelling
Artificial neural networks (ANN)
An ANN model consists of several layers of neural units that are connected from the input layer to the hidden layers and from the hidden layers to the output layer. The relationship between the follow-up viral load and the baseline information is expressed by the weights on the connections between the neural units. The weights are adjusted during the training procedure and their final values are obtained by minimising an error function. In theory, a three-layer network (an input layer, one hidden layer and an output layer) with enough hidden units can approximate any continuous function to arbitrary accuracy. We therefore used neural networks with only one hidden layer. A cross-validation scheme is used in the ANN modelling to assess the accuracy of the models during training and to generate the ANN committee member models.
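The sketch below illustrates the ideas in this paragraph: a network with one hidden layer, weights adjusted to minimise a squared-error function, and a cross-validation loop that yields one committee member per fold. It is not the RDI implementation. The synthetic data, the tanh activation, the plain gradient-descent training, the hyperparameters and the averaging of committee outputs are all assumptions made for illustration.

```python
import math
import random

def init_network(n_inputs, n_hidden, rng):
    """One hidden layer: w1/b1 feed the hidden units, w2/b2 feed the output."""
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(net, x):
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    output = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
    return hidden, output

def train(net, samples, epochs=200, lr=0.05):
    """Per-sample gradient descent on the error 0.5 * (output - target)^2."""
    for _ in range(epochs):
        for x, y in samples:
            hidden, out = forward(net, x)
            err = out - y
            for j, h in enumerate(hidden):
                # Back-propagate through tanh before updating w2[j].
                grad_h = err * net["w2"][j] * (1.0 - h * h)
                net["w2"][j] -= lr * err * h
                for i, xi in enumerate(x):
                    net["w1"][j][i] -= lr * grad_h * xi
                net["b1"][j] -= lr * grad_h
            net["b2"] -= lr * err
    return net

def mean_squared_error(net, samples):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in samples) / len(samples)

def train_committee(samples, k=5, n_hidden=4, seed=0):
    """k-fold cross-validation: each fold trains one committee member."""
    rng = random.Random(seed)
    committee, val_errors = [], []
    for fold in range(k):
        val = samples[fold::k]
        trn = [s for i, s in enumerate(samples) if i % k != fold]
        net = train(init_network(len(samples[0][0]), n_hidden, rng), trn)
        committee.append(net)
        val_errors.append(mean_squared_error(net, val))
    return committee, val_errors

def committee_predict(committee, x):
    """Assumed aggregation rule: the mean of the members' outputs."""
    return sum(forward(net, x)[1] for net in committee) / len(committee)

# Synthetic stand-in data: two inputs and a made-up linear target.
data_rng = random.Random(1)
samples = []
for _ in range(40):
    x = [data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)]
    samples.append((x, 0.8 * x[0] - 0.5 * x[1]))

committee, val_errors = train_committee(samples)
prediction = committee_predict(committee, [0.5, -0.5])
print("validation MSE per fold:", [round(e, 4) for e in val_errors])
print("committee prediction:", round(prediction, 3))
```

The per-fold validation errors give the accuracy estimate during training, while the fold models themselves are kept as the committee.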
Random Forests (RF)
An RF model consists of an ensemble of individual trees. The individual trees are built using different sets of samples drawn from the original training dataset. At each node of a tree, the splitting feature is selected from a randomly chosen subset of the features. There is no need for cross-validation in RF modelling because the training dataset for each individual tree is built by bootstrap replication. This leaves about one-third of the samples out of each bootstrap sample, and these left-out samples can be used for validation. The outputs of all trees are aggregated to produce a final prediction.
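The "about one-third" figure follows from sampling with replacement: the chance that a given sample is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e (roughly 0.37) for large n. The short simulation below checks this. It covers only the bootstrap step, not tree building, and the dataset size, number of trees and function name are arbitrary choices for illustration.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, out_of_bag

rng = random.Random(42)
n_samples, n_trees = 1000, 50  # arbitrary illustrative sizes

# One bootstrap sample per tree; record the share of samples left out.
oob_fractions = [len(bootstrap_split(n_samples, rng)[1]) / n_samples
                 for _ in range(n_trees)]
mean_oob_fraction = sum(oob_fractions) / n_trees
expected_fraction = (1 - 1 / n_samples) ** n_samples

print(f"simulated out-of-bag fraction: {mean_oob_fraction:.3f}")
print(f"theoretical (1 - 1/n)^n:       {expected_fraction:.3f}")
```

Because each tree never sees its own out-of-bag samples, predicting those samples with that tree gives a validation estimate without a separate hold-out set.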
Support Vector Machines (SVM)
The principle of SVM is to map the data into a high-dimensional feature space and perform linear regression in this space. SVM searches for a global solution and does not control model complexity by keeping the number of input variables small. It is thought to be more resistant to ‘over-fitting’ to the training data set and hence potentially more generalisable to new data. The drawbacks of SVM are its high algorithmic complexity and the length of time taken for training.
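The idea of regressing linearly in a feature space can be sketched with a deliberately simplified stand-in. The example below uses an explicit two-term feature map and subgradient descent on a regularised epsilon-insensitive loss, rather than the kernel functions and quadratic-programming solvers of standard SVM implementations. The feature map, the data and all parameter values are hypothetical.

```python
def feature_map(x):
    # Hypothetical explicit feature map: the raw value plus its square.
    return [x, x * x]

def eps_insensitive_loss(residual, eps):
    """Errors smaller than eps cost nothing; larger ones cost linearly."""
    return max(0.0, abs(residual) - eps)

def predict(w, b, x):
    return sum(wi * fi for wi, fi in zip(w, feature_map(x))) + b

def fit_linear_svr(xs, ys, eps=0.05, lam=0.001, lr0=0.2, n_iter=5000):
    """Subgradient descent on lam/2 * |w|^2 + mean epsilon-insensitive loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(xs)
    for t in range(n_iter):
        lr = lr0 / (t + 1) ** 0.5  # decaying step size
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(xs, ys):
            residual = predict(w, b, x) - y
            if abs(residual) > eps:  # only points outside the tube contribute
                sign = 1.0 if residual > 0 else -1.0
                for j, fj in enumerate(feature_map(x)):
                    grad_w[j] += sign * fj / n
                grad_b += sign / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# A target that is non-linear in x but linear in the feature space.
xs = [(i - 10) / 10 for i in range(21)]
ys = [x * x for x in xs]

w, b = fit_linear_svr(xs, ys)
mae = sum(abs(predict(w, b, x) - y) for x, y in zip(xs, ys)) / len(xs)
print("weights:", [round(wi, 3) for wi in w], "bias:", round(b, 3))
print("mean absolute error:", round(mae, 3))
```

The objective is convex, so there is a single global minimum to search for, and complexity is limited by the penalty on the weights rather than by reducing the number of inputs.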