Abstract
| - Decision trees have been used extensively in cheminformatics for modeling various biochemical endpoints, including receptor-ligand binding, ADME properties, environmental impact, and toxicity. The traditional approach to inducing decision trees from a given training set is recursive partitioning, which selects partitioning variables and their values in a greedy manner to optimize a given measure of purity. This methodology has numerous benefits, including classifier interpretability and the capability of modeling nonlinear relationships. The greedy nature of induction, however, may fail to elucidate underlying relationships between the data and endpoints. Using evolutionary programming, decision trees are induced that are significantly more accurate than trees induced by recursive partitioning. Furthermore, when assessed by 10-fold cross-validation, evolutionary programming induced trees exhibit significantly higher accuracy on previously unseen data. This methodology is compared to single-tree and multiple-tree recursive partitioning in two domains (aerobic biodegradability and hepatotoxicity) and shown to produce less complex classifiers with average increases in predictive accuracy of 5–10% over the traditional method.
|
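To make the greedy selection step concrete, the following is a minimal sketch of how recursive partitioning might pick a single split. It is an illustration under stated assumptions, not the procedure used in this work: the purity measure is assumed to be Gini impurity (the abstract does not name one), and the function names `gini` and `best_greedy_split` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed purity measure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_greedy_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) for the single
    split 'feature <= threshold' that minimizes weighted child impurity.

    Hypothetical helper: it looks only one split ahead, which is what makes
    the procedure greedy. Returns None if no split separates the rows.
    """
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue  # a split must send rows to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


# A cleanly separable toy problem: one split yields pure children.
separable = best_greedy_split([(1,), (2,), (3,), (4,)], [0, 0, 1, 1])

# An XOR-like toy problem: no single split improves on the root impurity,
# so a one-step-ahead search sees no useful variable even though two
# successive splits would classify every row correctly.
xor_rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [0, 1, 1, 0]
xor = best_greedy_split(xor_rows, xor_labels)
root_impurity = gini(xor_labels)
```

In the XOR-like case the best weighted child impurity equals the root impurity, which is one simple way a greedy search can miss a relationship that a search over whole trees could find.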