Overview


  Modeling process

1. Prepare the datasets

This server predicts eight commonly used basic chemical properties. To predict these properties, we searched public databases and scientific papers for as many data sources as possible. We collected eight datasets and then employed several advanced QSAR methodologies to estimate these properties.


1.1 Density

Density is defined as the mass per unit volume. The data set for this endpoint was obtained from the density data contained in LookChem (Lookchem.com 2011). The data set was restricted to chemicals with boiling points greater than 25°C (or with no boiling point available). It was further restricted to chemicals with densities greater than 0.5 and less than 5 g/cm3. The final data set consisted of 8909 chemicals. The data in lookchem.com are not peer reviewed, but the set is very large and thus provides a large degree of structural diversity. The modeled property was the density in g/cm3.
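The two density filters can be sketched as a small predicate. This is an illustration only, not the code used to build the data set: the function name, the record layout and the example values are hypothetical, and a missing boiling point is represented here as None.

```python
from typing import Optional


def keep_density_record(density_g_cm3: float, boiling_point_c: Optional[float]) -> bool:
    """Return True if a record passes both filters described in the text."""
    # Keep chemicals whose boiling point is above 25 °C or is unavailable (None).
    boiling_ok = boiling_point_c is None or boiling_point_c > 25.0
    # Keep densities strictly between 0.5 and 5 g/cm3.
    density_ok = 0.5 < density_g_cm3 < 5.0
    return boiling_ok and density_ok


# Hypothetical records: (name, density in g/cm3, boiling point in °C or None).
records = [
    ("liquid A", 0.79, 78.4),   # kept
    ("gas B", 0.60, -42.0),     # removed: boiling point not above 25 °C
    ("solid C", 2.10, None),    # kept: boiling point unavailable
    ("heavy D", 7.90, None),    # removed: density not below 5 g/cm3
]
kept = [name for name, density, bp in records if keep_density_record(density, bp)]
```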


1.2 Flash point

The flash point is defined as the lowest temperature at which a chemical can vaporize to form an ignitable mixture in air. A dataset of 8362 chemicals was also compiled from lookchem.com (Lookchem.com 2011). Chemicals with flash points greater than 1000°C were omitted from the data set. The modeled property was the flash point in °C.


1.3 Viscosity

Viscosity is a measure of the resistance of a fluid to flow, expressed in cP (defined as the proportionality constant between shear rate and shear stress). The viscosity at 25°C for 557 chemicals was obtained from the data compilations of Viswanath and Riddick (Viswanath 1989; Riddick et al. 1986).


1.4 Surface tension

Surface tension is a property of the surface of a liquid that allows it to resist an external force. The surface tension at 25°C for 1416 chemicals was obtained from the data compilation of Jasper (Jasper 1972). The experimental values (at 25°C) are estimated using an empirical correlation fitted to the experimental data from Jasper: surface tension = A - BT, where A and B are fitted constants and T is the temperature. The estimated experimental surface tension value is used only if the closest experimental data point is within 10°C of 25°C. The modeled property was the surface tension in dyn/cm.
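The estimation rule for surface tension at 25°C can be sketched as follows. This is a minimal sketch under stated assumptions: T is taken to be in °C, "within 10°C" is treated as inclusive, and the function name and the coefficients in the usage check are made up for illustration.

```python
from typing import Optional, Sequence


def surface_tension_at_25c(
    a: float, b: float, measured_temps_c: Sequence[float]
) -> Optional[float]:
    """Estimate surface tension (dyn/cm) at 25 °C from the fit st = A - B*T.

    Returns None when the closest measured temperature is more than
    10 °C away from 25 °C, mirroring the acceptance rule in the text.
    """
    target_c = 25.0
    if not measured_temps_c:
        return None
    closest_gap = min(abs(t - target_c) for t in measured_temps_c)
    if closest_gap > 10.0:  # assumption: exactly 10 °C away still counts as "within"
        return None
    return a - b * target_c
```

With the made-up coefficients A = 24.05 and B = 0.0832 and measurements at 20°C and 30°C, the function returns about 21.97 dyn/cm. With measurements only at 50°C and 60°C it returns None, so the chemical would not receive a 25°C value.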


1.5 Vapor pressure

Vapor pressure is defined as the pressure of a vapor in mmHg in thermodynamic equilibrium with its condensed phases in a closed system. The vapor pressure at 25°C for 2511 chemicals was obtained from the database in EPI Suite (USEPA 2009). The modeled property was Log10(vapor pressure mmHg).


1.6 Melting point

The melting point is the temperature in °C at which a chemical in the solid state changes to a liquid state. The melting point for 9385 chemicals was obtained from the database in EPI Suite (USEPA 2009). The modeled property was the melting point in °C.


1.7 Water solubility

Water solubility is defined as the amount of a chemical that will dissolve in liquid water to form a homogeneous solution. A dataset of 5020 chemicals was compiled from the database in EPI Suite (USEPA 2009). Chemicals with water solubilities exceeding 1,000,000 mg/L were omitted from the overall dataset. In addition, the data were limited to data points within 10°C of 25°C. Water solubility is an important property because the predicted LC50 values for aquatic species can sometimes exceed the water solubility. The modeled property was -Log10(water solubility mol/L).
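Because the source data are filtered in mg/L while the modeled property is in mol/L, a unit conversion through the molar mass is implied. The sketch below shows one way to apply the filters and the transform. The function names are hypothetical, and the source does not state how the conversion was actually carried out.

```python
import math


def passes_solubility_filters(solubility_mg_per_l: float, temperature_c: float) -> bool:
    """Apply the two filters from the text: solubility cap and temperature window."""
    not_too_soluble = solubility_mg_per_l <= 1_000_000.0
    near_25c = abs(temperature_c - 25.0) <= 10.0  # assumption: inclusive window
    return not_too_soluble and near_25c


def neg_log10_molar_solubility(
    solubility_mg_per_l: float, molar_mass_g_per_mol: float
) -> float:
    """Convert mg/L to mol/L using the molar mass, then return -log10 of it."""
    if solubility_mg_per_l <= 0 or molar_mass_g_per_mol <= 0:
        raise ValueError("solubility and molar mass must be positive")
    mol_per_l = solubility_mg_per_l / 1000.0 / molar_mass_g_per_mol
    return -math.log10(mol_per_l)
```

For example, a chemical with a molar mass of 100 g/mol and a solubility of 1000 mg/L has a molar solubility of 0.01 mol/L, so the modeled value is 2.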


1.8 Normal boiling point

The normal boiling point is defined as the temperature at which a chemical boils at atmospheric pressure. The data set for this endpoint was obtained from the boiling point data contained in EPI Suite (USEPA 2009). Forty-one chemicals were removed from the data set because they had previously been shown to be badly predicted and had experimental values that differed significantly (>50 K) from other sources such as NIST (NIST 2010) and LookChem (Lookchem.com 2011). The final data set contained 5759 chemicals. The modeled property was the boiling point in °C.


1.9 TPSA, MR and LogP

Molecular polar surface area (PSA), i.e., the surface belonging to polar atoms, is a descriptor that has been shown to correlate well with passive molecular transport through membranes and therefore allows prediction of the transport properties of drugs. Topological polar surface area (TPSA) is an approach that allows fast calculation of the PSA (J. Med. Chem., 2000, 43 (20), pp 3714–3717). MR is short for the Wildman-Crippen molar refractivity value, and LogP represents the partition coefficient of a molecule (J. Chem. Inf. Comput. Sci., 1999, 39 (5), pp 868–873). These three properties were calculated using the RDKit package.
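For readers who want to reproduce these three values, the following sketch uses RDKit functions that compute TPSA and the Wildman-Crippen MR and LogP. It is not necessarily the exact code path used by the server: the wrapper name rdkit_properties is hypothetical, and the guarded import only lets the snippet load on machines without RDKit.

```python
try:
    from rdkit import Chem
    from rdkit.Chem import Crippen, rdMolDescriptors
except ImportError:  # RDKit not installed; rdkit_properties() will raise
    Chem = None


def rdkit_properties(smiles: str) -> dict:
    """Return TPSA, Wildman-Crippen MR and LogP for one SMILES string."""
    if Chem is None:
        raise RuntimeError("RDKit is required for this example")
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return {
        "TPSA": rdMolDescriptors.CalcTPSA(mol),  # topological polar surface area
        "MR": Crippen.MolMR(mol),                # Wildman-Crippen molar refractivity
        "LogP": Crippen.MolLogP(mol),            # Wildman-Crippen LogP
    }
```

Calling rdkit_properties("CCO") returns the three values for ethanol as a dictionary.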


2. Descriptor calculation

To represent these molecules, we calculated 378 2D molecular descriptors from 11 feature groups using online molecular calculation platforms: ChemDes (http://www.scbdd.com/chemdes) and BioTriangle (http://biotriangle.scbdd.com). By using ChemDes and BioTriangle, the calculated values can be downloaded as a well-organized *.csv file, which could be used directly as inputs for the next modelling step. The 359 descriptors are: 30 constitutional descriptors, 44 connectivity descriptors, 35 topology descriptors, 7 Kappa descriptors, 32 Moran autocorrelation descriptors, 60 MOE-type descriptors, 21 Basak descriptors, 64 Burden descriptors, 6 Molecular property descriptors, 25 Charge descriptors and 89 E-state descriptors.


3. Feature selection

Feature selection is needed to improve the accuracy of the QSAR models and their performance on datasets with many features. Here, we used the randomForest algorithm to perform recursive feature elimination. Based on feature importance, we selected the following numbers of features: BP (short for 'Normal boiling point'): 38, Density: 18, MP (short for 'Melting point'): 49, Solubility: 34, ST (short for 'Surface tension'): 22, Viscosity: 21, VP (short for 'Vapor pressure'): 46, FP (short for 'Flash point'): 54. Details are shown in Table 1 and Figure 1.

Table 1
Data Descriptor
BP PC6, GMTIV, W, Chi8, Sitov, Gravto, Chiv8, TPSA, Save, ncarb, UI, Scar, IC6, Hy, slogPVSA9, bcutp10, dchi4, Smax, IC1, MRVSA9, KierA1, Tnc, Smin, S17, AWeight, bcutp15, bcutv12, ndonr, dchi0, Soxy, bcutv11, IVDE, Chiv2, LogP, Hatov, LogP2, dchi1, slogPVSA11
Density AWeight, MRVSA9, slogPVSA11, Hatov, Shet, Gravto, nhyd, Scar, S7, BertzCT, IC1, dchi0, CIC1, Save, TPSA, Smin, noxy, dchi4
MP TPSA, IC1, ndonr, noxy, GMTIV, dchi4, Tnc, UI, Gravto, Chi10, knotp, IVDE, Chi3c, Shet, dchi0, J, KierA3, W, bcutp2, dchi1, IC6, Kier3, CIC1, PEOEVSA0, nhet, Chiv4pc, IC2, bcutp3, knotpv, MRVSA9, Snitro, Hatov, naro, S36, Chiv10, nhyd, LogP, Save, Scar, Smin, S34, slogPVSA1, slogPVSA0, naccr, Hy, nring, CIC4, LogP2, Chiv3c
Solubility LogP, LogP2, Tsch, Gravto, TPSA, Chiv8, bcutp2, Chi10, slogPVSA1, PEOEVSA5, MRVSA9, dchi4, Hy, GMTIV, UI, naccr, naro, bcutp3, Scar, Smax, Tpc, Smin, dchi1, AWeight, Shal, nhyd, knotp, dchi0, IC1, Save, PEOEVSA0, MZM1, CIC0, Hatov
ST S7, UI, Hatov, dchi0, TPSA, IC1, BertzCT, AWeight, nhet, S12, MRVSA9, Gravto, Smax, S17, dchi3, dchi2, slogPVSA1, GMTIV, Smin, J, Save, nring
Viscosity S34, PEOEVSA0, Tac, GMTIV, ndonr, Gravto, Chiv6, PC6, TPSA, Scar, bcutp3, Shev, Chiv2, BertzCT, LogP, bcutv10, DS, IVDE, slogPVSA1, Qindex, LogP2
VP GMTIV, W, Chi10, PC6, Chiv10, Shev, Gravto, Chiv6, TPSA, Tpc, IVDE, IC1, bcutp10, ncarb, Shet, ndonr, Save, nring, MZM2, J, PEOEVSA0, nhet, S38, MZM1, LogP, Hy, LogP2, CIC6, dchi0, Scar, dchi1, bcutv11, Smin, Hatov, AWeight, Shal, MRVSA0, Smax, Chi3c, slogPVSA9, S34, bcutp16, IC2, slogPVSA2, nsb, S35
FP GMTIV, TPSA, W, Chi10, Gravto, Tnc, Chiv10, IVDE, UI, dchi4, Shet, bcutp2, IC1, ndonr, dchi0, LogP, MRVSA9, Save, noxy, dchi2, IC5, naro, J, PEOEVSA0, Smax, Chiv2, Scar, LogP2, naccr, Snitro, Smin, bcutv11, Hy, AWeight, MZM2, Hatov, Chi3c, nnitro, S17, S36, slogPVSA1, MZM1, CIC6, S7, S34, S12, CIC1, knotp, Chiv4pc, slogPVSA9, KierA3, S35, IC2, nhyd


Figure 1. Number of variables vs. cross-validation error (the red line marks the number of selected descriptors)
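The general shape of recursive feature elimination can be sketched as a loop that repeatedly drops the least important feature and keeps the subset with the lowest cross-validation error. The sketch below is not the authors' implementation. The two callbacks are hypothetical stand-ins for random-forest importance and cross-validated error, and removing one feature per step is an assumption.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def recursive_feature_elimination(
    features: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    cv_error_fn: Callable[[List[str]], float],
    min_features: int = 1,
) -> Tuple[List[str], float]:
    """Drop the weakest feature each round; return the subset with lowest CV error."""
    current = list(features)
    best_subset, best_error = list(current), cv_error_fn(current)
    while len(current) > min_features:
        importances = importance_fn(current)  # refit importance on the current subset
        weakest = min(current, key=lambda name: importances[name])
        current.remove(weakest)
        error = cv_error_fn(current)
        if error <= best_error:  # prefer the smaller subset on ties
            best_subset, best_error = list(current), error
    return best_subset, best_error


# Toy stand-ins (hypothetical): two informative descriptors and two noise columns.
toy_importance = {"TPSA": 0.5, "LogP": 0.3, "noise1": 0.05, "noise2": 0.01}
useful = {"TPSA", "LogP"}


def toy_importance_fn(subset: List[str]) -> Dict[str, float]:
    return {name: toy_importance[name] for name in subset}


def toy_cv_error_fn(subset: List[str]) -> float:
    noise_count = sum(1 for name in subset if name not in useful)
    missing_useful = len(useful - set(subset))
    return 1.0 + 0.1 * noise_count + 1.0 * missing_useful


selected, selected_error = recursive_feature_elimination(
    ["TPSA", "LogP", "noise1", "noise2"], toy_importance_fn, toy_cv_error_fn
)
```

In this toy run the two noise columns are eliminated first and the error then rises once an informative descriptor is dropped, which is the kind of curve Figure 1 plots.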


4. Model training and validation

After feature selection, we used four state-of-the-art machine learning algorithms to build models (Cubist, RF, SVM and Boosting, as listed in the performance tables). In this step, we used the grid search method to find the best set of parameters for each model. Details are shown in Table 2. Because some outliers with relatively large errors may be present, removing some of the molecules can improve prediction accuracy. We therefore also built models on datasets from which samples with a prediction error greater than three times the RMSE had been removed.

Table 2
Data        Sample (training/test)  Descriptors  SVM sigma  SVM C  SVM epsilon  Boosting shrinkage  Boosting depth
ST          1133/283                22           -7         6      -4           0.06                8
VP          2006/504                46           -7         3      -8           0.06                8
Viscosity   444/113                 21           -5         5      -3           0.01                6
BP          4607/1151               38           -8         6      -12          0.07                8
Density     7125/1783               18           -7         6      -8           0.05                7
MP          7509/1875               49           -6         3      -2           0.07                8
Solubility  4016/1004               34           -6         4      -3           0.06                8
FP          6690/1672               54           -10        6      -3           0.09                8
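The outlier rule (prediction error greater than three times the RMSE) can be sketched as follows. This is an illustration under assumptions: the RMSE is computed over the same set of samples being filtered, the absolute error is compared with the threshold, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def non_outlier_indices(
    observed: Sequence[float], predicted: Sequence[float], factor: float = 3.0
) -> List[int]:
    """Indices of samples whose absolute error is at most factor * RMSE."""
    threshold = factor * rmse(observed, predicted)
    return [
        i
        for i, (o, p) in enumerate(zip(observed, predicted))
        if abs(o - p) <= threshold
    ]


# Toy data: twelve perfectly predicted samples except one large miss at index 5.
observed = [float(i) for i in range(12)]
predicted = list(observed)
predicted[5] += 10.0
kept_indices = non_outlier_indices(observed, predicted)
```

In the toy data the RMSE is about 2.89, so the threshold is about 8.66 and only the sample with an error of 10 is dropped.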

  Model Performance

Performance VS all data

Data Methods R2 RMSEFit Q2 RMSECV RT2 RMSET
ST Cubist 0.951 1.45 0.868 2.38 0.908 2.07
RF 0.972 1.11 0.837 2.65 0.88 2.36
SVM 0.946 1.53 0.895 2.12 0.868 2.48
Boosting 0.992 0.589 0.863 2.43 0.912 2.03
VP Cubist 0.975 0.546 0.938 0.855 0.947 0.814
RF 0.987 0.39 0.922 0.959 0.931 0.933
SVM 0.974 0.559 0.945 0.809 0.941 0.862
Boosting 0.995 0.239 0.938 0.855 0.943 0.845
Viscosity Cubist 0.916 0.166 0.683 0.322 0.832 0.231
RF 0.962 0.112 0.8 0.256 0.805 0.248
SVM 0.963 0.111 0.881 0.198 0.846 0.22
Boosting 0.93 0.151 0.79 0.262 0.789 0.258
BP Cubist 0.973 13.9 0.942 20.4 0.94 20.9
RF 0.986 10.2 0.914 24.8 0.917 24.5
SVM 0.955 18.1 0.936 21.5 0.929 22.7
Boosting 0.988 9.31 0.94 20.8 0.934 21.9
Density Cubist 0.966 0.0601 0.955 0.0696 0.963 0.0614
RF 0.988 0.0363 0.933 0.0847 0.947 0.0737
SVM 0.965 0.061 0.957 0.0676 0.967 0.0582
Boosting 0.975 0.0514 0.946 0.0761 0.955 0.0676
MP Cubist 0.878 34.4 0.827 40.9 0.842 40.1
RF 0.968 17.7 0.815 42.3 0.832 41.3
SVM 0.885 33.4 0.82 41.8 0.835 41
Boosting 0.93 26 0.834 40.1 0.842 40.1
Solubility Cubist 0.907 0.675 0.853 0.846 0.849 0.855
RF 0.974 0.354 0.852 0.85 0.846 0.865
SVM 0.917 0.636 0.851 0.852 0.859 0.826
Boosting 0.962 0.431 0.859 0.83 0.853 0.844
FP Cubist 0.921 23.4 0.871 29.8 0.874 29.1
RF 0.975 13 0.855 31.6 0.864 30.1
SVM 0.896 26.7 0.875 29.4 0.872 29.2
Boosting 0.972 14 0.872 29.7 0.872 29.2
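The Q2 and RMSECV columns are cross-validated statistics. The page does not state the formula or the number of folds, so the sketch below assumes the common definition Q2 = 1 - PRESS / SS_tot with five folds. The straight-line model and all names are hypothetical placeholders for the real learners.

```python
from typing import Callable, List, Sequence


def q_squared(
    x: Sequence[float],
    y: Sequence[float],
    fit_predict: Callable[[List[float], List[float], List[float]], List[float]],
    k: int = 5,
) -> float:
    """Cross-validated R2: every sample is predicted by a model that never saw it."""
    n = len(y)
    cv_pred = [0.0] * n
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        preds = fit_predict(
            [x[i] for i in train_idx],
            [y[i] for i in train_idx],
            [x[i] for i in test_idx],
        )
        for i, p in zip(test_idx, preds):
            cv_pred[i] = p
    mean_y = sum(y) / n
    press = sum((yi - pi) ** 2 for yi, pi in zip(y, cv_pred))  # squared CV errors
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot


def line_fit_predict(
    x_train: List[float], y_train: List[float], x_test: List[float]
) -> List[float]:
    """Placeholder model: ordinary least-squares straight line."""
    n = len(x_train)
    mean_x = sum(x_train) / n
    mean_y = sum(y_train) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x_train, y_train))
    sxx = sum((a - mean_x) ** 2 for a in x_train)
    slope = sxy / sxx
    return [mean_y + slope * (a - mean_x) for a in x_test]


# Toy data lying exactly on y = 2x + 1, so the cross-validated fit is perfect.
xs = [float(i) for i in range(10)]
ys = [2.0 * v + 1.0 for v in xs]
q2 = q_squared(xs, ys, line_fit_predict)
```

R2 (fit) and RT2 (test) would use the same ratio, with predictions on the training set and the held-out test set respectively, again assuming the standard definitions.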


Performance VS removing outliers

Data Methods R2 RMSEFit Q2 RMSECV RT2 RMSET
ST Cubist 0.972 1.050 0.929 1.657 0.936 1.720
RF 0.985 0.765 0.910 1.858 0.914 1.954
SVM 0.973 1.038 0.941 1.631 0.933 1.745
Boosting 0.995 0.457 0.934 1.601 0.935 1.719
VP Cubist 0.981 0.458 0.961 0.660 0.965 0.661
RF 0.991 0.314 0.947 0.764 0.951 0.773
SVM 0.983 0.440 0.966 0.621 0.963 0.675
Boosting 0.996 0.215 0.960 0.668 0.959 0.705
Viscosity Cubist 0.940 0.132 0.862 0.198 0.871 0.179
RF 0.974 0.085 0.855 0.201 0.851 0.208
SVM 0.982 0.076 0.924 0.156 0.896 0.179
Boosting 0.958 0.111 0.860 0.199 0.828 0.207
BP Cubist 0.984 10.660 0.967 15.241 0.967 15.187
RF 0.991 8.106 0.943 19.826 0.944 19.752
SVM 0.976 13.251 0.964 15.952 0.964 16.181
Boosting 0.991 8.225 0.963 15.980 0.964 15.922
Density Cubist 0.987 0.036 0.982 0.042 0.984 0.040
RF 0.995 0.022 0.967 0.056 0.971 0.051
SVM 0.989 0.034 0.985 0.038 0.986 0.037
Boosting 0.987 0.037 0.975 0.049 0.975 0.048
MP Cubist 0.902 30.367 0.861 35.969 0.873 35.151
RF 0.974 15.516 0.849 37.279 0.862 36.459
SVM 0.912 28.786 0.863 35.819 0.873 35.404
Boosting 0.942 23.526 0.866 35.398 0.873 35.339
Solubility Cubist 0.930 0.577 0.891 0.718 0.894 0.712
RF 0.981 0.299 0.891 0.717 0.891 0.717
SVM 0.944 0.520 0.899 0.691 0.900 0.687
Boosting 0.968 0.390 0.895 0.704 0.900 0.695
FP Cubist 0.952 17.612 0.924 21.788 0.928 20.610
RF 0.985 9.470 0.908 23.582 0.913 22.544
SVM 0.937 20.289 0.920 22.333 0.927 21.321
Boosting 0.981 11.504 0.923 21.928 0.924 21.725


Predicted values VS experimental values

Note: the following plots are the predicted values VS experimental values for ST, VP, Viscosity, BP, Density, MP, Solubility and FP respectively (removing outliers).








 E-mail: biomed@csu.edu.cn