A new approach in topological descriptors usage. Iterated line graphs in the theoretical prediction of physico-chemical properties of saturated hydrocarbons

A new look at the problem of index-based description of molecular systems is presented. The capabilities of iterated line (edge) graphs in characterizing the properties of saturated hydrocarbons were investigated. It was demonstrated that a single selected molecular descriptor (graph-theoretical (topological) or informational), calculated for a sequence of nested line graphs, provides a quite reliable progressive set of regression equations. Hence, the problem of descriptor set reduction is solved, at least partially, in the presented approach. The corresponding program package (QUASAR) has been implemented in the Python 3 programming language. The physico-chemical properties of octane isomers were chosen as the test example. The properties under investigation are boiling point, critical temperature, critical pressure, enthalpy of vaporization, enthalpy of formation, surface tension and viscosity. Rather simple linear regression equations that include one, two or three parameters have been obtained. The predictive ability of the equations has been investigated using internal validation tests. Leave-one-out (LOO) validation and Y-scrambling both evaluate the obtained equations as adequate. For instance, the best regression equation for boiling point is characterized by the determination coefficient $R^2 = 0.943$ and, with the LOO procedure, $Q^2 = 0.918$, while for the Y-scrambling test $Q^2_{\text{y-scr}} < 0.3$ in most cases.

It is shown that, in the iterated line graph approach, all the abovementioned molecular properties can be effectively described by commonly used topological indices. Namely, almost every randomly selected topological index can give an adequate equation. This effectiveness is demonstrated using the Zagreb group indices as an example. The substantial effectiveness and rather universal applicability of the so-called “forgotten” index (ZM3) were also demonstrated.


Introduction
The development and investigation of new materials are strongly connected with building mathematical models for target properties. Such a model can be based either on a rigorous physical concept (e.g. quantum theory, statistical physics) or on a statistical (chemoinformatics) interpretation of available experimental data. The latter is usually based on a formal description of the molecular structure by a large number of molecular parameters, called descriptors. Such parameters describe different aspects of the molecular system. Among them are physico-chemical data (lipophilicity, refractivity, etc.) and purely mathematical values which are not directly connected with observed molecular properties. The subsequent use of a wide arsenal of statistical and mathematical methods makes it possible to obtain equations for predicting desired properties or to classify molecular systems according to a certain criterion. In general, such tasks are designated by the widely known acronyms QSAR (quantitative structure-activity relationships) and QSPR (quantitative structure-property relationships) [1,2].
The central QSAR problem is the selection of a minimal set of descriptors that guarantees a reliable (adequate) description of the desired properties. Methods used for such selection today include factor analysis and various methods based on regularization techniques: LASSO (Least Absolute Shrinkage and Selection Operator), LARS (Least Angle Regression), etc. [3]. However, the selection of suitable descriptors is still one of the biggest problems in QSAR. The problem is essentially connected with the large number of available descriptors. For instance, the popular computational QSAR software DRAGON [4] offers more than five thousand descriptors.

During the long history of QSAR investigations, a large set of so-called topological descriptors (TDs) based on chemical graph theory and information theory has been developed. Since the first one, the Wiener index [5] (1947), thousands of indices have been proposed for the description of different molecular properties of various classes of chemical compounds. Among them are the popular Randić index [6] and the corresponding set of generalizations [7], a large number of information-theoretic indices [8,9], the so-called Zagreb group indices [10,11], etc. For a general description of TDs see reviews [8,9,12].
It should be emphasized that the development of brand new TDs is driven not only by pure mathematical curiosity but also by the need to investigate properties which cannot be described using a compact (small enough) set of known descriptors. For instance, an important hydrocarbon property, the octane number (ON), still cannot be described by a simple, low-parametric equation based on TDs.
Recently we proposed a new graph-theoretical approach based on nested chains of line graphs [13]. Namely, we build the regular molecular (vertex) graph ($G^{(0)}$), then its line (edge) graph *) ($G^{(1)}$), and then a sequence of graphs in which each next graph is the line graph of the previous one ($G^{(2)}$, $G^{(3)}$, …, $G^{(N)}$). The values of a chosen descriptor, calculated for the vertex graph and all line graphs of a molecule, form the predictor set for regression analysis. The effectiveness of our approach was demonstrated in the description of ON for saturated hydrocarbons. In the present article we continue the investigation of the line graph concept in QSAR/QSPR modeling. As an example we use a series of different physico-chemical properties of octane isomers [14].

Line graphs in regression model building
According to our approach, for a molecular system with vertex graph $G^{(0)}$, the iterative construction of the line graph sequence can be described symbolically in the following way: ...
Here V(mol) designates the procedure of building the molecular vertex graph, while each subsequent step corresponds to building the line graph of the previous graph.
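The iterative construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the QUASAR implementation: the function names (`line_graph`, `iterated_line_graphs`, `zm1`, `zm2`, `zm3`) and the edge-list representation are assumptions made here. The index definitions are the standard ones (ZM1: sum of squared vertex degrees; ZM2: sum of degree products over edges; ZM3, the "forgotten" index: sum of cubed vertex degrees). The hydrogen-suppressed carbon skeleton of n-octane is used as input.

```python
from itertools import combinations


def line_graph(edges):
    """Edge list of the line graph of a simple undirected graph.

    Every edge of the input becomes a vertex (numbered by its position in
    `edges`); two such vertices are joined when the original edges share
    an endpoint.
    """
    return [
        (i, j)
        for (i, e), (j, f) in combinations(enumerate(edges), 2)
        if set(e) & set(f)
    ]


def iterated_line_graphs(edges, n):
    """Return [G0, G1, ..., Gn] as edge lists; G0 is the vertex graph."""
    sequence = [list(edges)]
    for _ in range(n):
        sequence.append(line_graph(sequence[-1]))
    return sequence


def degrees(edges):
    """Vertex degrees; isolated vertices are absent and contribute zero."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def zm1(edges):
    """First Zagreb index: sum of squared vertex degrees."""
    return sum(d ** 2 for d in degrees(edges).values())


def zm2(edges):
    """Second Zagreb index: sum of deg(u) * deg(v) over all edges."""
    deg = degrees(edges)
    return sum(deg[u] * deg[v] for u, v in edges)


def zm3(edges):
    """'Forgotten' index: sum of cubed vertex degrees."""
    return sum(d ** 3 for d in degrees(edges).values())


# Carbon skeleton of n-octane: a path on 8 vertices (illustrative input).
n_octane = [(i, i + 1) for i in range(7)]
seq = iterated_line_graphs(n_octane, 2)

print([zm1(g) for g in seq])  # [26, 22, 18]
print([zm2(g) for g in seq])  # [24, 20, 16]
print([zm3(g) for g in seq])  # [50, 42, 34]
```

Each printed list is one row of the predictor set for a single molecule: the same index evaluated on the vertex graph and on the first two line graphs.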
The adjacency matrix of each graph in such a sequence can be easily calculated using the well-known matrix expression
$$A_{k+1} = B_k^{T} B_k - 2I,$$
where $A_{k+1}$ is the adjacency matrix of graph $G^{(k+1)}$, $B_k$ is the incidence matrix of the current graph $G^{(k)}$, and $I$ is the identity matrix.
As an example, the graph sequence for methyl derivatives of cyclopropane is presented in Figure 1. One can see from the figure that while the first line graph ($G^{(1)}$) describes the connections between edges through the vertices, the next line graph ($G^{(2)}$) describes the participation of vertices in the connections between edges, etc. For the first applications of line graphs in chemometrics see [15][16][17].
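The relation between the incidence matrix of a graph and the adjacency matrix of its line graph can be checked with a short sketch. It assumes the standard vertex-by-edge incidence matrix of a simple graph, for which the product of its transpose with itself, minus twice the identity, is the line graph adjacency matrix. The helper names and the example molecule (the carbon skeleton of methylcyclopropane) are chosen here for illustration only.

```python
def incidence_matrix(n_vertices, edges):
    """Vertex-by-edge incidence matrix: B[v][e] = 1 if vertex v lies on edge e."""
    B = [[0] * len(edges) for _ in range(n_vertices)]
    for e, (u, v) in enumerate(edges):
        B[u][e] = 1
        B[v][e] = 1
    return B


def line_graph_adjacency(B):
    """Adjacency matrix of the line graph, computed as B^T B - 2I."""
    m = len(B[0]) if B else 0
    A = [[sum(row[i] * row[j] for row in B) for j in range(m)] for i in range(m)]
    for i in range(m):
        A[i][i] -= 2  # every edge has exactly two endpoints
    return A


def edges_from_adjacency(A):
    """Edge list (i < j) of the graph described by adjacency matrix A."""
    n = len(A)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if A[i][j]]


# Methylcyclopropane skeleton: ring 0-1-2 with a methyl carbon 3 on vertex 0.
edges0 = [(0, 1), (1, 2), (0, 2), (0, 3)]
A1 = line_graph_adjacency(incidence_matrix(4, edges0))
for row in A1:
    print(row)
# [0, 1, 1, 1]
# [1, 0, 1, 0]
# [1, 1, 0, 1]
# [1, 0, 1, 0]

# Feeding the result back in gives the next graph of the sequence.
edges1 = edges_from_adjacency(A1)
A2 = line_graph_adjacency(incidence_matrix(len(A1), edges1))
```

Repeating the last two lines yields the adjacency matrices of all further line graphs in the chain.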
In the present article we use the sequence of graphs for building QSAR models, namely regression equations. In particular, according to our approach the regression model of the target property (Y) is represented by a linear equation in the values $X^{(0)}$, $X^{(1)}$, $X^{(2)}$, …, where $X^{(k)}$ is the value of the selected descriptor for graph $G^{(k)}$ and n determines the number of parameters.
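For orientation, a model of this kind can be written out explicitly. The form below is a sketch under two assumptions that are not taken from the original equation: the model contains an intercept, and the regression coefficients are denoted $a_0, a_1, \dots$

$$Y = a_0 + \sum_{k=0}^{n} a_{k+1} X^{(k)}$$

With this indexing, n = 0 uses only the descriptor of the vertex graph, while n = 2 adds the descriptors of the first and second line graphs.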

*) In contemporary mathematical literature there are several terms for the edge graph, among them the covering graph, the edge-to-vertex dual, the interchange graph, the adjoint graph, etc. In the present article we use probably the most popular of them, the line graph.
To evaluate the prognostic ability of the obtained equations we use the standard coefficient of determination $R^2$ and the corresponding value $Q^2$ obtained via the well-known leave-one-out (LOO) cross-validation procedure:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad Q^2 = 1 - \frac{\sum_i (y_i - \hat{y}_{i/i})^2}{\sum_i (y_i - \bar{y})^2},$$
where $y_i$ is the experimental value, $\hat{y}_i$ the calculated value and $\hat{y}_{i/i}$ the value predicted via the LOO procedure for the $i$-th molecule, and $\bar{y}$ is the mean value for the training sample.
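The two coefficients can be computed with a small, dependency-free sketch. It assumes an ordinary least-squares fit with an intercept and the usual definitions of the determination coefficient and its leave-one-out counterpart. The function names and the numerical data are illustrative assumptions; the data are synthetic and are not values from the octane isomer set.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit(X, y):
    """Least-squares coefficients [intercept, b0, b1, ...] via normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)


def predict(coef, row):
    return coef[0] + sum(c * x for c, x in zip(coef[1:], row))


def r2(X, y):
    """Determination coefficient of the model fitted on the full sample."""
    coef = fit(X, y)
    mean = sum(y) / len(y)
    ss_res = sum((t - predict(coef, r)) ** 2 for r, t in zip(X, y))
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def q2_loo(X, y):
    """Leave-one-out Q2: each point is predicted by a model fitted without it."""
    mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    ss_tot = sum((t - mean) ** 2 for t in y)
    return 1.0 - press / ss_tot


# Synthetic example: two predictor columns standing in for X(0) and X(1).
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 4]]
y_exact = [1 + 2 * a - 0.5 * b for a, b in X]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0]
y_noisy = [t + d for t, d in zip(y_exact, noise)]

print(r2(X, y_noisy), q2_loo(X, y_noisy))
```

For a least-squares fit the LOO value can never exceed the fitted determination coefficient, so a large gap between the two is a warning sign of overfitting.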
In the present article we describe regression models of different properties for the set of octane isomers as a test example. This rather small set of isomers (18 molecules, see appendix Table A1) poses a difficult problem for regression model building. We chose to study seven properties of the isomers (Table 1).
As the descriptor set we use more than 30 indices of different types of topological and informational descriptors. From the large set of successful equations obtained, here we describe the results given by several selected indices (Table 2). The group of Zagreb indices especially attracted our attention, since they systematically use the vertex degree concept. Among them is the so-called "forgotten" index, ZM3, which is of great interest in the contemporary literature [11]. For all indices the set of equations (6) has been obtained for n = 0, 1, 2 (due to the small training set the value of n is restricted to two). Information about the predictive ability of the regression equations based on the ZM1, ZM2, and ZM3 indices is collected in Table 3.
As an alternative to Eqs. (9-12, 14) we demonstrate below several additional equations, the best ones based on LPRS and Qind (see Table 2), along with the corresponding determination coefficients.
It should be stressed that in most cases only a complete chain of descriptors gives a satisfactory equation, i.e. if n = 2 and, for example, $G^{(1)}$ is eliminated, the resulting equation will have worse prognostic ability. The corresponding determination coefficients for the model based on the ZM3 index are given in Table 4.
Since the training set is quite small, we cannot set aside a separate test set. Another way to evaluate the effectiveness of the obtained equations (besides the LOO procedure) is the internal Y-scrambling test. This test is based on random permutations of the Y column (target property) without the corresponding permutation of the predictors. Comparison of the determination coefficients from Table 3 (and Table 4) with those obtained via the Y-scrambling test gives information about causality in the regression models, i.e. whether the fit could have arisen by chance. For the abovementioned models (7-13) we obtained very similar results in the Y-scrambling test [18]. Thus we describe it not for every model but only for a single general case. Namely, for Eq. (7) (see also the corresponding row in Table 3) the Y-scrambling procedure was performed one thousand times. In case when n = 1, 98. 8
Table 4. Demonstration of the decay of determination coefficients when an incomplete sequence of graphs is employed, using the ZM3 index as an example.
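The Y-scrambling test described above can be sketched as follows. For brevity the sketch scores a one-predictor model (the n = 0 case) by its determination coefficient, which for a straight-line fit equals the squared Pearson correlation. The function names, the fixed random seed and the data are assumptions made for illustration; the data are synthetic and only mimic the sample size of 18 molecules.

```python
import random


def r2_single(x, y):
    """R2 of a one-predictor least-squares line (squared Pearson correlation)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def y_scrambling(score, x, y, n_trials=1000, seed=0):
    """Scores obtained after randomly permuting y while x stays in place."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    scores = []
    for _ in range(n_trials):
        shuffled = list(y)
        rng.shuffle(shuffled)
        scores.append(score(x, shuffled))
    return scores


# Synthetic sample of 18 points: a linear trend plus a small deterministic wobble.
x = list(range(1, 19))
y = [2.0 * a + ((a * 7) % 5 - 2) * 0.5 for a in x]

real = r2_single(x, y)
scrambled = y_scrambling(r2_single, x, y, n_trials=1000)
share_below = sum(s < 0.3 for s in scrambled) / len(scrambled)

print(real)                             # score of the unscrambled model
print(sum(scrambled) / len(scrambled))  # mean score after scrambling
print(share_below)                      # share of scrambled scores below 0.3
```

A model is considered free of chance correlation when its own score lies far above the bulk of the scrambled scores. The same loop works with any scoring function, for example a leave-one-out coefficient for a multi-parameter model.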

Conclusion
In the present article we demonstrated the ability of the iterated line graph approach to build QSAR regression equations. In contrast to the standard approach, where the combination of descriptors has to be generated by different statistical methods (factor analysis, etc.), we use a simple stepwise approach for a single selected descriptor. We successively build the nested line graphs, calculate the chosen descriptor for each of them, and then obtain the corresponding regression equation. A simple comparison of determination coefficients makes it possible to identify the best equation and evaluate its prognostic ability.
Another aspect of the article concerns the so-called "forgotten" index, ZM3. It was observed before that ZM3 usually cannot give good predictive ability by itself, although in combination with other indices it can give a quite adequate equation [11,19]. We dispute this view and have demonstrated that ZM3 calculated for the sequence of vertex and line graphs gives regression equations with good and rather universal predictivity.