New algorithms for restoring a DNA distance matrix and their statistical study
In this paper, we continue to study heuristics for reconstructing distance matrices between DNA sequences; as before, we work with mitochondrial DNA. We apply two new heuristics. First, each time a new matrix value is obtained, we return to the previously restored elements, select the one with the maximum value of badness, and try to improve it using the newly obtained elements. Second, when choosing the final value of an element, we first select several candidate values close to the optimal one (in practice, up to 10) and then use correlation analysis to select the closest one. We conducted computational experiments on the mitochondrial DNA of all 32 genera of monkeys. In these experiments we restored not one matrix but 40: 10 matrices for each of 4 different sparsity variants. We average the calculation results in several ways, in particular by discarding the two smallest and two largest values, and by using variants of the risk function. The results indicate that, to obtain a matrix for all types of monkeys in the future, it is desirable to have about 10–12% of the data: simplifying somewhat, 7–8% gives less adequate recovery results, whereas 15% or more does not improve the value of badness compared to 10%. At the same time, obtaining such a full matrix with the algorithm for calculating DNA distance should take about 20–25 days of computer time.
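One of the averaging variants mentioned above, discarding the two smallest and two largest values, can be illustrated with a minimal sketch. It assumes that each sparsity variant yields 10 badness values, one per restored matrix. The function name `trimmed_mean` and the sample numbers are hypothetical and are not taken from the experiments.

```python
def trimmed_mean(values, trim=2):
    """Average the values after dropping the `trim` smallest and `trim` largest."""
    if len(values) <= 2 * trim:
        raise ValueError("need more than 2 * trim values")
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)


# Hypothetical badness values for the 10 restorations of one sparsity variant
# (illustrative only, not experimental data).
badness_runs = [0.31, 0.28, 0.35, 0.90, 0.30, 0.29, 0.05, 0.33, 0.32, 0.27]

# With 10 runs and trim=2, the average is taken over the 6 middle values.
print(trimmed_mean(badness_runs))
```

With this kind of averaging, a single unusually good or bad restoration does not dominate the reported badness for a sparsity variant.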