Comparative analyses of plant and viral genome length distributions via generalized statistics.
DNA, Non-additive Statistics, Cucurbits, Bayesian Inference.
This doctoral thesis explores the statistical analysis of length distributions in plant and viral genomes, aiming to understand their underlying statistical patterns and correlations. We propose a wide range of models derived from the generalized statistics of Tsallis and Kaniadakis: $q$-exponential, sum of $q$-exponentials, $q$-Gaussian, $q$-Weibull, $\kappa$-exponential, sum of $\kappa$-exponentials, and $\kappa$-Maxwellian. Bayesian inference, together with the Akaike (AIC) and Bayesian (BIC) information criteria, is used to identify the models that best explain the behavior of the analyzed genetic sequences. We first studied the length distributions of introns and exons in two plant species of the \textit{Cucurbitaceae} family, namely \textit{Cucumis sativus} and \textit{Cucumis melo}. In this case, we tested fits of the $q$-exponential function and of the sum of $q$-exponential functions, with the latter proving superior. The values found for the entropic index $q$ over all chromosomes of both species were $1.28 \pm 0.06$ for introns and $1.06 \pm 0.13$ for exons. We extended this investigation using Kaniadakis statistics and three more Cucurbit species: \textit{Cucurbita maxima}, \textit{Cucurbita moschata}, and \textit{Cucurbita pepo}. The $\kappa$-exponential, sum of $\kappa$-exponentials, and $\kappa$-Maxwellian models were tested on exon and intron sequences, and the sum of $\kappa$-exponentials proved superior. The values of the entropic index $\kappa$ for the analyzed species lie within $\kappa = 0.35 \pm 0.08$. We then expanded the database to 23 plant species from 7 different families and tested the viability of the proposed models to explain the length distributions of plant proteins. The $q$-Gaussian and $\kappa$-Maxwellian functions were superior, with values of $q$ and $\kappa$ in the same range for all species investigated: $q_g = 1.28(4)$ and $\kappa = 0.38(4)$.
These functions also described well the protein length distributions of 25 viral species belonging to the \textit{Flaviviridae} and \textit{Coronaviridae} families. Our results point to the possible existence of biological information, present in DNA sequences, capable of distinguishing plants from viruses.
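
For reference, the two deformed exponentials underlying the models above can be written in their standard form as follows. This is only a sketch: the scale parameter $\beta$ and the absence of normalization constants are assumptions made here for illustration, and the parameterization used in the body of the thesis may differ.
\[
e_q(-\beta x) = \left[ 1 - (1-q)\,\beta x \right]^{\frac{1}{1-q}},
\qquad
\exp_\kappa(-\beta x) = \left( \sqrt{1 + \kappa^2 \beta^2 x^2} - \kappa \beta x \right)^{\frac{1}{\kappa}} .
\]
Both expressions recover the ordinary exponential $e^{-\beta x}$ in the limits $q \to 1$ and $\kappa \to 0$, respectively. The information criteria, in their usual definitions, read
\[
\mathrm{AIC} = 2k - 2 \ln \hat{L},
\qquad
\mathrm{BIC} = k \ln n - 2 \ln \hat{L},
\]
where $k$ is the number of free parameters, $n$ the number of data points, and $\hat{L}$ the maximized likelihood (symbols introduced here, not taken from the text above); the model with the lower value is preferred.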