Abstract:
Metagenomic research is producing viral genome data at a rapidly growing rate, and bioinformatics and virology increasingly interact as a result. The term "viral informatics" has even emerged, denoting a whole complex of databases and knowledge bases about viruses together with the software tools for working with them. Annotation of viral genomes has previously been identified as one of the bioinformatics problems in virology. Using the recognition of coronavirus genus and subgenus as an example, the present work proposes a fairly simple and effective typological approach to virus annotation that uses the codon frequency characteristics of individual genes. A typological approach averages known data, here codon frequency characteristics, and then determines how closely the corresponding characteristics of the object under consideration resemble them. Recognition of genus and subgenus is based on a statistic that measures the deviation of the coronavirus gene under consideration from the corresponding gene of a viral genome with known genus or subgenus. The work compares recognition based on structural genes encoding virion proteins (nucleocapsid protein N and spike protein S) with recognition based on the genes of non-structural proteins combined into a single reading frame, ORF1ab. Four typological approaches are discussed. In the first two, averaging over the genera was performed on all available data and on data for prototype strains only, respectively. In the third approach, the original data on prototype strains were averaged over the subgenera. The fourth approach was based on the individual frequency characteristics of the prototype strains of the subgenera. Three of the four typological approaches showed high efficiency in recognizing the genus and subgenus of coronaviruses when using the N gene. The fourth approach proved the most effective for identifying genus and subgenus. In addition, it made it possible to reduce the number of codons considered in the N gene and to increase recognition efficiency to almost 100%.
Key words: recognition of coronavirus genus and subgenus, coronavirus genome, coronavirus prototype strains, coronavirus S and N genes, coronavirus ORF1ab.
Received 21.11.2024, 24.12.2024, Published 08.01.2025
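The typological idea summarized in the abstract can be illustrated with a minimal sketch: compute the codon frequencies of a gene, average them over the known members of each group (genus or subgenus), and assign a new gene to the group whose reference profile it deviates from least. The abstract does not specify the deviation statistic, so the sum of squared frequency differences used here is an assumption, not the statistic of the paper. The function names (codon_frequencies, average_profile, deviation, recognize) and the toy sequences are likewise hypothetical and are not real coronavirus data.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every profile is a comparable vector.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(gene):
    """Relative frequency of each of the 64 codons in an in-frame coding sequence."""
    gene = gene.upper().replace("U", "T")
    usable = len(gene) - len(gene) % 3  # ignore an incomplete trailing codon
    codons = [gene[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODONS)  # skip ambiguous bases
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid codons in sequence")
    return [counts[c] / total for c in CODONS]


def average_profile(genes):
    """Average the codon frequency vectors of several genes of one group."""
    profiles = [codon_frequencies(g) for g in genes]
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def deviation(profile, reference):
    """Assumed deviation statistic: sum of squared frequency differences."""
    return sum((p - r) ** 2 for p, r in zip(profile, reference))


def recognize(gene, references):
    """Return the name of the reference profile closest to the gene."""
    profile = codon_frequencies(gene)
    return min(references, key=lambda name: deviation(profile, references[name]))


# Toy example with made-up sequences and hypothetical group names.
references = {
    "group_A": average_profile(["ATGATGAAA", "ATGAAAATG"]),
    "group_B": average_profile(["CCCGGGCCC"]),
}
print(recognize("ATGAAAAAA", references))
```

Passing a single sequence per group to average_profile corresponds to using individual prototype-strain profiles as references, while passing several sequences corresponds to the averaged variants; restricting CODONS to a subset would mimic reducing the number of codons considered.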