RUS  ENG
Full version
JOURNALS // Matematicheskaya Biologiya i Bioinformatika // Archive

Mat. Biolog. Bioinform., 2025 Volume 20, Issue 2, Pages 320–347 (Mi mbb596)

Bioinformatics

An algorithm for computation of $P$-value of patterns occurrences in a random sequence. Overlap graph minimization

E. I. Furletova

Institute of Mathematical Problems of Biology RAS, Pushchino, Moskovskaya obl.

Abstract: We have developed algorithm MinSufPref for computing of the probability ($P$-value) of $S$ occurrences of words from a set $\mathcal{H}$ in a random sequence of length $N$. It is widely used in computer sciences as a criterion for selecting of patterns of a special kind. In particular, in bioinformatics, when searching for functional fragments in biological sequences. The proposed algorithm is based on the previously developed algorithm SufPref for exact $P$-value computation. SufPref uses an overlap graph for calculations, the nodes of the graph correspond to the words from $\mathcal{H}$ and their overlaps. The restrictions of SufPref are that it admits only sets consisting of words of a same length, and also the algorithm may not be applicable for sets consisting of a large number of words. In this paper, the algorithm is expanded to sets consisting of words of arbitrary lengths, but with the constraint that a set can not contain words occurring in other its words. Also the $\overset{R}\sim$ -equivalence relation on the prefixes of pattern words is introduced, the overlap graph is minimized in accordance to this relation. The space and time complexities of the main part of MinSufPref are linear on the number of nodes and edges of the minimized overlap graph. The time complexity is independent of the alphabet size and the lengths of words in $\mathcal{H}$. We have performed computer experiments that showed high efficiency of MinSufPref compared to similar algorithms.

Key words: $P$-value, patterns occurrences, an overlap, Aho-Corasick automaton, automaton minimization.

Received 21.06.2025, 27.07.2025, Published 12.08.2025

DOI: 10.17537/2025.20.320



Bibliographic databases:


© Steklov Math. Inst. of RAS, 2026