S. V. Znamenskij, “Approximation of the longest common subsequence length for two long random strings”, Program Systems: Theory and Applications, 2016, Volume 7, Issue 4,Pages <nobr>347

Mathematical Foundations of Programming

Approximation of the longest common subsequence length for two long random strings

S. V. Znamenskij

Ailamazyan Program Systems Institute of RAS

Abstract: The expected value $E$ of the longest common subsequence of letters in two random words is considered as a function of the $\alpha=|A|$ of alphabet and of words lengths $m$ and $n$. It is assumed that each letter independently appears at any position with equal probability.
An approximate analitic expression for $E(\alpha, m, n)$ calculation is presented that allow to calculate the $E (m, n, \alpha) $ with an accuracy of $0.3$ percent for $ 64 \leqslant m + n \leqslant 65,536 $ and $ 1 <\alpha < 129$. The coefficients are selected by hand and can be refined. It is expected that the formula holds for each grater values of the argument with the same relative error.

Key words and phrases: similarity of strings, sequence alignment, edit distance, LCS, Levenshtein metric.

UDC: 004.416

MSC: 68T37; 68P10, 68W32

Received: 24.11.2016
Accepted: 28.12.2016