RUS  ENG
Full version
JOURNALS // Chebyshevskii Sbornik // Archive

Chebyshevskii Sb., 2019 Volume 20, Issue 2, Pages 93–107 (Mi cheb755)

New approach to searching for string median and visualization of string clusters

D. V. Gorbachev, E. P. Ofitserov

Tula State University (Tula)

Abstract: We consider the following problem of searching the median of a string set:
$$ c=\mathrm{argmin}_{a\in U}\sum_{b\in D}d(a,b), $$
where $D\subset G^{*}$ is the finite set of strings over the alphabet $G$, $U\subseteq G^{*}$, $d$ is the Levenshtein edit distance. This problem has important applications, e.g., in bioinformatics in the analysis of protein sequences. However, it is known that in the general case for $U=G^{*}$ the median problem is NP-hard. Therefore, for an approximate solution, heuristics algorithms were proposed, in particular, the greedy algorithm. We give a new flexible approach based on smooth Levenshtein distance approximation $\tilde{d}$. It uses stochastic character encoding of sequences and the following formula for the edit distance:
$$ d(X_{1},X_{2})=\min_{(X_{1}',X_{2}'')_{e}\subseteq X_{1}\times X_{2}} \Bigl\{\frac{1}{2}\,\|X_{1}'-X_{2}''\|_{1}+|X_{1}|-|X_{1}'|+ |X_{2}|-|X_{2}''|\Bigr\}, $$
where the minimum is taken over all subsequences $(X_{1}',X_{2}'')_{e}$ of equal length. On the one hand, stochastic coding extends the class, over which the extremum is searched. However, our main result shows that the median is the same. On the other hand, now we can use smooth optimization methods, replacing the minimum in the definition above with a smooth approximation. We give implementation details based on the gradient descent, including the recursion formulas for calculating $\tilde{d}$. Effective calculation of the approximate median allows, e.g., to use the $k$-means method for string clustering. We propose an approach to visualize such clusters based on tSNE stochastic nesting method.

Keywords: Levenshtein edit distance, string median, smooth minimum, gradient descent, $k$-means, string cluster visualization.

UDC: 519.72

Received: 18.05.2019
Accepted: 12.07.2019

DOI: 10.22405/2226-8383-2018-20-2-93-107



© Steklov Math. Inst. of RAS, 2025