Abstract:
The paper presents an overview of the stages of statistical processing of text data, from specific informational objects in databases to the values of the numerical characteristics of these objects. For example, if a database contains descriptions of full-text research articles, then these descriptions are the specific informational objects. If such a database is populated appropriately, a multistage processing procedure makes it possible to determine the values of the numerical characteristics of the publication activity of a researcher, a scientific division, or a scientific organization as a whole. Such procedures begin with the processing of specific informational objects and end with computing the values of the numerical characteristics of these objects. At intermediate stages, tables and other objects, both verbal and numerical, may be formed. If the stages of statistical processing are designed to be reversible and the database implements a function for verifying the values of the numerical characteristics, then the verification procedure begins with the values of the characteristics and ends with access to the specific informational objects that were used to compute these values. The paper proposes a formalized description of the stages of statistical processing of text data in databases. Informational-mathematic transformation (IM-transformation) is the proposed name for such a transformation of text data into numerical values. It combines the processing of specific informational objects, the formation of verbal and numerical objects, and the mathematical computation of the values of numerical characteristics. Such a transformation of text data may include mathematical processes at certain stages; however, it cannot be fully reduced to them. The goal of the paper is to propose principles for the formalized description of the IM-transformation of texts in databases.
To illustrate these principles, the paper formalizes the process of determining the frequency of translation variants of connectives that express intertextual relations between text fragments, using the supracorpora database of connectives developed at the FRC CSC RAS.
Keywords: informational-mathematic transformation, text information, statistical processing of text information, supracorpora database.