Abstract:
Word vector representations are widely used in machine translation, recommender systems, and information retrieval. The quality of such representations, measured as the rank correlation between model similarities and expert assessments of semantic similarity, remains limited. This paper proposes an approach to improving the quality of word vector representations by merging several independent sources of primary representations. The notions of monotone and antimonotone quadruplets of words are introduced, and the hypothesis that the information contained in monotone quadruplets allows one to recover the true order of similarities for antimonotone quadruplets is formulated and verified. The paper also proposes a method for selecting word quadruplets, a two-step correction procedure based on a fully connected layer and a quadruplet loss function, and a method for evaluating the quality of the resulting representations. Experimental results on Word2Vec and GloVe models trained on a lemmatised Wikipedia corpus show that representation quality can be improved, as evaluated on the MEN, SimLex-999, and WordSim-353 expert datasets.
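
To make the quality measure concrete, the following minimal sketch computes a rank correlation between model similarities and expert similarity scores. It rests on assumptions not stated in the abstract: cosine similarity is taken as the model similarity, Spearman's coefficient as the rank correlation, and the names `cosine`, `average_ranks`, `spearman`, and `evaluate` are hypothetical helpers introduced only for illustration. It does not reproduce the evaluation method proposed in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed model similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(vectors, scored_pairs):
    """Correlate model similarities with expert scores.

    vectors: dict mapping word -> list of floats.
    scored_pairs: iterable of (word1, word2, expert_score).
    Pairs with an out-of-vocabulary word are skipped (an assumption).
    """
    model, expert = [], []
    for w1, w2, score in scored_pairs:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            expert.append(score)
    return spearman(model, expert)
```

In such a setup, `vectors` would hold the Word2Vec or GloVe representations and `scored_pairs` the word pairs of a dataset such as MEN, SimLex-999, or WordSim-353. A higher coefficient means the model orders word pairs more like the experts do.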