Semantic Structuring of Data in a Vector Database: Entity Classes, Attribute Model, and Indexing for Improving Search Relevance and Explainability

Authors: Л. ЯКОВЛЕВ Е, Э. ХАКИМОВ Р, Е. НАЗАРЕНКО П et al.

Publication: Авиакосмическое приборостроение

Published: Feb 28, 2026

Source: Crossref

Abstract

<jats:p>The growth of unstructured text data calls for new approaches to information retrieval, because traditional lexical methods (TF-IDF, BM25) handle synonymy and variation in wording poorly. Research objective: to develop and test a methodology for storing, indexing, and searching unstructured text data that combines deep vector representations of semantics (transformer models) with extended structured metadata, in order to improve the precision, recall, and explainability of search results and to enable flexible filtering. Methods: systems analysis, mathematical modeling, transformer-based vector representations of text (embeddings), graph-based approximate nearest neighbor search algorithms (HNSW), and statistical processing of results (MAP, Precision@k, Recall@k, F1-score, NDCG@k, T50, T95). Results: an ontological model for document typing and a flexible attribute model are proposed. A prototype system with a hybrid index (vector embeddings + attributes) was designed. Experiments on a corpus of 10,000 documents showed an 11-15% increase in MAP compared with purely vector and lexical search, a more than twofold reduction in median response time (T50), and a 9% improvement in Precision@5. Practical relevance: the methodology is scalable and supports deploying next-generation search systems with rich filtering capabilities and context-sensitive ranking.</jats:p>
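The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it names: attribute filtering combined with vector ranking, scored with Precision@k and average precision (MAP is the mean of average precision over queries). Everything here is an assumption made for illustration: the function names, the `id`/`vec`/`attrs` document layout, exact-match filtering, and the brute-force cosine scan, which stands in for an HNSW index and is not the authors' system.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(query_vec, docs, filters, k=5):
    """Hypothetical hybrid query: exact-match attribute filter, then cosine ranking.

    A brute-force scan replaces the approximate (HNSW) index for simplicity.
    """
    candidates = [
        d for d in docs
        if all(d["attrs"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]


def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


def average_precision(ranked_ids, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Toy corpus (invented data, not from the paper).
docs = [
    {"id": "d1", "vec": [1.0, 0.0], "attrs": {"type": "report"}},
    {"id": "d2", "vec": [0.9, 0.1], "attrs": {"type": "manual"}},
    {"id": "d3", "vec": [0.0, 1.0], "attrs": {"type": "report"}},
    {"id": "d4", "vec": [0.7, 0.7], "attrs": {"type": "report"}},
]
results = hybrid_search([1.0, 0.0], docs, {"type": "report"}, k=3)
```

In this toy run the filter removes `d2` before any similarity is computed, which is also what makes the result explainable: each returned document can be justified by the attributes it matched and by its similarity score.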

Keywords

vector search text results
