
Abstract

This paper addresses the design of an information system architecture for detecting sensitive information in Ukrainian-language corporate documents, in the context of approaches used in data loss prevention (DLP) platforms. The study is motivated by the rapid growth of unstructured textual data in corporate information systems and by the increasing need to combine information security requirements with sound principles of information system design and software architecture. A further motivation is the limited number of scientific studies on automated processing of Ukrainian-language text, particularly with regard to its morphological complexity, syntactic variability, and stylistic features, which significantly influence the effectiveness of natural language processing methods. The paper gives an analytical overview of the main approaches to sensitive information detection in text documents, including rule-based and pattern-matching methods, as well as contextual methods based on named entity recognition (NER) using machine learning and deep learning models. Special attention is paid to the architectural implications of integrating heterogeneous detection methods within a single information system, which is essential for the comparability, reproducibility, and extensibility of further experimental studies. Based on this analysis, a modular architecture for an experimental software system is proposed. The architecture is designed to provide unified conditions for text document ingestion, preprocessing, and analysis; support for multiple detection methods with different computational characteristics; and centralized aggregation and evaluation of detection results using standard quality metrics.
The proposed architectural solutions rely on streaming processing of large text documents, clear separation of responsibilities between functional components, scalability, and reproducibility of analytical procedures. The architecture also defines interfaces for the potential integration of rule-based scanners and contextual NER modules based on transformer models, forming a consistent environment for future comparative analysis. The proposed approach can serve as a methodological and design foundation for the subsequent implementation and validation of an experimental information system for studying methods of protecting textual information resources in corporate environments, and for further research at the intersection of information system design and information security.
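The modular design described in the abstract (a shared detector interface, line-by-line streaming, and centralized aggregation with standard metrics) can be illustrated with a minimal Python sketch. Every name here (`Finding`, `Detector`, `RegexDetector`, `run_pipeline`, `evaluate`) and both regular expressions are hypothetical illustrations, not taken from the paper; a transformer-based NER module is assumed to plug in by implementing the same `scan` method.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Protocol


@dataclass(frozen=True)
class Finding:
    line_no: int   # 1-based line number in the document
    start: int     # character offset of the match within the line
    end: int       # exclusive end offset
    label: str     # entity type, e.g. "EMAIL"
    detector: str  # name of the module that produced the finding


class Detector(Protocol):
    """Hypothetical common interface for rule-based and NER modules."""
    name: str

    def scan(self, line_no: int, text: str) -> Iterator[Finding]: ...


class RegexDetector:
    """Rule-based scanner: one compiled pattern per label."""

    def __init__(self, name: str, patterns: dict[str, str]):
        self.name = name
        self._patterns = {lbl: re.compile(p) for lbl, p in patterns.items()}

    def scan(self, line_no: int, text: str) -> Iterator[Finding]:
        for label, pattern in self._patterns.items():
            for m in pattern.finditer(text):
                yield Finding(line_no, m.start(), m.end(), label, self.name)


def run_pipeline(lines: Iterable[str], detectors: list[Detector]) -> list[Finding]:
    """Stream the document line by line and aggregate all findings centrally."""
    findings: list[Finding] = []
    for line_no, line in enumerate(lines, start=1):
        for detector in detectors:
            findings.extend(detector.scan(line_no, line))
    return findings


def evaluate(predicted: list[Finding], gold: set[tuple]) -> dict[str, float]:
    """Exact-match precision, recall and F1 against gold (line, start, end, label)."""
    pred = {(f.line_no, f.start, f.end, f.label) for f in predicted}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative patterns only; real DLP rules would be far more thorough.
rules = RegexDetector("rules", {
    "EMAIL": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "PHONE": r"\+380\d{9}",
})
document = [
    "Контакт: ivan@example.com",
    "Телефон: +380441234567",
    "Без чутливих даних",
]
findings = run_pipeline(document, [rules])
gold = {(1, 9, 25, "EMAIL"), (2, 9, 22, "PHONE")}
metrics = evaluate(findings, gold)
```

Because `run_pipeline` accepts any iterable of lines, an open file handle can be passed directly, so a large document is never loaded into memory at once. Tagging each `Finding` with its detector name is one way to keep results from different methods comparable after aggregation.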


Keywords

information system, methods, architecture, detection
