
Abstract

We address the structural decomposition of complex multi-column Kazakh-language newspaper pages prior to optical character recognition (OCR). We propose X-Cut++, a hybrid, fully interpretable, layout-aware pipeline that combines adaptive binarization, smoothed horizontal and vertical projection profiles, morphological dilation, colour-aware region detection in HSV space, a probabilistic Hough fallback for separator lines, and a rule-based post-OCR structural parser that reconstructs the canonical title/abstract/author/body article structure. The method is formulated as a cascade of one-dimensional projection cuts with recursive vertical and horizontal subdivision, constrained by geometric and area-based thresholds so that segmentation is deterministic and reproducible. Experiments on a multi-issue dataset of the newspaper Egemen Qazaqstan (5 issues, January–February 2024, 72 editorial pages, 300 DPI) show that X-Cut++ consistently decomposes full pages into coherent article-level fragments. The system produces 230 fragments in total (3.19 per page on average). On a manually verified subset of 15 fragments, the structural parser extracts every title and abstract correctly and identifies every author line that is present, which supports the reliability of the post-OCR structural reconstruction.
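The recursive projection-cut idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' X-Cut++ implementation: the function names (`find_gaps`, `xy_cut`), the parameters `min_gap` and `min_area`, and the rule of splitting at the widest gap are assumptions made for illustration, and the binarization, smoothing, dilation, HSV, and Hough stages are left out. The input is assumed to be an already binarized page, with `True` marking ink.

```python
import numpy as np

def find_gaps(profile, min_gap):
    """Return (start, end) runs of zeros in a 1-D profile that are at least min_gap long.

    Only runs that are followed by a non-zero sample are reported, which is enough
    here because the caller crops to the ink bounding box first.
    """
    gaps, start = [], None
    for i, value in enumerate(profile):
        if value == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_gap:
                gaps.append((start, i))
            start = None
    return gaps

def xy_cut(ink, top=0, left=0, min_gap=2, min_area=4):
    """Recursively split a boolean ink mask into boxes (top, left, bottom, right).

    min_gap and min_area are hypothetical stand-ins for the geometric and
    area-based thresholds mentioned in the abstract.
    """
    rows = np.flatnonzero(ink.any(axis=1))
    cols = np.flatnonzero(ink.any(axis=0))
    if rows.size == 0:
        return []  # nothing but background
    # Crop to the bounding box of the ink so every gap found is interior.
    r0, r1 = int(rows[0]), int(rows[-1]) + 1
    c0, c1 = int(cols[0]), int(cols[-1]) + 1
    ink = ink[r0:r1, c0:c1]
    top, left = top + r0, left + c0
    height, width = ink.shape
    if height * width < min_area:
        return []  # discard regions below the area threshold
    # Try a vertical cut (gap between columns) first, then a horizontal one.
    for axis in (0, 1):
        profile = ink.sum(axis=axis)  # axis=0: per-column counts, axis=1: per-row counts
        gaps = find_gaps(profile, min_gap)
        if not gaps:
            continue
        start, end = max(gaps, key=lambda g: g[1] - g[0])  # widest gap (assumption)
        if axis == 0:
            first = xy_cut(ink[:, :start], top, left, min_gap, min_area)
            second = xy_cut(ink[:, end:], top, left + end, min_gap, min_area)
        else:
            first = xy_cut(ink[:start, :], top, left, min_gap, min_area)
            second = xy_cut(ink[end:, :], top + end, left, min_gap, min_area)
        return first + second
    return [(top, left, top + height, left + width)]  # no further cut possible

# Toy page: two column blocks above one full-width block.
page = np.zeros((10, 12), dtype=bool)
page[1:4, 1:5] = True
page[1:4, 8:11] = True
page[6:9, 1:11] = True
boxes = xy_cut(page)
```

On the toy page, no column gap spans the full height, so the first cut is horizontal; the upper region is then split vertically into its two column blocks. Because every decision depends only on the projection profiles and fixed thresholds, repeated runs on the same input yield the same boxes, which is the sense in which such a cascade is deterministic.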


Keywords

structural pages fragments newspaper xcut
