Joint Optimization of Text Detection and Recognition via a Unified CNN-LSTM Pipeline with Transformer CTC Enhanced Decoding
Keywords: Connectionist Temporal Classification (CTC), Multimodal Feature Fusion, Text Recognition, Transformer Decoder, Unified Pipeline
Abstract
In this paper, we propose a novel approach for the joint optimization of text detection and recognition in a unified
deep learning framework. Leveraging the strengths of Convolutional Neural Networks (CNN), Long Short-Term
Memory (LSTM) networks, and Transformer models, we introduce an integrated pipeline that performs end-to-end
text detection and recognition within a single architecture. Our method utilizes CNNs for robust feature extraction,
LSTMs for sequential modeling, and incorporates a Transformer-based Connectionist Temporal Classification (CTC)
decoding mechanism to enhance performance in handling variable-length sequences and addressing text
recognition ambiguities. By jointly training text detection and recognition in a single pipeline, we eliminate the
need for separate post-processing or the traditional two-stage approach, leading to significant improvements in
both accuracy and efficiency. The Transformer CTC-enhanced decoding provides dynamic alignment and helps in
handling complex text variations, making the model highly adaptable to diverse datasets and challenging real-world scenarios. We demonstrate the effectiveness of our approach on standard text recognition benchmarks,
showing substantial improvements in both detection and recognition tasks when compared to existing state-of-the-art methods. Experimental results confirm the superior performance of our model in terms of text localization,
recognition accuracy, and inference speed.
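
To illustrate how CTC decoding handles variable-length sequences, the following is a minimal sketch of standard greedy (best-path) CTC decoding. It is not the Transformer-based decoder proposed in this paper, whose details are not given in the abstract. The function name, the choice of index 0 as the blank symbol, and the toy alphabet are assumptions made only for illustration.

```python
from typing import List, Optional, Sequence

BLANK = 0  # assumption: label index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      alphabet: str,
                      blank: int = BLANK) -> str:
    """Best-path CTC decoding (hypothetical helper, not the paper's decoder).

    frame_scores holds one score vector per time step. Label i (i >= 1)
    maps to alphabet[i - 1]; the blank label maps to no character.
    """
    # Step 1: pick the highest-scoring label at every time step.
    best_path: List[int] = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]

    # Step 2: collapse consecutive repeats, then drop blanks.
    decoded: List[str] = []
    previous: Optional[int] = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)


# Toy example: 7 frames over the labels {blank, 'a', 'b', 'c'}.
example_scores = [
    [0.10, 0.70, 0.10, 0.10],  # a
    [0.10, 0.60, 0.20, 0.10],  # a (repeat, collapsed)
    [0.90, 0.05, 0.03, 0.02],  # blank
    [0.10, 0.80, 0.05, 0.05],  # a (new character after the blank)
    [0.10, 0.10, 0.70, 0.10],  # b
    [0.20, 0.10, 0.60, 0.10],  # b (repeat, collapsed)
    [0.90, 0.03, 0.03, 0.04],  # blank
]
example_text = ctc_greedy_decode(example_scores, alphabet="abc")  # "aab"
```

The blank between the second and fourth frames is what allows the repeated character in "aab" to survive the collapse step. Seven input frames yield three output characters, which is how CTC aligns sequences of different lengths without a frame-level alignment.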
How to Cite
Pallavi Krishna Purohit, Dr. Vikas Somani, Dr. Sandeep Saxena, Joint Optimization of Text Detection and Recognition via a Unified CNN-LSTM Pipeline with Transformer CTC Enhanced Decoding, International Journal of Advanced and Applied Sciences, 12(12) 2025, Pages: 19-47

