Joint Optimization of Text Detection and Recognition via a Unified CNN-LSTM Pipeline with Transformer CTC Enhanced Decoding

Pallavi Krishna Purohit

Research Scholar, Department of Computer Science Engineering, Sangam University, Rajasthan, India

Dr. Vikas Somani

Professor, Department of Computer Science Engineering, Sangam University, Rajasthan, India

Dr. Sandeep Saxena

Professor, Department of Computer Science Engineering, JIMS Engineering Management Technical Campus, Greater Noida, India

DOI: https://doi.org/10.51690/ijaas.36891

Keywords:

Connectionist Temporal Classification (CTC), Multimodal Feature Fusion, Text Recognition, Transformer Decoder, Unified Pipeline

Abstract

In this paper, we propose a novel approach for the joint optimization of text detection and recognition in a unified
deep learning framework. Leveraging the strengths of Convolutional Neural Networks (CNN), Long Short-Term
Memory (LSTM) networks, and Transformer models, we introduce an integrated pipeline that performs end-to-end
text detection and recognition within a single architecture. Our method uses CNNs for robust feature extraction
and LSTMs for sequential modeling, and incorporates a Transformer-based Connectionist Temporal Classification (CTC)
decoding mechanism to better handle variable-length sequences and resolve text
recognition ambiguities. By jointly training text detection and recognition in a single pipeline, we eliminate the
need for separate post-processing or the traditional two-stage approach, leading to significant improvements in
both accuracy and efficiency. The Transformer CTC-enhanced decoding provides dynamic alignment and helps in
handling complex text variations, making the model highly adaptable to diverse datasets and challenging real-world
scenarios. We demonstrate the effectiveness of our approach on standard text recognition benchmarks,
showing substantial improvements in both detection and recognition tasks when compared to existing
state-of-the-art methods. Experimental results confirm the superior performance of our model in terms of text localization,
recognition accuracy, and inference speed.
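
To make the role of CTC in handling variable-length sequences more concrete, the following is a minimal sketch of standard greedy CTC decoding. It is not the Transformer-enhanced decoder proposed in the paper, whose details the abstract does not give. The function name `ctc_greedy_decode`, the blank index `0`, and the toy alphabet are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Collapse per-frame class scores into text with the standard CTC rule.

    frame_scores: one list of class scores per time step.
    alphabet: sequence mapping class index to character; the entry at
              `blank` is a placeholder and is never emitted.
    """
    # 1. Pick the best-scoring class at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Merge consecutive repeats of the same class.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks; what remains is the variable-length output.
    return "".join(alphabet[label] for label in collapsed if label != blank)


def one_hot(index, size):
    """Hypothetical helper that builds a toy score vector for one frame."""
    return [1.0 if i == index else 0.0 for i in range(size)]


alphabet = "-ab"  # index 0 is the blank placeholder
frames = [one_hot(i, len(alphabet)) for i in [1, 1, 0, 1, 2, 2, 0]]
decoded = ctc_greedy_decode(frames, alphabet)  # "aab"
```

In this toy input, seven frames collapse to a three-character string, and the blank between the two runs of `a` is what allows a doubled letter to survive the merge step.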

Published

2025-12-03

How to Cite

Pallavi Krishna Purohit, Dr. Vikas Somani, Dr. Sandeep Saxena, Joint Optimization of Text Detection and Recognition via a Unified CNN-LSTM Pipeline with Transformer CTC Enhanced Decoding, International Journal of Advanced and Applied Sciences, 12(12) 2025, Pages: 19-47

ISSUE

Volume 12, Issue 12 (December 2025)