Indian Sign Language Recognition using CNN

This project explored Indian Sign Language (ISL) recognition using convolutional neural networks, aiming to support accessibility through computer vision.

Key Contributions

Dataset preparation and preprocessing
CNN-based gesture classification
Focus on real-world usability and accuracy

Broader Impact

Beyond technical performance, the project emphasized inclusive AI, showing how machine learning can be applied to socially meaningful problems.

Implementation

Object Detection Overview

Object detection is a computer vision task that locates instances of semantic objects of a given class in digital images and videos. Unlike classification-only approaches, it returns both a class label and a bounding box for each detected instance.

YOLO: You Only Look Once

YOLO is a real-time object detection system. It frames detection as a single regression problem, mapping image pixels directly to bounding box coordinates and class probabilities in one forward pass, which makes it computationally cheaper than multi-stage detectors.
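
To make the single-regression idea concrete, the sketch below computes the shape of the prediction tensor for a YOLOv2-style head, where every grid cell predicts, per anchor box, four box coordinates, one objectness score, and the class probabilities. The 288-pixel input comes from this project; the stride of 32, the 5 anchors, and the 26 classes are assumptions chosen for illustration rather than values confirmed by the project notes, and `yolo_output_shape` is a hypothetical helper.

```python
def yolo_output_shape(input_size, stride, num_anchors, num_classes):
    """Shape of a YOLOv2-style output tensor: (grid, grid, anchors, 5 + classes)."""
    if input_size % stride != 0:
        raise ValueError("input_size must be a multiple of stride")
    grid = input_size // stride
    # 5 = four box coordinates (x, y, w, h) plus one objectness score.
    return (grid, grid, num_anchors, 5 + num_classes)

# 288 is the project's input size; stride, anchors, and class count are assumed.
shape = yolo_output_shape(input_size=288, stride=32, num_anchors=5, num_classes=26)
print(shape)  # (9, 9, 5, 31)
```

Under these assumptions the whole image is described by a 9×9 grid, and all boxes and class scores come out of one network pass.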

Approach for Indian Sign Language (ISL)

Traditional hand-crafted features often fail to generalize across all signs, especially as the symbol set grows. To address this, object detection (particularly Tiny YOLO) was introduced for ISL recognition to:

reduce computation compared to heavier models
improve robustness across varied backgrounds and conditions

Common detection models include R-CNN, Faster R-CNN, and SSD. R-CNN and Faster R-CNN rely on multi-stage pipelines (region proposal followed by classification), and standard SSD configurations typically require more computation than Tiny YOLO. Tiny YOLO provides a strong balance of speed and accuracy for this use case.

Prior Methodology and Limitations

An initial threshold-based method was used for sign segmentation, but it failed under varying lighting conditions. The updated pipeline uses Tiny YOLO trained on a diverse dataset to handle different positions, backgrounds, and illumination.
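
The lighting problem can be illustrated with a small sketch. It assumes a fixed grayscale intensity threshold, which may differ from the exact thresholding used in the initial method, and it runs on a synthetic frame rather than project data.

```python
import numpy as np

def threshold_mask(gray, thresh=128):
    """Return a boolean mask of pixels brighter than a fixed threshold."""
    return gray > thresh

# Synthetic 8x8 grayscale frame: dark background (40) with a brighter 4x4 "hand" patch (180).
frame = np.full((8, 8), 40, dtype=np.uint8)
frame[2:6, 2:6] = 180

normal = threshold_mask(frame)
# Dimmer lighting: halve every pixel (hand becomes 90, background 20).
dim = threshold_mask(frame // 2)
# Brighter lighting: add 100 and clip to 255 (hand becomes 255, background 140).
bright = threshold_mask(np.clip(frame.astype(np.int32) + 100, 0, 255))

# Number of pixels marked as "hand" in each case; the true hand covers 16 pixels.
print(int(normal.sum()), int(dim.sum()), int(bright.sum()))  # 16 0 64
```

The same threshold finds the hand exactly under the original lighting, misses it completely when the frame is dimmer, and marks the entire frame when it is brighter.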

Annotation and Dataset

Total images: ~14,000 ISL samples collected from multiple users, backgrounds, and positions.
Annotation tool: labelImg (Python + Qt), which saves annotations as XML files in PASCAL VOC format, the same format used by ImageNet.
Input size: 288×288×3 (width × height × RGB channels); each image is annotated with bounding boxes around the hand/sign regions.
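
As a minimal sketch of how such annotations could be consumed, the code below parses one PASCAL VOC XML record with the standard library and rescales its box to the 288×288 input. The sample XML (file name, label `A`, and all coordinates) is invented, `parse_voc` and `scale_boxes` are hypothetical helpers, and the sketch assumes source images are resized by plain stretching, which the project notes do not confirm.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in PASCAL VOC layout; file name, label, and numbers are invented.
SAMPLE_XML = """<annotation>
  <filename>sample_0001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>A</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>300</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>"""

def parse_voc(xml_text):
    """Return (width, height, boxes) from a PASCAL VOC annotation string."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.find("width").text)
    height = int(size.find("height").text)
    boxes = []
    for obj in root.findall("object"):
        bb = obj.find("bndbox")
        coords = {k: int(bb.find(k).text) for k in ("xmin", "ymin", "xmax", "ymax")}
        boxes.append({"label": obj.find("name").text, **coords})
    return width, height, boxes

def scale_boxes(width, height, boxes, target=288):
    """Rescale pixel boxes to a target x target image (aspect ratio is not preserved)."""
    scaled = []
    for b in boxes:
        scaled.append({
            "label": b["label"],
            "xmin": round(b["xmin"] * target / width),
            "ymin": round(b["ymin"] * target / height),
            "xmax": round(b["xmax"] * target / width),
            "ymax": round(b["ymax"] * target / height),
        })
    return scaled

width, height, boxes = parse_voc(SAMPLE_XML)
print(scale_boxes(width, height, boxes))
# [{'label': 'A', 'xmin': 45, 'ymin': 72, 'xmax': 135, 'ymax': 216}]
```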

Screenshots: annotation examples referenced as “Screenshot (119)” and “Screenshot (120)” in the original notes.

Training Environment

Hardware: Intel Core i5, NVIDIA GeForce 920MX (CUDA support)
Frameworks: Keras with TensorFlow backend (CPU/GPU variants available)
Training strategy: mini-batch training to fit memory constraints
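
The memory constraint is easy to quantify, and the sketch below does so before showing a simple mini-batch generator. It assumes images are stored as float32 arrays; `batch_generator` is a hypothetical helper, not the project's training code, and the blank 10-image array is only a stand-in for real data.

```python
import numpy as np

def batch_generator(images, labels, batch_size, seed=0):
    """Yield shuffled (images, labels) mini-batches forever, one epoch after another."""
    rng = np.random.default_rng(seed)
    n = len(images)
    while True:
        order = rng.permutation(n)  # reshuffle at the start of every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield images[idx], labels[idx]

# Memory for one float32 image at the project's 288x288x3 input size.
bytes_per_image = 288 * 288 * 3 * 4
print(bytes_per_image)                            # 995328 (about 1 MB)
print(round(14_000 * bytes_per_image / 1e9, 1))   # 13.9 (GB if all images were loaded at once)

# Tiny stand-in dataset (10 blank images) just to exercise the generator.
images = np.zeros((10, 288, 288, 3), dtype=np.float32)
labels = np.arange(10)
x, y = next(batch_generator(images, labels, batch_size=4))
print(x.shape, y.shape)  # (4, 288, 288, 3) (4,)
```

Only one batch needs to be resident at a time, which is what makes training feasible on a laptop-class GPU. Keras can train from a Python generator that yields (inputs, targets) pairs of this kind.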

Data Considerations

Real-time robustness depends heavily on dataset quality and quantity. For detection models:

Collect large, diverse samples covering variations in pose, background, and lighting.
Label each instance with precise bounding boxes and class names.
Maintain consistent annotation format (e.g., PASCAL VOC) for tooling compatibility.
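
Labeling quality can be spot-checked automatically. The sketch below is one possible sanity check, assuming each annotation has already been loaded into a dictionary with `label`, `xmin`, `ymin`, `xmax`, and `ymax` keys; `validate_box`, the class list, and the sample boxes are hypothetical.

```python
def validate_box(box, width, height, class_names):
    """Return a list of problems found in one annotation (empty if it looks fine)."""
    problems = []
    if box["label"] not in class_names:
        problems.append("unknown class: %s" % box["label"])
    # Coordinates must lie inside the image and describe a non-empty rectangle.
    if not (0 <= box["xmin"] < box["xmax"] <= width):
        problems.append("x range out of bounds or inverted")
    if not (0 <= box["ymin"] < box["ymax"] <= height):
        problems.append("y range out of bounds or inverted")
    return problems

classes = {"A", "B", "C"}  # hypothetical class names
good = {"label": "A", "xmin": 45, "ymin": 72, "xmax": 135, "ymax": 216}
bad = {"label": "Z", "xmin": 200, "ymin": 10, "xmax": 150, "ymax": 300}

print(validate_box(good, 288, 288, classes))  # []
print(validate_box(bad, 288, 288, classes))
# ['unknown class: Z', 'x range out of bounds or inverted', 'y range out of bounds or inverted']
```

Running a check like this over the full dataset before training catches mislabeled classes and inverted or out-of-frame boxes early.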

Computation Notes

Deep learning models are computationally intensive, so GPU acceleration is recommended for training. CPU-only training is possible but significantly slower, often taking days. For deployment, Tiny YOLO offers practical real-time performance on modest hardware.