The Problem: Touchless Feedback in Retail
During the post-pandemic period, a retail client needed a feedback collection system that customers could use without touching a shared screen. The requirement was simple: a tablet mounted at the store exit where customers could rate their experience using hand gestures (thumbs up, thumbs down, or a wave to skip), all processed in real time with no dependence on an internet connection.
This meant everything had to run on-device: camera capture, gesture detection, classification, and result storage. No cloud APIs, no network latency, no privacy concerns about uploading customer video to external servers.
Choosing the Right ML Approach
I evaluated three approaches:
- MediaPipe Hands: Google's pre-built hand tracking solution. Excellent for hand landmark detection, but it required building custom gesture classification on top.
- ML Kit Pose Detection: Tracks only a few coarse hand points (wrist, thumb, index, pinky) within a model optimized for full-body pose, adding unnecessary overhead for our single-hand use case.
- Custom TFLite model: A purpose-built model trained specifically for our gesture set, optimized for size and inference speed.
I chose a hybrid approach: MediaPipe Hands for hand landmark detection (21 key points per hand) feeding into a custom TFLite classifier trained on our specific gesture vocabulary. This gave us the robustness of Google's hand tracking with the efficiency of a purpose-built classification model.
Model Training and Optimization
Dataset Collection
I collected gesture samples from 50 participants across varied lighting conditions, skin tones, and hand sizes. Each of the four classes (thumbs up, thumbs down, open palm wave, and "no hand present" as a negative class) had approximately 2,000 samples, for roughly 8,000 in total. Data augmentation (rotation, scaling, brightness variation) expanded the effective training set fivefold.
Model Architecture
The classifier is intentionally simple: a feed-forward neural network that takes 42 normalized coordinates (21 landmarks × 2 dimensions) and outputs gesture class probabilities. It has no convolutions and no attention layers, just three dense layers with dropout. The simplicity is the point: inference needs to be fast enough for real-time feedback.
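To make the shape of the network concrete, here is a minimal pure-Python sketch of the inference-time forward pass. The hidden widths (64 and 32), the ReLU activations, the class label strings, and the random weights are all assumptions for illustration; the article only fixes 42 inputs, three dense layers, and four output classes. Dropout does not appear because it is only active during training.

```python
import math
import random

NUM_LANDMARKS = 21
INPUT_SIZE = NUM_LANDMARKS * 2  # 42 normalized (x, y) coordinates
GESTURES = ["thumbs_up", "thumbs_down", "wave", "no_hand"]  # hypothetical label names
HIDDEN_SIZES = [64, 32]  # assumed widths, not taken from the article


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases, standing in for trained parameters."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Fully connected layer: one dot product plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax that turns logits into class probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_model(seed=0):
    """Three dense layers: 42 -> 64 -> 32 -> 4."""
    rng = random.Random(seed)
    sizes = [INPUT_SIZE] + HIDDEN_SIZES + [len(GESTURES)]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict(model, features):
    """Forward pass at inference time (no dropout)."""
    x = features
    for layer in model[:-1]:
        x = relu(dense(x, layer))
    return softmax(dense(x, model[-1]))


model = build_model()
probs = predict(model, [0.5] * INPUT_SIZE)
print(dict(zip(GESTURES, (round(p, 3) for p in probs))))
```

With untrained weights the probabilities are meaningless; the point is only that each prediction costs a few thousand multiply-adds, which is why a network of this shape is cheap enough for per-frame inference.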
Quantization
The full-precision model was 1.2 MB. After post-training quantization to INT8, it dropped to 340 KB (roughly 72% smaller) with less than 0.5% accuracy loss. This matters for more than storage: quantized models run significantly faster on mobile CPUs because integer arithmetic is cheaper than floating-point arithmetic, and ARM processors provide SIMD instructions optimized for INT8 computation.
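The arithmetic behind INT8 quantization can be sketched in a few lines. This is an illustration of the general affine (scale and zero point) scheme, not the converter's actual code path: the TFLite tooling handles this internally, and the function names and sample weights below are hypothetical.

```python
def quantize_params(values, qmin=-128, qmax=127):
    """Derive a scale and zero point mapping the value range onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # the range must include 0 so that 0.0 maps exactly
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are 0
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to 8-bit integers, clamping to the representable range."""
    return [min(qmax, max(qmin, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Recover approximate floats from the stored integers."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-0.8, -0.1, 0.0, 0.35, 1.2]  # made-up sample weights
scale, zero_point = quantize_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"max round-trip error: {max_error:.5f} (step size {scale:.5f})")
```

Each weight is stored in one byte instead of the four bytes of a float32, and the round-trip error stays within half a quantization step, which is consistent with the small accuracy loss reported above.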
Integration Architecture
The Android app architecture followed a pipeline pattern:
- Camera capture: CameraX with the ImageAnalysis use case, delivering frames at 15 fps to limit battery drain while keeping gesture detection responsive.
- Hand detection: MediaPipe Hands processes each frame and outputs 21 hand landmarks if a hand is detected.
- Normalization: Landmarks are normalized relative to the hand bounding box, making classification invariant to position and scale.
- Classification: The TFLite model classifies the normalized landmarks and outputs a confidence score per gesture class.
- Temporal smoothing: A sliding window of 5 frames filters out spurious classifications, registering a gesture only when at least 3 of the 5 frames agree.
- Feedback recording: Confirmed gestures are stored locally in a Room database with timestamps and synced to the backend during off-peak hours.
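The normalization and temporal smoothing steps of this pipeline are the easiest to illustrate in isolation. The sketch below is written in Python for brevity rather than as Android code, and the names `normalize_landmarks` and `GestureSmoother` are hypothetical. Dividing by the longer side of the bounding box (to preserve the hand's aspect ratio) is one reasonable choice and an assumption here, as is treating "agreement" as a simple majority count within the window.

```python
from collections import Counter, deque


def normalize_landmarks(landmarks):
    """Map 21 (x, y) points into the hand's bounding box and flatten to 42 floats.

    Subtracting the box origin removes position; dividing by the longer
    box side removes scale while preserving the hand's aspect ratio.
    """
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    min_x, min_y = min(xs), min(ys)
    side = max(max(xs) - min_x, max(ys) - min_y) or 1.0  # guard a degenerate box
    flat = []
    for x, y in landmarks:
        flat.extend(((x - min_x) / side, (y - min_y) / side))
    return flat


class GestureSmoother:
    """Confirms a label once it appears in at least `required` of the last `window` frames."""

    def __init__(self, window=5, required=3):
        self.history = deque(maxlen=window)  # old frames fall off automatically
        self.required = required

    def update(self, label):
        """Record one per-frame classification; return the confirmed label or None."""
        self.history.append(label)
        top, count = Counter(self.history).most_common(1)[0]
        return top if count >= self.required else None


# Per-frame labels with two spurious classifications mixed in.
smoother = GestureSmoother()
frames = ["thumbs_up", "wave", "thumbs_up", "no_hand", "thumbs_up"]
results = [smoother.update(label) for label in frames]
print(results)  # [None, None, None, None, 'thumbs_up']
```

The two stray labels never reach the 3-of-5 threshold, so only the thumbs up is registered. In a real app the caller would also ignore a confirmed "no hand" label and reset the window after recording a gesture, so that one gesture is not counted twice.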
Performance on Real Hardware
Target hardware was a Samsung Galaxy Tab A8 — a budget tablet representative of what retailers would actually deploy. Performance results:
- End-to-end pipeline latency: 65 ms per frame (capture to classification)
- Classification accuracy: 95.3% on the test set, 93.8% in real-world deployment conditions
- Model size: 340 KB (quantized TFLite model)
- Battery impact: less than 5% per hour during active use
- Feedback submission rate: 35% higher than with the previous touch-based system
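It helps to relate the latency figure to the 15 fps capture rate and the 3-of-5 smoothing rule. The short calculation below is a back-of-the-envelope sketch that assumes frames are processed one at a time and that the 65 ms figure holds steadily; the best-case confirmation time is derived here, not measured.

```python
FPS = 15
PIPELINE_LATENCY_MS = 65.0
WINDOW, REQUIRED = 5, 3

# Time between consecutive frames delivered by the camera.
frame_interval_ms = 1000.0 / FPS

# Slack left per frame once the pipeline has finished its work.
headroom_ms = frame_interval_ms - PIPELINE_LATENCY_MS

# Best case: the first REQUIRED frames all agree, so the gesture is confirmed
# when the last of them finishes processing, (REQUIRED - 1) intervals after
# the first frame was captured.
best_case_confirm_ms = (REQUIRED - 1) * frame_interval_ms + PIPELINE_LATENCY_MS

print(f"frame interval:         {frame_interval_ms:.1f} ms")
print(f"headroom per frame:     {headroom_ms:.1f} ms")
print(f"best-case confirmation: {best_case_confirm_ms:.1f} ms")
```

Under these assumptions the pipeline just fits inside the frame interval, and a clean gesture can be confirmed in roughly a fifth of a second.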
Key Learnings
On-device ML is not about building the most sophisticated model. It is about building the simplest model that solves the problem within the constraints of the target hardware. The gap between "works in the lab" and "works in a retail store with fluorescent lighting and busy backgrounds" is bridged by robust preprocessing and temporal smoothing, not by model complexity.