
On-Device ML with TensorFlow Lite

How I built a contactless gesture-recognition feedback system achieving 95%+ accuracy entirely on-device. Model optimization, quantization strategies, and real-time inference on mid-range Android hardware.

2023

The Problem: Touchless Feedback in Retail

During the post-pandemic period, a retail client needed a feedback collection system that customers could use without touching a shared screen. The requirement was simple: a tablet mounted at the store exit where customers could rate their experience using hand gestures (thumbs up, thumbs down, or a wave to skip), all processed in real time with no dependency on an internet connection.

This meant everything had to run on-device: camera capture, gesture detection, classification, and result storage. No cloud APIs, no network latency, no privacy concerns about uploading customer video to external servers.

Choosing the Right ML Approach

I evaluated three approaches:

I chose a hybrid approach: MediaPipe Hands for hand landmark detection (21 key points per hand) feeding into a custom TFLite classifier trained on our specific gesture vocabulary. This gave us the robustness of Google's hand tracking with the efficiency of a purpose-built classification model.

Model Training and Optimization

Dataset Collection

I collected gesture samples from 50 participants across varied lighting conditions, skin tones, and hand sizes. Each of the four classes (thumbs up, thumbs down, open palm wave, and "no hand present" as a negative class) had approximately 2,000 samples, for roughly 8,000 in total. Data augmentation (rotation, scaling, brightness variation) expanded the effective training set fivefold.
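To illustrate the geometric part of that augmentation, here is a minimal sketch that rotates and scales a set of 2D hand landmarks about their centroid. It assumes augmentation is applied to landmark coordinates rather than to the source images, which this write-up does not specify. Brightness variation only makes sense at the image level, so it is left out. The function name and the parameter ranges are hypothetical.

```python
import math
import random

def augment_landmarks(points, max_rotation_deg=15.0, scale_range=(0.9, 1.1), rng=random):
    """Rotate and scale 2D landmarks about their centroid.

    points: list of (x, y) tuples, e.g. 21 hand landmarks.
    The rotation and scale ranges are illustrative assumptions.
    """
    angle = math.radians(rng.uniform(-max_rotation_deg, max_rotation_deg))
    scale = rng.uniform(*scale_range)

    # Transform about the centroid so the hand stays where it was in the frame.
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    augmented = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        augmented.append((
            cx + scale * (cos_a * dx - sin_a * dy),
            cy + scale * (sin_a * dx + cos_a * dy),
        ))
    return augmented
```

Calling this several times per recorded sample, each time with fresh random draws, is one way to grow a dataset by a fixed factor.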

Model Architecture

The classifier is intentionally simple: a feed-forward neural network that takes 42 normalized coordinates (21 landmarks × 2 dimensions) and outputs gesture class probabilities. There are no convolutions and no attention layers, just three dense layers with dropout. The simplicity is the point: inference needs to be fast enough for real-time feedback.
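The shape of such a network can be sketched in a few lines of dependency-free Python. This is only an illustration of the inference-time forward pass, not the trained model: the hidden widths, the ReLU activations, and the reading of "three dense layers" as two hidden layers plus an output layer are all assumptions, and the weights are random placeholders.

```python
import math
import random

NUM_INPUTS = 42    # 21 landmarks x 2 coordinates, as described above
NUM_CLASSES = 4    # thumbs up, thumbs down, wave, no hand
HIDDEN = (64, 32)  # hypothetical widths; the article does not state them

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases, standing in for trained parameters."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer, relu):
    """One fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def build_model(seed=0):
    rng = random.Random(seed)
    sizes = (NUM_INPUTS, *HIDDEN, NUM_CLASSES)
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def predict(model, features):
    """Forward pass at inference time. Dropout only acts during training, so it is absent here."""
    x = features
    for layer in model[:-1]:
        x = dense(x, layer, relu=True)
    return softmax(dense(x, model[-1], relu=False))
```

`predict(build_model(), features)` returns four probabilities that sum to 1, one per gesture class.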

Quantization

The full-precision model was 1.2 MB. Post-training quantization to INT8 reduced it to 340 KB, with less than 0.5% accuracy loss. The benefit goes beyond storage: quantized models run significantly faster on mobile CPUs because integer arithmetic is cheaper than floating-point arithmetic, and ARM processors have optimized instructions for INT8 computation.
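The size reduction follows from simple arithmetic: a float32 weight takes 4 bytes and an INT8 weight takes 1, so the best case is a 4x reduction, which would take 1.2 MB to about 300 KB. The reported 340 KB is consistent with that plus some data that does not shrink, though that reading is an interpretation rather than something measured here. The sketch below shows the generic affine mapping behind INT8 quantization, real ≈ scale × (q − zero_point). It is an illustration of the scheme, not the converter's code, and the function names are hypothetical.

```python
def quant_params(values, qmin=-128, qmax=127):
    """Derive a scale and zero point mapping the value range onto [qmin, qmax]."""
    # The range is widened to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax]."""
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Recover approximate floats from quantized integers."""
    return [scale * (q - zero_point) for q in q_values]

weights = [-1.0, -0.25, 0.0, 0.4, 2.0]
scale, zero_point = quant_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is why a small, well-conditioned model can lose very little accuracy.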

Integration Architecture

The Android app architecture followed a pipeline pattern:

  1. Camera capture: CameraX with the ImageAnalysis use case, delivering frames at 15 fps to minimize battery drain while keeping gesture detection responsive
  2. Hand detection: MediaPipe Hands processes each frame, outputting 21 hand landmarks if a hand is detected
  3. Normalization: Landmarks are normalized relative to the hand bounding box, making the classification position-invariant and scale-invariant
  4. Classification: The TFLite model classifies the normalized landmarks, outputting a confidence score per gesture class
  5. Temporal smoothing: A sliding window of 5 frames filters out spurious classifications, requiring agreement in at least 3 of the 5 frames before registering a gesture
  6. Feedback recording: Confirmed gestures are stored locally in a Room database with timestamps and synced to the backend during off-peak hours
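Steps 3 and 5 are plain logic and can be sketched independently of the Android stack. The Python below is a minimal illustration under stated assumptions, not the app's code: the names `normalize_landmarks` and `GestureSmoother` are hypothetical, and dividing by the longer side of the bounding box (to preserve the hand's aspect ratio) is one reasonable choice among several.

```python
from collections import Counter, deque

def normalize_landmarks(points):
    """Express (x, y) landmarks relative to their bounding box and flatten them.

    Returns [x0, y0, x1, y1, ...]; 21 landmarks give 42 values in [0, 1].
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # Use the longer side as the single scale; guard against a degenerate box.
    side = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    flat = []
    for x, y in points:
        flat.extend(((x - min_x) / side, (y - min_y) / side))
    return flat

class GestureSmoother:
    """Sliding-window vote: report a label once it fills `required` of the last `window` frames."""

    def __init__(self, window=5, required=3):
        self.required = required
        self.history = deque(maxlen=window)

    def update(self, label):
        """Add one per-frame label; return the agreed label, or None if there is no agreement."""
        self.history.append(label)
        winner, count = Counter(self.history).most_common(1)[0]
        return winner if count >= self.required else None
```

Translating or uniformly scaling the input points leaves the output of `normalize_landmarks` unchanged, which is the invariance step 3 relies on. As written, `update` keeps returning the winning label for as long as the agreement holds, so a caller would still need to record each confirmed gesture only once and to ignore the "no hand" class.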

Performance on Real Hardware

Target hardware was a Samsung Galaxy Tab A8 — a budget tablet representative of what retailers would actually deploy. Performance results:

Key Learnings

On-device ML is not about building the most sophisticated model. It is about building the simplest model that solves the problem within the constraints of the target hardware. The gap between "works in the lab" and "works in a retail store with fluorescent lighting and busy backgrounds" is bridged by robust preprocessing and temporal smoothing, not by model complexity.

Tags: TFLite, ML Kit, MediaPipe, Model Optimization, Edge Inference, CameraX, Quantization