Cutting vision-LLM cost 70-90% with motion-gating
TL;DR
Sending every frame to a vision-language model is accurate but ruinously expensive at 24/7 scale. Put a near-free OpenCV motion-gating layer in front, escalate only "interesting" frames to the VLM, and you cut API cost by 70-90% with almost no loss of real detections. I built exactly this in my dvr_ai retail loss-prevention project: OpenCV frame-differencing gates a Gemini-primary, Groq-fallback VLM, with an FFmpeg ring buffer producing evidence clips.
Sending every frame of a camera feed to a vision-language model works, and it is accurate. It is also a great way to set fire to a budget. A single camera at even a modest 1 frame every 2 seconds is ~43,000 VLM calls per day. Multiply by cameras and by months and the API bill alone kills the project before it ships.
The fix is boring and effective: put a near-free classical computer-vision layer in front of the expensive model, and only escalate frames that are actually worth a VLM call. In my own dvr_ai project — a real-time retail loss-prevention system that watches shop cameras for cash theft, sweethearting and unauthorised access — an OpenCV motion-gating stage cuts VLM calls by 70-90% in typical retail footage, because most of the day a till or stockroom simply has nobody moving in front of it.
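The arithmetic behind those figures is easy to check. The sketch below uses only the numbers quoted above (one frame every 2 seconds, a 70-90% reduction); the function names are illustrative and not taken from dvr_ai.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def calls_per_day(interval_s: float) -> int:
    """VLM calls per camera per day if every sampled frame is sent."""
    return int(SECONDS_PER_DAY / interval_s)


def gated_calls(ungated: int, reduction: float) -> int:
    """Calls left after the motion gate removes `reduction` (0..1) of them."""
    return round(ungated * (1.0 - reduction))


ungated = calls_per_day(2.0)        # one frame every 2 seconds
worst = gated_calls(ungated, 0.70)  # low end of the quoted saving
best = gated_calls(ungated, 0.90)   # high end of the quoted saving
print(ungated, worst, best)
```

Per camera that is 43,200 calls a day ungated, against roughly 4,000 to 13,000 with the gate in front.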
Why is per-frame VLM analysis the wrong default?
Two reasons. Cost, and waste.
A shop camera pointed at a register is static for the overwhelming majority of its uptime. Out of hours it is empty; in hours there are long gaps between transactions. If you bill a VLM for every frame, you are paying premium token rates to have a frontier model repeatedly confirm that an empty counter is still empty. That is the bulk of the spend and it buys nothing.
The waste compounds. Every redundant call adds latency, rate-limit pressure and log noise, and it ties your throughput to the slowest, most expensive component in the pipeline. The model should only ever see frames where something changed.
How does the motion-gate actually work?
Classic frame-differencing in OpenCV. Convert to greyscale, blur to kill sensor noise, diff against the previous frame, threshold, dilate, and count the changed pixels. If the count crosses a threshold, escalate; otherwise sleep and move on. No machine learning, no GPU, and typically only a few milliseconds of CPU per frame at sub-stream resolution.
Here is the core of my MotionDetector — this is the whole trick, and it is cheap enough to run on a Raspberry Pi:
```python
import cv2


def _compute_motion_score(self, frame, camera_id):
    # Method of MotionDetector; self.previous_frames maps camera_id -> last blurred frame.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)  # suppress sensor noise

    if camera_id not in self.previous_frames:
        # First frame for this camera: nothing to diff against yet.
        self.previous_frames[camera_id] = gray
        return 0, gray

    frame_delta = cv2.absdiff(self.previous_frames[camera_id], gray)
    thresh = cv2.threshold(frame_delta, 25, 255, cv2.THRESH_BINARY)[1]
    thresh = cv2.dilate(thresh, None, iterations=2)
    motion_score = cv2.countNonZero(thresh)  # changed-pixel count

    self.previous_frames[camera_id] = gray
    return motion_score, gray

# elsewhere: has_motion = motion_score > settings.motion_threshold
```

The Gaussian blur matters more than it looks: without it, JPEG compression artefacts and camera ISO noise produce a constant trickle of "motion" and the gate leaks. The dilation step closes small gaps so a single moving person reads as one coherent blob rather than scattered pixels. I use a flat changed-pixel count (MOTION_THRESHOLD, default 5000) rather than contour-area logic; for a fixed camera that is simpler and just as effective, and contour area is the obvious next refinement if you need to ignore, say, a ceiling fan.
Only when the gate fires does the frame get written to JPEG and sent to the VLM. The rest of the loop is a plain asyncio sleep at a configurable interval (2 seconds by default).
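The shape of that loop can be sketched with the standard library alone. This is a minimal illustration, not the dvr_ai loop itself: `gate_loop`, `grab_frame`, `motion_score`, `escalate` and `max_iterations` are hypothetical names, and the frame source and VLM call are passed in as plain callables so the control flow stands on its own.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def gate_loop(
    grab_frame: Callable[[], object],
    motion_score: Callable[[object], int],
    escalate: Callable[[object], Awaitable[None]],
    threshold: int = 5000,
    interval_s: float = 2.0,
    max_iterations: Optional[int] = None,
) -> int:
    """Sample frames on a fixed interval and escalate only those with motion.

    Returns the number of frames escalated. `max_iterations` exists so the
    loop can be stopped in tests; a real loop would run until cancelled.
    """
    escalated = 0
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        frame = grab_frame()
        if motion_score(frame) > threshold:
            await escalate(frame)  # the only expensive step
            escalated += 1
        await asyncio.sleep(interval_s)
        iteration += 1
    return escalated
```

Everything below the threshold costs one cheap score and one sleep; the expensive coroutine is only awaited on the frames that cross it.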
Which VLM, and what happens when it falls over?
Primary is Google Gemini 2.0 Flash; fallback is Groq's Llama 3.2 90B Vision. The split is deliberate: Flash is cheap, fast and good enough for "is this person palming cash", while Groq gives me a second vendor on a different network path so a single provider outage or rate-limit spike doesn't blind the whole shop.
The failover is unconditional — any exception from the primary (timeout, 5xx, rate limit, malformed JSON) drops straight through to the other provider, with tenacity doing exponential backoff (3 attempts, 2-10s) inside each call:
```python
try:
    response = await self._analyze_with_gemini(image_path, prompt)
    model_used = "gemini-2.0-flash-exp"
except Exception as e:
    logger.warning("Primary VLM failed, trying fallback", error=str(e))
    response = await self._analyze_with_groq(image_path, prompt)
    model_used = "llama-3.2-90b-vision"
```

The prompt asks for structured JSON (threat_level, confidence, detected_behaviors, reasoning) so the result is machine-actionable rather than prose I have to re-parse. High and critical threat levels always flag; medium needs to clear a confidence threshold (default 70); low needs 55+ to catch subtle behaviour without drowning in noise.
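The flagging rule just described might look like the sketch below. It assumes confidence is reported on a 0-100 scale and that both thresholds are inclusive; `should_flag` and the constant names are hypothetical, and only the JSON field names and the default values come from the description above.

```python
import json

MEDIUM_MIN_CONFIDENCE = 70  # default quoted in the text
LOW_MIN_CONFIDENCE = 55     # default quoted in the text


def should_flag(raw_response: str) -> bool:
    """Decide whether a VLM reply is worth recording as an event."""
    result = json.loads(raw_response)  # raises ValueError on malformed JSON
    level = str(result.get("threat_level", "")).lower()
    confidence = float(result.get("confidence", 0))

    if level in ("high", "critical"):
        return True  # always flagged, whatever the confidence
    if level == "medium":
        return confidence >= MEDIUM_MIN_CONFIDENCE
    if level == "low":
        return confidence >= LOW_MIN_CONFIDENCE
    return False  # unknown levels are never flagged
```

Letting the parse error propagate is deliberate in this sketch: a malformed reply is then handled like any other provider failure instead of being silently treated as "no threat".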
How do you get an evidence clip without recording everything?
A ring buffer. FFmpeg's segment muxer writes rolling fixed-length .mp4 segments and wraps, automatically overwriting the oldest — so disk usage is bounded regardless of uptime. In dvr_ai that's 30-second segments, 4 deep, copied straight off the RTSP sub-stream with no re-encode:
```sh
ffmpeg -rtsp_transport tcp -i <url> -c:v copy -an \
  -f segment -segment_time 30 -segment_wrap 4 \
  -segment_format mp4 -reset_timestamps 1 cam1_%03d.mp4
```
When the VLM confirms an event, the ClipExtractor concatenates the buffered segments and trims a window around the event timestamp (pre- and post-event seconds, configurable) into a single H.264 clip with +faststart for web playback, plus a thumbnail. That gives a manager the seconds before and after the incident, which is what actually proves intent, without ever paying to store 24/7 footage in the cloud. The -an flag drops audio for privacy and bandwidth; the sub-stream is low-bandwidth but plenty for AI analysis.
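The trim step reduces to a little arithmetic on timestamps plus one FFmpeg call. The sketch below is an assumption about how such an extractor could work, not the ClipExtractor source: `trim_window`, `ffmpeg_trim_args` and the list-file approach (FFmpeg's concat demuxer) are illustrative choices.

```python
from typing import List, Tuple

SEGMENT_SECONDS = 30  # -segment_time in the ring-buffer command
SEGMENT_COUNT = 4     # -segment_wrap in the ring-buffer command


def trim_window(
    event_ts: float,
    buffer_start_ts: float,
    pre_s: float,
    post_s: float,
    buffer_len_s: float = SEGMENT_SECONDS * SEGMENT_COUNT,
) -> Tuple[float, float]:
    """Return (start_offset, duration) in seconds within the joined buffer.

    The window is clamped to what the buffer actually holds, so an event near
    the oldest edge yields a shorter lead-in rather than a negative offset.
    """
    start = max(0.0, event_ts - pre_s - buffer_start_ts)
    end = min(buffer_len_s, event_ts + post_s - buffer_start_ts)
    return start, max(0.0, end - start)


def ffmpeg_trim_args(list_file: str, start: float, duration: float, out: str) -> List[str]:
    """Build an ffmpeg command that joins the segments and cuts one H.264 clip."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_file,  # concat demuxer input
        "-ss", f"{start:.3f}", "-t", f"{duration:.3f}",
        "-c:v", "libx264", "-an",
        "-movflags", "+faststart",
        out,
    ]
```

With 30-second segments four deep, the buffer never holds more than about two minutes, so pre- and post-event windows have to fit inside that. The cut also has to wait until the post-event seconds have actually been recorded.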
Where do you set the sensitivity threshold?
This is the only knob that really matters, and it is a straight trade-off:
- Threshold too low → the gate leaks. Lighting shifts, shadows and compression noise trigger VLM calls on nothing, and your cost savings evaporate.
- Threshold too high → you miss real events. A quick, low-movement action (slipping a note into a pocket) might not move enough pixels to trip the gate, and a missed theft is a far worse failure than a wasted API call.
My rule: tune the gate to over-trigger slightly and let the VLM be the precise filter. A wasted Gemini Flash call costs a fraction of a penny; a missed event costs the client real money and costs me credibility. So I bias the cheap stage toward recall and lean on the expensive stage for precision. In practice the 70-90% saving comes from the long dead periods, not from being clever at the margin — you do not need an aggressive threshold to capture most of the benefit.
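One way to pick the number is to replay logged motion scores against candidate thresholds. The sketch below assumes you have scores for frames known to contain real events and scores for frames known to be idle; `sweep_thresholds` and `pick_threshold` are hypothetical helpers, not part of dvr_ai.

```python
from typing import Dict, Iterable, List, Optional


def sweep_thresholds(
    event_scores: Iterable[int],
    idle_scores: Iterable[int],
    candidates: Iterable[int],
) -> List[Dict[str, float]]:
    """For each candidate, report recall and the share of idle frames escalated."""
    events = list(event_scores)
    idle = list(idle_scores)
    rows = []
    for t in candidates:
        caught = sum(1 for s in events if s > t)  # real events that trip the gate
        leaked = sum(1 for s in idle if s > t)    # wasted VLM calls
        rows.append({
            "threshold": t,
            "recall": caught / len(events) if events else 0.0,
            "idle_escalated": leaked / len(idle) if idle else 0.0,
        })
    return rows


def pick_threshold(rows: List[Dict[str, float]], min_recall: float = 1.0) -> Optional[float]:
    """Highest threshold that still meets the recall target, or None."""
    ok = [r for r in rows if r["recall"] >= min_recall]
    return max(ok, key=lambda r: r["threshold"])["threshold"] if ok else None
```

Picking the highest threshold that still meets the recall target encodes the bias described above: recall is fixed first, and cost is only optimised within that constraint.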
The downstream confidence thresholds (70 for medium, 55 for low) are the second line of defence against false positives, and because every event is logged with its frame, model and reasoning, you can review false positives and tune both the prompt and the thresholds against real footage rather than guesses.
How are alerts delivered?
A webhook. On a confirmed medium-or-higher event the AlertDispatcher POSTs a JSON payload — event type, threat level, confidence, behaviours, reasoning and a link to the evidence clip — to a configurable endpoint, again wrapped in tenacity retry with backoff. Webhook-first means the surveillance system stays decoupled from however the client wants to be notified; the receiving end fans out to whatever channel they use. Delivery success and the raw response are written back onto the event record so there's an audit trail of what fired and whether it landed.
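A dispatcher along those lines can be sketched with the standard library. This is a stand-in, not the AlertDispatcher: a hand-rolled capped exponential backoff replaces tenacity, and `send_json`, `dispatch` and the delivery-record keys are hypothetical names. The 3 attempts and 2-10 second delays mirror the VLM retry settings quoted earlier and are an assumption for the webhook.

```python
import json
import time
import urllib.request
from typing import Callable, Dict


def send_json(url: str, payload: Dict) -> int:
    """POST a JSON body with the standard library and return the HTTP status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def dispatch(
    url: str,
    payload: Dict,
    attempts: int = 3,
    base_delay_s: float = 2.0,
    max_delay_s: float = 10.0,
    sender: Callable[[str, Dict], int] = send_json,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Try the webhook up to `attempts` times with capped exponential backoff.

    Returns a small delivery record suitable for writing back onto the event.
    """
    last_error = ""
    for attempt in range(1, attempts + 1):
        try:
            status = sender(url, payload)
            return {"delivered": True, "attempts": attempt, "status": status}
        except Exception as exc:  # network error, timeout, HTTP error status
            last_error = str(exc)
            if attempt < attempts:
                sleep(min(max_delay_s, base_delay_s * 2 ** (attempt - 1)))
    return {"delivered": False, "attempts": attempts, "error": last_error}
```

Passing `sender` and `sleep` in as parameters keeps the retry logic testable without a network or real delays, and the returned record is the kind of thing that could be stored as the audit trail.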
When is this worth building, and when is it not?
Motion-gating in front of a VLM is the right call when your feed is mostly static — fixed cameras, long idle periods, occasional events of interest. Retail interiors, stockrooms, restricted doorways, equipment bays: all ideal. The gate is near-free, runs on-prem on commodity hardware, and the only data that ever leaves the building is the handful of frames that already showed motion — which matters a great deal for clients with privacy or trade-secret concerns. The whole stack is a single Docker container plus Postgres.
It is the wrong tool when the scene is always busy (a packed shop floor, a motorway), because the gate fires constantly and saves nothing — there you want a cheaper purpose-trained detector doing first-pass classification, with the VLM reserved for adjudicating ambiguous cases. It is also overkill if you genuinely need every frame analysed, e.g. high-frequency process inspection. And note the honest limit of frame-differencing: it sees change, not meaning. A person standing dead still defeats it, and headlights or a flickering screen can trip it. Those are exactly the cases the VLM is there to resolve — which is the whole point of the two-tier design.
So: cheap classical CV decides when to look, the expensive model decides what it is. That division is what makes 24/7 vision-LLM economically sane.