Articles
Notes from production
Specifics from real builds: tail latency, data-quality testing, and the governance that lets AI-assisted development into an enterprise. No think-piece filler.
Stop your RAG system hallucinating
Most RAG hallucinations are retrieval failures, not generation failures. Diagnose which, ground answers in cited context, make the model abstain, and track faithfulness.
Self-hosting open models vs an API: where the cost actually crosses over
Self-hosting open-weight models beats API pricing at high steady throughput or under data-residency rules; APIs win for spiky, low-volume, or frontier-quality work.
Model routing: stop sending every request to your biggest model
Most LLM traffic doesn't need a frontier model. Route by rules, a classifier, or a cascade to cut spend several-fold without silently degrading quality.
LLM-as-a-judge: evaluating LLM systems that actually scale
How to use LLM-as-a-judge to evaluate generative systems at scale: rubrics, golden sets, bias mitigation, human calibration with Cohen's kappa, and CI gates.
Migrating off tag proliferation to branch/environment CI/CD in GitLab Ultimate
Replace tag-driven releases with a branch/environment promotion model in GitLab Ultimate: protected branches, approval gates, LDAP/AD identity, and build-once-promote pipelines.
Enterprise Claude Code without leaking your code
Run Claude Code via Amazon Bedrock with VPC endpoints so prompts and code stay in your AWS account: IAM-scoped, no public egress, no training on your data.
Cutting vision-LLM cost 70-90% with motion-gating
A cheap OpenCV motion-gating layer in front of a vision-language model cuts 24/7 surveillance API cost 70-90%, as built in my dvr_ai project.
Your first dbt tests
Three low-effort dbt tests catch roughly 80% of warehouse errors. Add not_null, unique and accepted_values on your keys and enums, wire them into CI, and bad data stops reaching dashboards.
Cutting p95 latency without new hardware
A metric-driven playbook that routinely trims 40%+ off tail latency before you reach for a bigger instance — measure, relieve the DB, control concurrency, shed load.
Working on one of these problems?
Book a 15-minute call — it's faster than reading all nine.