Private LLMs, RAG, and the full stack that actually makes them useful
AI has proven its worth faster than any tech in recent memory. Still, there are plenty of cases where you’d love to use it but don’t—because the data is sensitive. Your codebase. Customer PII. Anything that lives under GDPR. Sometimes you simply can’t ship that to ChatGPT. The fix: host your own ChatGPT.
This is one of the services we deliver most often: Subsense has set up dozens of private LLM stacks. The important bit is ROI. Despite the hype (and whatever Sam Altman is tweeting), running big models locally is expensive. A 400B-parameter monster can burn through well over $10k/month, and that’s usage-dependent: more users → more tokens → more cash out the door.
So we don’t run “one big brain.” We run a model portfolio and route by task (a quick routing sketch follows the list):
- Small/fast models for classification, summarization, extraction, and routine chat.
- Medium models for planning, tool use, and more complex reasoning.
- Large models reserved for the truly hard stuff where quality pays for itself.
- VLMs (vision-language models) for image/video tasks. For example, a model like Qwen3-VL-8B-Instruct can describe images or flag littering in a scene at a fraction of the cost of a frontier model—often 10× cheaper and plenty accurate for ops workflows.
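In practice the routing is boring, and that’s the point: classify the task, pick the cheapest model that clears the quality bar. A minimal sketch, assuming an OpenAI-compatible gateway in front of your self-hosted models (vLLM, TGI, and Ollama can all expose that API shape); the endpoint, API key, and model names below are placeholders:

```python
# Minimal task-based router. Everything here is illustrative: the gateway URL,
# API key, and model names are placeholders for whatever you actually deploy.
from openai import OpenAI

# One OpenAI-compatible endpoint in front of several self-hosted models.
client = OpenAI(base_url="https://llm.internal.example/v1", api_key="internal-token")

MODEL_BY_TASK = {
    "classify":  "small-7b-instruct",     # cheap and fast
    "summarize": "small-7b-instruct",
    "plan":      "medium-32b-instruct",   # tool use, multi-step reasoning
    "hard":      "large-moe",             # reserved for genuinely hard questions
    "vision":    "qwen3-vl-8b-instruct",  # image/video description
}

def answer(task: str, prompt: str) -> str:
    model = MODEL_BY_TASK.get(task, "small-7b-instruct")  # default to the cheapest tier
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,  # hard ceiling keeps cost predictable
    )
    return resp.choices[0].message.content
```

A production router would also weigh context length, tenant budgets, and latency targets, but the shape stays this simple.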
If you do need frontier-level IQ, the good news is that open-weight models are now very close to the big proprietary systems for many enterprise tasks. With the right finetunes and prompt tooling, the gap is narrow to non-existent in practice.
RAG: the workhorse that turns “a model” into “your model”
RAG (Retrieval-Augmented Generation) is how you keep data inside your own walls while still getting accurate, grounded answers. No more hallucinated policy PDFs.
What we build:
- Connectors & ingestion. Git, S3/Blob, SharePoint/Drive, Jira/Confluence, CRM/ERP, ticketing, and databases. We normalize formats (PDF, HTML, Markdown, Office), parse tables/figures, and preserve permissions.
- Smart chunking & metadata. Structure-aware splitting (headings, sections, code blocks), semantic chunk sizes, and rich metadata (owner, product, region, PII flags).
- Embeddings & the index. Dense vectors (FAISS/Milvus/pgvector), optional sparse (BM25) for hybrid search, and reranking for quality (see the retrieval sketch after this list).
- Query pipeline. Query rewriting, multi-hop retrieval, citations, and guardrails against prompt injection. For code and analytics, we add tool use (SQL, search, code runners) with strict sandboxes.
- Security. Row-level/ACL-aware retrieval: the model never sees what the user isn’t entitled to.
- Feedback loop. Thumbs, comments, and capture of “gold answers” into an eval set. What gets used improves what gets retrieved.
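To make the retrieval side concrete, here’s a minimal sketch of ACL-filtered hybrid search with reciprocal-rank fusion. It’s illustrative only: the `embed()` stub stands in for your embedding model, the chunks and groups are invented, and a real deployment sits on FAISS/Milvus/pgvector with a reranker instead of in-memory arrays.

```python
# Hybrid retrieval sketch: permission filter first, then BM25 + dense rankings
# fused by reciprocal rank. Chunks, ACL groups, and embeddings are dummies.
import numpy as np
from rank_bm25 import BM25Okapi

chunks = [
    {"id": "hr-001",  "text": "Employees may carry over five vacation days.", "acl": {"hr", "all-staff"}},
    {"id": "fin-204", "text": "Q3 revenue recognition follows IFRS 15.",      "acl": {"finance"}},
    {"id": "it-310",  "text": "VPN access requires a hardware security key.", "acl": {"all-staff"}},
]

def embed(texts):
    # Placeholder: swap in your embedding endpoint. Random vectors keep the sketch self-contained.
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(texts), 384))

bm25 = BM25Okapi([c["text"].lower().split() for c in chunks])
dense_index = embed([c["text"] for c in chunks])

def ranks(scores):
    """Map chunk index -> rank position (0 = best)."""
    order = np.argsort(-np.asarray(scores))
    return {int(idx): pos for pos, idx in enumerate(order)}

def retrieve(query: str, user_groups: set, k: int = 2):
    allowed = [i for i, c in enumerate(chunks) if c["acl"] & user_groups]  # ACL filter before anything else
    sparse = ranks(bm25.get_scores(query.lower().split()))
    dense = ranks(dense_index @ embed([query])[0])
    # Reciprocal-rank fusion: robust to the two scores living on different scales.
    fused = {i: 1 / (60 + sparse[i]) + 1 / (60 + dense[i]) for i in allowed}
    return [chunks[i] for i in sorted(fused, key=fused.get, reverse=True)[:k]]

# The finance chunk can never reach the model for this user, whatever its score.
for c in retrieve("how many vacation days roll over?", user_groups={"all-staff"}):
    print(c["id"], "->", c["text"])
```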
Result: grounded answers with live citations, all inside your compliance perimeter.
The self-hosted LLM/VLM platform we ship
You don’t just need a model; you need the plumbing. Here’s the stack we typically stand up:
- Inference layer. vLLM/TGI/Ollama endpoints with GPU batching, KV-cache reuse, speculative decoding, and quantization (8-/4-bit) where it makes sense. Autoscaling on Kubernetes, per-tenant isolation, blue/green and canary deploys.
- API gateway & auth. Single endpoint for chat/completions/RAG/tools, with OIDC/SSO, RBAC, rate limits, SLA tiers, and per-team token budgets.
- Prompt & model versioning. Templates live in Git; every change is tracked. Roll back bad prompts like you roll back code.
- Observability. Tracing, logs, and dashboards for latency, cost, token flow, tool calls, retrieval quality, and user satisfaction. Prompt/response diffs by version.
- Evaluation & safety. Offline eval sets for accuracy/faithfulness, protected-class and toxicity checks, jailbreak and prompt-injection detection, DLP/PII redaction on ingress and egress.
- Cost controls. Response caching, truncation/compaction, smart routing to smaller models, batch inference, max-token ceilings, and scheduled off-peak jobs.
- MLOps for LLMs. Finetunes/LoRA adapters, experiment tracking, datasets with lineage, shadow testing, and A/Bs that compare “old stack vs new stack” before flipping traffic.
- Data governance. Encryption in transit/at rest, EU data residency, retention windows, audit logs, and support for subject-access/erasure requests (hello, GDPR).
- Dev experience. SDKs, function-calling contracts (sketched after this list), tool registries, and ready-to-use UI widgets (chat, RAG search, doc Q&A, code assistant).
- VLM specifics. OCR pipelines, image/frame sampling, face/plate redaction if required, ephemeral object storage, and content safety policies baked in.
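As one concrete slice of the dev-experience layer, a function-calling contract against the gateway can be this small. A sketch only: the endpoint, model name, and `create_ticket` tool are invented, and whatever the model asks for gets executed by your own sandboxed backend, never by the model itself.

```python
# Function-calling sketch: the model returns a structured call that matches a
# JSON-schema contract; your backend decides whether and how to execute it.
from openai import OpenAI

client = OpenAI(base_url="https://llm.internal.example/v1", api_key="team-token")

tools = [{
    "type": "function",
    "function": {
        "name": "create_ticket",
        "description": "Open a ticket in the internal tracker.",
        "parameters": {
            "type": "object",
            "properties": {
                "title":    {"type": "string"},
                "severity": {"type": "string", "enum": ["low", "medium", "high"]},
            },
            "required": ["title", "severity"],
        },
    },
}]

resp = client.chat.completions.create(
    model="medium-32b-instruct",  # routed to the mid-tier model
    messages=[{"role": "user", "content": "Checkout is timing out for EU users, please raise it."}],
    tools=tools,
)

# Assuming the model chose to call the tool; arguments arrive as a JSON string.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```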
How we keep the bill sane (without kneecapping quality)
- Right-size the model to the job. 70–80% of calls don’t need a giant.
- Cache aggressively. Deterministic prompts + citations = high cache hit rates (see the sketch after this list).
- Distill where useful. Teach a smaller model your domain with supervised pairs.
- Tight prompts. Short system prompts, structured outputs, and hard token caps.
- Batch & fuse. Combine retrieval steps and parallel tool calls to cut latency and cost.
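The caching line item deserves a picture, because it’s the least glamorous and the highest leverage. A minimal sketch, with an in-memory dict standing in for Redis and `client` being any OpenAI-compatible client like the ones above:

```python
# Response cache keyed on a hash of (model, messages, max_tokens). Deterministic
# prompts and temperature 0 are what make the hit rate worth having.
import hashlib
import json

_cache: dict[str, str] = {}  # swap for Redis or similar in production

def cache_key(model: str, messages: list, max_tokens: int) -> str:
    payload = json.dumps(
        {"model": model, "messages": messages, "max_tokens": max_tokens},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(client, model: str, messages: list, max_tokens: int = 512) -> str:
    key = cache_key(model, messages, max_tokens)
    if key in _cache:  # identical question -> free answer
        return _cache[key]
    resp = client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens, temperature=0,
    )
    _cache[key] = resp.choices[0].message.content
    return _cache[key]
```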
What this unlocks
- Private chat & code assistants that actually know your stack—safely.
- RAG search over policies, contracts, and tickets with live citations.
- Ops copilots that read dashboards, summarize alerts, and open tickets with the right context.
- Vision workflows: classify defects, redact images, flag safety issues, or “what’s in this scene?” checks.
- Analytics copilots: natural-language to SQL with permission-aware data access.
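For the analytics copilot, the design choice that matters is that permissions live in the database, not in the prompt. A sketch under heavy assumptions: Postgres row-level security, a per-user role coming from your auth layer, and generated SQL that still gets validated before it runs; the model name, role, and connection string are placeholders.

```python
# Sketch: natural language -> SQL, executed under the *user's* database role so
# row-level security (not the model) decides which rows come back.
import psycopg  # psycopg 3
from openai import OpenAI

client = OpenAI(base_url="https://llm.internal.example/v1", api_key="team-token")

def ask_warehouse(question: str, user_role: str) -> list:
    sql = client.chat.completions.create(
        model="medium-32b-instruct",  # placeholder model name
        messages=[
            {"role": "system", "content": "Answer with one read-only PostgreSQL SELECT statement and nothing else."},
            {"role": "user", "content": question},
        ],
        max_tokens=256,
    ).choices[0].message.content.strip().rstrip(";")

    # Read-only transaction + SET ROLE: the database enforces permissions.
    # user_role must come from your own auth layer (an allowlist), never from the model.
    with psycopg.connect("dbname=analytics", options="-c default_transaction_read_only=on") as conn:
        with conn.cursor() as cur:
            cur.execute(f'SET ROLE "{user_role}"')
            cur.execute(sql)
            return cur.fetchall()

# ask_warehouse("Average order value by region last quarter?", user_role="analyst_emea")
```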
Bottom line: you get the upside of AI without throwing your crown-jewel data over the wall. Pick the right model for the job, ground it with your knowledge via RAG, and wrap it in tooling that keeps you fast, safe, and cost-effective.
If you want this stack running in your environment—and you want it to pay for itself—get in touch.