When Self-Hosting LLMs Becomes a Bottleneck
Lessons from Production-Grade AI Infrastructure
The Problem
At Muttdata, we built an AI-powered system to generate sensitive legal documents. The pipeline was complex by design:
- 100–120 LLM calls per run
- 5–8 different models (extraction, generation, formatters, fallbacks, judges, guardrails)
- Long-context RAG (PDF extraction, FAISS retrieval, multi-stage reasoning)
- Highly sensitive legal data protected by attorney–client privilege
Stringent privacy and regulatory requirements forbade sending client data to external APIs, so self-hosting large language models was the only viable option.
This decision wasn’t about cost savings; it was driven by the need to maintain data sovereignty and security inside our private infrastructure, since third-party cloud LLMs couldn’t guarantee compliance.
We chose Ollama to serve models on AWS EC2 (g5.12xlarge with 4× A10G GPUs). It offered dynamic model switching, policy-flexible model support (critical for abuse-related legal content), and a simple API for multi-model orchestration.
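To make "multi-model orchestration" concrete: each pipeline stage talks to Ollama's standard `/api/generate` endpoint, selecting a model per stage. The sketch below is illustrative, not our production code; the stage names and model tags are placeholders.

```python
import json
import urllib.request

# Illustrative stage-to-model mapping; our actual stages and models differed.
STAGE_MODELS = {
    "extraction": "qwen2.5:14b",
    "generation": "llama3.1:8b",
    "judge": "mistral:7b",
}

def build_generate_payload(stage: str, prompt: str) -> dict:
    """Build a request body for Ollama's /api/generate endpoint."""
    return {
        "model": STAGE_MODELS[stage],
        "prompt": prompt,
        "stream": False,  # single JSON response instead of a token stream
    }

def call_ollama(stage: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send a generation request to a local Ollama server (requires one running)."""
    body = json.dumps(build_generate_payload(stage, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The appeal is exactly this simplicity: one HTTP API, and switching models is a matter of changing a string.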
What followed were months of instability. A single stalled model call would freeze the entire event-driven pipeline, and we spent more time fighting infrastructure than shipping product features.
The Original Challenge: Ollama in Production
We initially tried to deploy Ollama directly on AWS EC2 g5 instances for our legal writing pipeline. Through a systematic stress test—including monitoring GPU memory in CloudWatch, tuning context windows (2K to 128K), testing multiple models (Llama, Mistral, Qwen), and measuring hang rates across 100+ runs—we discovered significant architectural limits.
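Monitoring GPU memory during these runs amounts to sampling `nvidia-smi` on an interval and shipping the values as metrics. A minimal sketch of that sampling step (the query flags are standard `nvidia-smi` options; the CloudWatch publishing side is omitted):

```python
import subprocess

def parse_memory_mib(raw: str) -> list[int]:
    """Parse nvidia-smi CSV output: one memory.used value (MiB) per line/GPU."""
    return [int(line.strip()) for line in raw.splitlines() if line.strip()]

def read_gpu_memory_mib() -> list[int]:
    """Sample per-GPU memory use via nvidia-smi (requires NVIDIA drivers)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_mib(out)
```

On a 4× A10G instance this returns four values, which is what makes uneven unloading and memory lockups visible across runs.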
Ollama quirks: Inconsistent model unloading caused critical memory lockups, which forced us to implement subprocess-based timeouts to prevent indefinite stalls.
The overarching lesson:
The core issue wasn’t language-model quality—it was infrastructure reliability under sustained production constraints. Production LLM serving exposes architectural limits that often stay hidden in smaller, non-autonomous workloads.
The Pivot: An API-First Strategy
Recognizing that orchestration issues were slowing delivery, we shifted strategy: we shipped the feature using managed LLM APIs in a compliant setup first. This let us validate product value and stabilize a complex 10–15-call pipeline without the operational burden of GPU fragmentation and indefinite hangs.
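One pattern that keeps a multi-call pipeline moving in an API-first setup is ordered provider fallback: try the primary compliant endpoint, fall back to the next on failure. A minimal sketch, with hypothetical provider callables standing in for real managed-API clients:

```python
from typing import Callable

def call_with_fallback(providers: list[Callable[[str], str]], prompt: str) -> str:
    """Try each provider in order; return the first successful response.

    `providers` is an ordered list of callables wrapping compliant managed
    endpoints. Catching bare Exception is for illustration only; a real
    pipeline would narrow this to transport and rate-limit errors.
    """
    errors: list[Exception] = []
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as exc:
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")
```

The same wrapper also gives a single place to enforce timeouts and logging, which is most of what the self-hosted setup was missing.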
The takeaway is simple: start with managed APIs—without compromising compliance—and optimize later.
Self-hosting GPU inference is a specialized domain. Multi-GPU setups require deep expertise in CUDA, memory architecture, and observability. For most teams, the operational overhead outweighs the benefits unless infrastructure is a core competency or the workload strictly mandates it.
The Path Back to Self-Hosting (Future Alternatives)
While managed APIs addressed the immediate stability issues, self-hosting remains our long-term goal. We’re evaluating more robust, production-grade serving frameworks for a future transition:
- vLLM: A leading option for high-throughput workloads. Its PagedAttention approach and stronger NVLink support are designed to mitigate the PCIe-related performance degradation we saw previously.
- llama.cpp: A strong option for tighter control and efficient resource use. Its lightweight design and mature quantization support may help address the opaque logging and unreliable unloading we experienced.
Final Thought
Experimentation matters, but timing matters more. Leadership isn’t just choosing the right tool—it’s knowing when to let a provider handle the heavy lifting so you can focus on the features that define your product.
Before committing to a permanent self-hosted stack, use managed APIs to reach “production reality” faster. Once your workload is validated, then—and only then—invest in the CUDA and memory expertise required to run engines like vLLM or llama.cpp at scale. Otherwise, you risk optimizing infrastructure for a product that hasn’t had the chance to breathe.