When the GPU Dies

2025-07-07

When the GPU Dies — Keeping Internal LLM Experiments Rolling on Decade‑Old CPUs

(A war‑story about queues, Kubernetes, and refusing to lose momentum)

1  Context & Stakes

Our team was proving out some internal LLM‑powered tools. No customers were waiting on live responses, but every lost day meant one more slip on our roadmap and one more frustrated analyst (read: me). When our single NVIDIA A100 croaked, progress ground to a halt. Latency jumped from roughly 2-3 seconds per request to well past our hard 60-second proxy limit, so calls to our Ollama service failed even when we were willing to wait. After briefly throwing up my hands and resigning myself to losing months of progress, I began to wonder how much performance we could squeeze from our legacy hardware.

  • CPU inference on our decade‑old Xeons ballooned to roughly 400 s per prompt, far beyond our hard 60‑second timeout.

  • Budgets and data sensitivity ruled out cloud GPUs or frontier‑model APIs.

The mission: keep experiments moving without buying new hardware, ideally in a single afternoon.

2  Solution in Brief

Switch from the Ollama service to a fleet of llama.cpp workers fronted by lightweight Python servers. Turn every inference request into a MySQL‑backed job, return a job_id instantly, and let the llama.cpp fleet chew through the backlog asynchronously.
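For illustration, here is a minimal sketch of the producer shim, assuming Flask and mysql-connector-python; the route shape matches the client pattern in Section 6, but names like DB_CONFIG and the credentials are placeholders rather than a verbatim copy of our internal service.

import json

import mysql.connector
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder connection details; a real deployment pulls these from secrets.
DB_CONFIG = {"host": "mysql", "user": "llm", "password": "***", "database": "llm_jobs"}

@app.post("/jobs")
def enqueue_job():
    body = request.get_json(force=True)
    conn = mysql.connector.connect(**DB_CONFIG)
    try:
        cur = conn.cursor()
        # A job is just a row: prompt + params, born in the 'queued' state.
        cur.execute(
            "INSERT INTO jobs (prompt, params) VALUES (%s, %s)",
            (body["prompt"], json.dumps(body.get("params", {}))),
        )
        conn.commit()
        # Respond immediately; the caller polls later with this id.
        return jsonify({"job_id": cur.lastrowid}), 202
    finally:
        conn.close()

Nothing here blocks on inference, which is the whole point: the API call is a database insert, so it returns in milliseconds no matter how slow the CPUs are.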

3  High‑Level Architecture

┌──────────────┐  REST/gRPC  ┌───────┐  <100 ms   ┌──────────┐
│  Dev Tools   │────────────►│Ingress│───────────►│Producer  │
│  / Scripts   │             └───────┘            │(API shim)│
└──────────────┘                                  └──┬───────┘
                                                     │INSERT
                                               ┌─────▼─────┐
                                               │  MySQL    │
    GGUF weights on                            │  jobs     │
  shared PVC ►►►                               └────┬──────┘
                                                    │SELECT
                                               ┌────▼──────┐
                                               │ Worker    │
                                               │  Pods     │
                                               └───────────┘
  • PVC‑mounted weights: a single 27B GGUF model (Gemma 3) copied once to a ReadWriteMany storage class and mounted read‑only by every worker (see the sketch below).
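From a worker's point of view, loading the model is just a read of the shared mount. A sketch assuming the llama-cpp-python bindings; the mount path, file name, and tuning values are illustrative, not our exact configuration.

from llama_cpp import Llama

# Every worker reads the same GGUF file from the shared ReadWriteMany PVC,
# so nothing is copied into the pod at startup.
MODEL_PATH = "/models/gemma3-27b.gguf"  # illustrative mount path

llm = Llama(
    model_path=MODEL_PATH,
    n_ctx=4096,     # context window; size it to the prompts you actually send
    n_threads=16,   # match the cores available on the old Xeons
)

def run_inference(prompt: str, max_tokens: int = 512) -> str:
    # Blocking CPU inference; on this hardware a single call takes minutes.
    out = llm(prompt, max_tokens=max_tokens)
    return out["choices"][0]["text"]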

4  Minimal MySQL Schema

CREATE TABLE jobs (
  id          BIGINT AUTO_INCREMENT PRIMARY KEY,
  prompt      TEXT        NOT NULL,
  params      JSON        NOT NULL,
  status      ENUM('queued','running','done','error') DEFAULT 'queued',
  created_at  TIMESTAMP   DEFAULT CURRENT_TIMESTAMP,
  started_at  TIMESTAMP   NULL,
  finished_at TIMESTAMP   NULL,
  result      MEDIUMTEXT  NULL,
  error_msg   TEXT        NULL,
  INDEX(status),
  INDEX(created_at)
);

Workers claim jobs atomically with a guarded update; the claim succeeds only if exactly one row is affected (a worker‑loop sketch follows the query):

UPDATE jobs
SET    status = 'running', started_at = NOW()
WHERE  id = :id AND status = 'queued';
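Putting the claim together with inference, a worker loop might look like the sketch below. It assumes mysql-connector-python, the placeholder DB_CONFIG from earlier, and the hypothetical run_inference helper from the architecture sketch; the two-step claim (pick a candidate, then the guarded UPDATE) relies on checking the affected-row count so that races are lost gracefully.

import json
import time

import mysql.connector

DB_CONFIG = {"host": "mysql", "user": "llm", "password": "***", "database": "llm_jobs"}

def claim_and_run():
    conn = mysql.connector.connect(**DB_CONFIG)
    try:
        cur = conn.cursor(dictionary=True, buffered=True)
        cur.execute("SELECT id, prompt, params FROM jobs "
                    "WHERE status = 'queued' ORDER BY created_at LIMIT 1")
        job = cur.fetchone()
        if job is None:
            return False  # nothing queued right now

        # Guarded update: if another worker won the race, rowcount is 0.
        cur.execute("UPDATE jobs SET status = 'running', started_at = NOW() "
                    "WHERE id = %s AND status = 'queued'", (job["id"],))
        conn.commit()
        if cur.rowcount == 0:
            return True  # someone else claimed it; look for the next job

        try:
            params = json.loads(job["params"])
            # run_inference: the llama.cpp helper sketched in Section 3.
            result = run_inference(job["prompt"], max_tokens=params.get("max_tokens", 512))
            cur.execute("UPDATE jobs SET status = 'done', finished_at = NOW(), result = %s "
                        "WHERE id = %s", (result, job["id"]))
        except Exception as exc:
            cur.execute("UPDATE jobs SET status = 'error', finished_at = NOW(), error_msg = %s "
                        "WHERE id = %s", (str(exc), job["id"]))
        conn.commit()
        return True
    finally:
        conn.close()

while True:
    if not claim_and_run():
        time.sleep(5)  # queue is empty; back off briefly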

5  Performance Journey

Phase               | Latency to Produce One Answer | Result vs. 60 s Timeout | What Actually Changed?
--------------------|-------------------------------|-------------------------|-----------------------
Ollama w/ A100      | ≈ 2–3 s                       | ✅ well under            | GPU inference
Ollama w/ CPU only  | N/A                           | ❌ hard‑fail (504)       | request blocked by the proxy
Llama.cpp, CPU only | ≈ 400 s                       | ✅ passes (async)        | same compute time, but we return job_id instantly

Some back‑of‑the‑envelope math: at 400 s per request, one worker node can process about 108 requests in the 12‑hour overnight window. I have enough spare resources for roughly 4 worker nodes, which gives about 432 requests per night. After some optimizations to cut the request count (dropping jobs whose inputs haven't changed, or ones we can solve deterministically without the LLM), the nightly backlog comes down to about 500 jobs. Not ideal, but not the end of the world if some jobs don't finish until mid‑morning. We have moved from dead in the water to more or less fully functional.
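The same arithmetic, written out with the rough figures above (planning numbers, not measurements):

SECONDS_PER_REQUEST = 400          # observed CPU latency per prompt, roughly
OVERNIGHT_WINDOW_S  = 12 * 3600    # 12-hour overnight window
WORKER_NODES        = 4

per_worker = OVERNIGHT_WINDOW_S // SECONDS_PER_REQUEST   # 108 requests per worker
per_night  = per_worker * WORKER_NODES                    # 432 requests per night
backlog    = 500                                          # jobs left after pruning

print(per_worker, per_night, backlog - per_night)         # 108 432 68

At these rates the roughly 68 leftover jobs clear in a little under two hours across the four workers, which is where the mid‑morning estimate comes from.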

6  Client Pattern

  1. POST /jobs with prompt & params → gets { "job_id": 123 }.
  2. Poll /jobs/123 with exponential back‑off or open a WebSocket for pushes.
  3. Retrieve the result once status == "done"; a minimal client sketch follows.
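A sketch of that client pattern, assuming the requests library and a placeholder base URL; the endpoint shapes are the ones listed above, everything else is illustrative.

import time

import requests

BASE = "http://llm-queue.internal"  # placeholder URL for the producer shim

# 1. Submit the job and grab its id; this returns in milliseconds.
resp = requests.post(f"{BASE}/jobs",
                     json={"prompt": "Summarise this ticket ...", "params": {"max_tokens": 512}})
job_id = resp.json()["job_id"]

# 2. Poll with capped exponential back-off, since answers take minutes on CPU.
delay = 5
while True:
    job = requests.get(f"{BASE}/jobs/{job_id}").json()
    if job["status"] in ("done", "error"):
        break
    time.sleep(delay)
    delay = min(delay * 2, 300)

# 3. Use the result (or the error) once the job has finished.
if job["status"] == "done":
    print(job["result"])
else:
    print("job failed:", job["error_msg"])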

7  Lessons Learned

  1. Queues buy time—enough to finish experiments on museum‑piece hardware.
  2. Boring tech scales under stress. The setup is deliberately simple, and it was running smoothly within a few days.
  3. Momentum matters. Avoiding a multi‑day freeze kept the team (and me) motivated.

8  Epilogue: Back to Wasteful Bliss

A few weeks later the A100 returned, we flipped a switch, and real‑time responses were back. I had been gearing up to layer richer, multi‑step LLM workflows on top of the queue—then blissfully shelved that todo the moment the GPU started humming again. The queue lives on as cheap insurance.