
Production-ready · Multi-provider · Live in 2–3 weeks

Plug LLMs into your stack. Without the regret.

A senior-built integration layer that sits between your app and the LLM providers — with provider routing, fallback chains, rate-limiting, prompt versioning, observability and hard cost caps. So your CTO can sleep, and your finance team doesn't get a R250k surprise on the 5th.

Live in 2–3 weeks · Provider-agnostic · SOC 2-style controls · HTTP 200 · From R7,000

Get a quote · Free architecture review

app/summarise.ts

Live · 12 req/s

import { sitect } from "@sitect/llm";

// Route, retry, log, cap-spend — all handled.
const summary = await sitect.complete({
  prompt: "Summarise the support ticket below…",
  prefer: ["claude-3.5-sonnet", "gpt-4o"],
  fallback: "gpt-4o-mini",
  cache: "hash",
  budget: 0.05,   // max spend per call, in rand (R)
});

// → 200 OK · 380ms · R0.0024 · claude-3.5-sonnet
return summary.text;

Latency p95: 428ms

Spend MTD: R 12,340

Errors: 0.02%

Why DIY LLM integrations break

Four ways teams ship LLM features and regret it on the 1st.

We've cleaned up enough of these that we built a generic fix. Sound familiar?

Naked SDK calls

Direct openai.chat.completions.create() calls sprinkled across services. No retries. No timeouts. No observability. When OpenAI hiccups, your whole feature is down.

openai.RateLimitError: ...

Surprise R-bills

One developer writes a recursive prompt. One Friday afternoon. By Monday, the team's R8k/mo budget hits R85k — and finance discovers it on the credit-card statement.

"What's this R 73,400 from OpenAI?"

Vendor lock-in

SDK calls baked into 40 files. New provider drops a better model? You quote 6 weeks to migrate. By the time you ship, that "better model" has been deprecated.

// TODO: refactor for Claude (someday)

Prompts in code

Your best prompt is a 280-line string literal. Edit needs a deploy. PM wants A/B. Three versions in git, none in production. Nobody trusts the analytics.

const PROMPT_V3_FINAL_FINAL = `...`;

What sits between your app and the LLM

Eight things a proper gateway gives you.

Sitect's gateway is opinionated. These eight capabilities ship with every integration — you turn each one on or off in config, you don't have to write any of them.

Provider routing

Per-route preference order with automatic fallback when the primary returns 429 / 500 / timeout.

prefer: [...]
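
To make the routing idea concrete, here is a minimal sketch of a preference order with fallback. The names `Provider`, `isRetryable` and `completeWithFallback` are hypothetical and are not the @sitect/llm API. The sketch assumes provider errors carry a `status` field of 429, 500 or "timeout".

```typescript
// Hypothetical shapes for illustration; not the actual @sitect/llm API.
type Provider = {
  name: string;
  call: (prompt: string) => Promise<string>;
};

// Failures worth falling back on: rate limit, server error, timeout.
// Assumes errors expose a `status` field, as many provider SDKs do.
function isRetryable(err: unknown): boolean {
  const status = (err as { status?: unknown } | null)?.status;
  return status === 429 || status === 500 || status === "timeout";
}

// Try providers in preference order; rethrow non-retryable errors at once.
async function completeWithFallback(
  providers: Provider[],
  prompt: string,
): Promise<{ text: string; servedBy: string }> {
  let lastError: unknown = new Error("no providers configured");
  for (const provider of providers) {
    try {
      return { text: await provider.call(prompt), servedBy: provider.name };
    } catch (err) {
      if (!isRetryable(err)) throw err;
      lastError = err; // remember why this provider was skipped
    }
  }
  throw lastError;
}
```

A 400-style error is rethrown immediately in this sketch, on the assumption that a malformed request would fail the same way on every provider.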

Hard cost caps

Per-call, per-user, per-team and per-month spend ceilings. Hit the cap, the gateway throws — not the bill.

budget: 0.05
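
One way to picture "the gateway throws, not the bill" is a guard that checks the estimated cost before dispatch and records the actual cost afterwards. This is a sketch under assumptions: `createBudgetGuard` and its two caps are hypothetical names, and a real gateway would persist spend per user and per team rather than in memory.

```typescript
// Hypothetical budget guard; names and shapes are illustrative only.
type BudgetGuard = {
  check: (estimatedCost: number) => void;
  record: (actualCost: number) => void;
};

function createBudgetGuard(perCallCap: number, monthlyCap: number): BudgetGuard {
  let spentThisMonth = 0; // in-memory only; a real system would persist this
  return {
    // Call before dispatching: throws instead of letting the spend happen.
    check(estimatedCost) {
      if (estimatedCost > perCallCap) {
        throw new Error(`per-call cap exceeded: ${estimatedCost} > ${perCallCap}`);
      }
      if (spentThisMonth + estimatedCost > monthlyCap) {
        throw new Error(`monthly cap exceeded: cap is ${monthlyCap}`);
      }
    },
    // Call after the provider responds, with the actual cost.
    record(actualCost) {
      spentThisMonth += actualCost;
    },
  };
}
```

Checking before the call is the design choice that matters here: a cap enforced after the response only reports the overspend, it does not prevent it.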

Rate-limit shield

Token-bucket per-key + per-user limiting. Burst absorbed; sustained abuse politely 429'd. Never lets the upstream limit hit your customer.

rate: 60/min
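
For readers unfamiliar with token buckets, a minimal sketch follows. The class name and the injected clock (`nowMs`) are assumptions made for illustration and testability; this is not the gateway's implementation.

```typescript
// Minimal token bucket: starts full, refills continuously, one token per request.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerMs: number;

  constructor(capacity: number, refillPerMinute: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerMs = refillPerMinute / 60_000;
    this.tokens = capacity; // start full so short bursts are absorbed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should get a 429.
  tryTake(nowMs: number): boolean {
    const elapsedMs = nowMs - this.lastRefillMs;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedMs * this.refillPerMs);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

Per-key and per-user limiting could then be modelled as one bucket per key in a `Map<string, TokenBucket>`, with `capacity` setting the burst size and `refillPerMinute` the sustained rate.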

Smart cache

Semantic + exact-match caching for repeat queries. Cuts spend & latency on FAQ-style workloads by 60–80%.

cache: "semantic"
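
The exact-match half of the cache can be sketched in a few lines. Semantic matching needs embeddings and a similarity threshold, so it is deliberately left out. `ExactMatchCache`, its TTL and the whitespace normalisation are assumptions for illustration.

```typescript
// Exact-match cache sketch; semantic matching is out of scope here.
type CacheEntry = { text: string; expiresAtMs: number };

class ExactMatchCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Collapse whitespace so trivially different prompts share a key.
  private key(model: string, prompt: string): string {
    return JSON.stringify([model, prompt.trim().replace(/\s+/g, " ")]);
  }

  get(model: string, prompt: string, nowMs: number): string | undefined {
    const k = this.key(model, prompt);
    const hit = this.entries.get(k);
    if (!hit) return undefined;
    if (hit.expiresAtMs <= nowMs) {
      this.entries.delete(k); // expired: evict and report a miss
      return undefined;
    }
    return hit.text;
  }

  set(model: string, prompt: string, text: string, nowMs: number): void {
    this.entries.set(this.key(model, prompt), { text, expiresAtMs: nowMs + this.ttlMs });
  }
}
```

The model name is part of the key so that a response cached from one model is never served for a request routed to another.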

Prompt registry

Prompts versioned in Git or our UI. Edit a prompt, hot-reload without deploy. A/B test two versions. Roll back instantly.

prompt: "summary@v3"
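
A reference such as "summary@v3" implies a lookup from name and version to a template. The sketch below shows one plausible shape. The in-memory `Registry`, the `{{key}}` placeholder syntax and both function names are assumptions, not the registry's actual format.

```typescript
// Hypothetical in-memory registry: name -> version -> template.
type Registry = Record<string, Record<string, string>>;

// Split "summary@v3" into its name and version parts.
function parseRef(ref: string): { name: string; version: string } {
  const at = ref.lastIndexOf("@");
  if (at <= 0 || at === ref.length - 1) throw new Error(`bad prompt ref: ${ref}`);
  return { name: ref.slice(0, at), version: ref.slice(at + 1) };
}

// Resolve a versioned prompt and fill its {{key}} placeholders.
function renderPrompt(
  registry: Registry,
  ref: string,
  inputs: Record<string, string>,
): string {
  const { name, version } = parseRef(ref);
  const template = registry[name]?.[version];
  if (template === undefined) throw new Error(`unknown prompt: ${ref}`);
  // A missing input is an error rather than a silently blank prompt.
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    const value = inputs[key];
    if (value === undefined) throw new Error(`missing input: ${key}`);
    return value;
  });
}
```

With this shape, a rollback is a change of reference from "summary@v3" to "summary@v2" with no code change at the call-site.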

Observability

Every call traced with tokens in, tokens out, cost, model, latency, user. Exports OpenTelemetry traces for Grafana / Datadog.

span.set(...)
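
As a rough illustration of what a per-call trace record enables, the sketch below aggregates records into total cost and p95 latency using the nearest-rank method. The `CallRecord` fields mirror the list above, but the type and function names are hypothetical, and a real deployment would export spans rather than aggregate in-process.

```typescript
// Hypothetical per-call record; field names are illustrative.
type CallRecord = {
  model: string;
  user: string;
  tokensIn: number;
  tokensOut: number;
  costRand: number;
  latencyMs: number;
};

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summariseCalls(records: CallRecord[]) {
  return {
    calls: records.length,
    totalCostRand: records.reduce((sum, r) => sum + r.costRand, 0),
    p95LatencyMs: percentile(records.map((r) => r.latencyMs), 95),
  };
}
```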

PII redaction

Names, emails, ID numbers, phone numbers stripped before the prompt leaves your tenant. Aligned with POPIA Section 19 (security safeguards) by default.

redact: ["pii"]
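
Pattern-based redaction for the structured identifiers might look like the sketch below. The regular expressions are simplified assumptions (13-digit South African ID numbers, local or +27 phone formats) and would miss many real-world variants. Names are not handled at all here, since they typically need entity recognition rather than a regex.

```typescript
// Simplified patterns for illustration; order matters (IDs before phones).
const PII_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "ID_NUMBER", pattern: /\b\d{13}\b/g }, // SA ID numbers are 13 digits
  { label: "PHONE", pattern: /(?:\+27|\b0)\d{2}[ -]?\d{3}[ -]?\d{4}\b/g },
];

// Replace each match with a bracketed label so the prompt stays readable.
function redactPii(text: string): string {
  return PII_PATTERNS.reduce(
    (out, { label, pattern }) => out.replace(pattern, `[${label}]`),
    text,
  );
}
```

Replacing with a label instead of deleting keeps the sentence structure intact, which tends to matter for the quality of the model's answer.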

Eval suite

Regression tests for prompts. Catch quality drops before customers do. Plug into CI — fail the build on a model swap that hurts your golden set.

eval run --suite
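
A golden-set regression check can be as small as the sketch below: run each case through the model, check the output for required phrases, and fail when the pass rate drops below a threshold. `GoldenCase`, `runEval` and the substring criterion are assumptions for illustration. Real eval suites often use graded or model-scored checks instead.

```typescript
// Hypothetical golden-set runner; the substring check is a deliberately simple criterion.
type GoldenCase = { input: string; mustInclude: string[] };
type Model = (input: string) => Promise<string>;

async function runEval(model: Model, cases: GoldenCase[], minPassRate: number) {
  let passed = 0;
  const failures: string[] = [];
  for (const c of cases) {
    const output = (await model(c.input)).toLowerCase();
    const ok = c.mustInclude.every((s) => output.includes(s.toLowerCase()));
    if (ok) passed++;
    else failures.push(c.input); // keep failing inputs for the CI report
  }
  const passRate = cases.length === 0 ? 1 : passed / cases.length;
  return { passRate, failures, ok: passRate >= minPassRate };
}
```

In CI, a build step could exit non-zero whenever `ok` is false, which is what would fail the build on a model swap that hurts the golden set.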

Before · after

From 120 lines of boilerplate to one clean call.

Same feature. Same provider. Same guarantees. Two very different on-call experiences.

DIY · before sitect · 120 LOC · 3 deps

// Direct OpenAI call with manual retries
async function summarise(text) {
  let attempt = 0;
  while (attempt < 3) {
    try {
      const r = await openai.chat.completions
        .create({ model: "gpt-4o", messages: [...] });

      // no cost tracking · no logging
      // no fallback to claude · no cache
      return r.choices[0].message.content;
    } catch (e) {
      if (e.status === 429) {
        await sleep(attempt * 1000);
        attempt++;
      } else throw e;   // dies on 500s
    }
  }
  throw new Error("out of retries");
}
// repeat ×40 across the codebase…

With sitect gateway · 5 LOC · 1 dep

// All of the above — built in.
async function summarise(text) {
  return (await sitect.complete({
    prompt: "summary@v3",
    inputs: { text },
    prefer: ["claude-3.5-sonnet", "gpt-4o"],
    cache: "semantic",
    budget: 0.05,
  })).text;
}

// → Auto-retries · auto-fallback · cached · logged
// → Cost-capped · PII-redacted · OpenTelemetry-traced
// → A/B-able · versioned · eval-tested
// ✓ 200 OK · 380ms · R0.0024

One import line. One file changed per call-site. The other 39 call-sites migrate the same way.

Every major provider

One SDK. Twelve providers. Switch with a config change.

The gateway abstracts over every major LLM provider. Same API, different model — pick by cost, latency, accuracy or data residency.

OpenAI (Live): gpt-4o · o1 · gpt-4o-mini
Anthropic (Live): claude-3.5-sonnet · haiku
Google (Live): gemini-2.0 · gemini-flash
AWS Bedrock (Live): claude · titan · llama
Azure OpenAI (Live): gpt-4o (EU/SA regions)
Mistral (Live): mistral-large · codestral
Grok / xAI (Live): grok-2
Cohere (Live): command-r-plus · embed
Replicate (Beta): open-source LLMs
LLaMA / Ollama (Live): on-prem · self-hosted
Together AI (Live): deepseek · mixtral
Hugging Face (Beta): custom endpoints

How it sits in your stack

One gateway. Twelve providers. Zero lock-in.

The gateway is a thin service you run alongside your existing app: as a Docker image, a K8s manifest, or on our managed cloud. It sits between your app and the providers and takes its configuration as code.

Your app

Any language, any framework

Node · Laravel · Django · Go · Rust · .NET · Mobile · Edge function

⚡ Sitect gateway

One API, eight features

Auth + key vault: Per-team API keys
Rate & budget: Cap per call/user/team
Semantic cache: 60–80% spend cut
Provider router: Prefer + fallback
PII redact: POPIA-compliant
Trace + log: OpenTelemetry

LLM providers

All major + self-hosted

OpenAI · Anthropic · Google · AWS · Azure · Mistral · Cohere · xAI

Deploys as one Docker image (~80MB), Kubernetes-ready, or use our SA-region cloud. Same SDK across JS/TS, Python, PHP, Go and .NET. OpenAI-compatible drop-in mode supported.

Know what your LLM is doing

Live spend, latency & errors — in one pane.

Every call streams metrics into the dashboard your CTO checks before the standup. No more "is the model down?" pinging in Slack. No more end-of-month bill surprises.

Real-time cost — see today's, this week's, this month's spend per team, per route, per user.
P50 / p95 / p99 latency — broken down by provider and route, alertable on drift.
Error budget — SLO-style burn rate; pages oncall when you're chewing through it.
Spend forecast — projected month-end based on the last 7 days. Catch runaway prompts on Tuesday.
Audit log search — find any past call by user, route, model or substring.
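
The spend forecast in the list above can be approximated with simple arithmetic: month-to-date spend plus the recent daily average times the days remaining. The formula and the function name are assumptions, since the page only says the projection is based on the last 7 days.

```typescript
// Assumed formula: MTD spend + (average of recent daily spend × days remaining).
function forecastMonthEnd(
  spendMtd: number,
  recentDailySpend: number[], // e.g. the last 7 days, one entry per day
  daysRemaining: number,
): number {
  if (recentDailySpend.length === 0) return spendMtd; // nothing to extrapolate from
  const dailyAvg =
    recentDailySpend.reduce((sum, day) => sum + day, 0) / recentDailySpend.length;
  return spendMtd + dailyAvg * daysRemaining;
}
```

For example, with R 12,340 spent so far, an illustrative R 700 per day over the last week and 14 days left, the projection is 12,340 + 700 × 14 = R 22,140.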

LLM gateway · production

last refresh · 2026-05-17 14:23 SAST

healthy

Spend MTD: R 12,340 · +6.2% vs forecast

p95 latency: 428ms · ↓ 12% w/w

Error rate: 0.02% · 2 fallbacks/min

Requests · last 24h · peak 18:30 · 412 rps

claude-3.5-sonnet: 62% · 420ms

gpt-4o: 28% · 510ms

gemini-flash: 9% · 380ms · degraded

gpt-4o-mini · cache: 1% · 12ms

What's in the build

Six artefacts. All yours, all in your repos.

No SaaS lock-in. The gateway, the SDK, the dashboard: all licensed to you under an open-source license, hosted on your infra, in your registry.

01 · Gateway service

Deployable Docker image

The core gateway. Stateless, horizontally scalable, ships with a Helm chart and Terraform module. Sub-100MB image.

ghcr.io/sitect/llm-gw

02 · Client SDK

One SDK, every language

JS/TS, Python, PHP, Go and .NET clients — typed, tested, with auto-retries and traces wired in. Drop-in OpenAI-compatible mode for zero-migration.

@sitect/llm

03 · Admin dashboard

Spend, traces & prompts

Web UI for the dashboard, prompt registry editor, A/B test config, eval runs and audit log search. Hosted on your domain.

Next.js · self-host

04 · Prompt library

Versioned, A/B-able, hot-reloaded

Prompts moved out of code and into a registry. Version control, draft/prod separation, A/B traffic split, instant rollback.

Git-backed

05 · Eval suite

Golden-set regression tests

30+ test cases tuned for your domain. Runs on every prompt change and every provider swap. CI-integrated.

npm run eval

06 · Runbook + on-call

What to do when things break

Markdown runbook for the top 12 incident types. Sitect on-call hand-off for the first 30 days, then your team owns it.

Notion + PagerDuty

From kick-off to first call

2–3 weeks. 4 phases. One PR.

The final cut-over is usually a single PR your team reviews. Most of the engagement is in the audit and the architecture review — the wiring is the easy part.

Week 1 · 3 hrs

Audit

Map every existing LLM call-site. Identify retry-gaps, missing limits, runaway prompts, lock-in points.

Week 1–2 · 2 hrs

Architecture

Pick deployment shape (your K8s, your VM, our cloud). Define routes, budgets, redaction rules, prompts to migrate.

Week 2 · 1 hr

Wire & ship

Gateway deployed, SDK installed, call-sites migrated. Cut-over PR with feature flags so you roll in 10%/50%/100%.
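
A staged 10%/50%/100% cut-over is commonly done with deterministic bucketing, so the same user stays on the same path as the percentage grows. The sketch below uses an FNV-1a hash over the user ID. The function names and the choice of hash are assumptions, not a description of the flags used in an actual engagement.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units: stable and dependency-free.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash >>> 0;
}

// True if this user falls inside the current rollout percentage (0 to 100).
function inRollout(userId: string, rolloutPercent: number): boolean {
  return fnv1a(userId) % 100 < rolloutPercent;
}
```

Because the bucket depends only on the user ID, everyone included at 10% is still included at 50%, which keeps behaviour stable for early users during the ramp.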

Week 3 · 1 hr

Tune & train

Eval suite written, your team trained on the dashboard. Sitect on-call hand-off for 30 days post-launch.

Three tiers · Fixed build · Self-hosted

Pick the scope. We deliver in 2–3 weeks.

Build fee is fixed in writing. Once shipped, you own the code — no per-call platform fee, no SaaS subscription, no lock-in. LLM provider tokens billed at provider cost direct to your accounts.

Single-provider

Starter gateway

R7,000

per project · ex 15% VAT

One LLM provider (OpenAI or Anthropic)
Retries · rate-limit · cost cap
1 SDK (TS or Python)
Basic dashboard (spend + errors)
5 prompts migrated to registry
30 days post-launch support

Choose Starter →

Multi-provider gateway

R17,000

per project · ex 15% VAT

Up to 4 LLM providers
Fallback + provider routing
Semantic cache + eval suite
2 SDKs (TS, Python, PHP, Go — pick 2)
Full dashboard + audit log
Up to 20 prompts migrated
60 days post-launch + runbook

Choose Multi-provider →

Enterprise

Platform engagement

R36,000+

scoped per engagement · ex VAT

All providers + on-prem LLMs
SSO · audit · DSAR exports
All 5 SDKs
SLA + dedicated PM
Custom evals on your golden set
Multi-tenant key vault
90 days hyper-care + QBR

Talk to us →

Before you book

Common questions from CTOs.

If yours isn't here, ping us — we'll answer with a code sample, not a sales call.

Is this a SaaS we have to pay forever, or do we own the code?

You own everything. Source code, Docker images, prompt registry, dashboard — all delivered into your repos under MIT-style license. Run it on your infra. We have an optional managed cloud (SA-region) if you don't want to self-host, but it's an opt-in, not a lock-in.

How is this different from LiteLLM / OpenRouter / Helicone?

Those are excellent point tools. LiteLLM is provider routing. OpenRouter is a hosted multi-provider gateway. Helicone is observability. The Sitect gateway is opinionated and bundles all three plus PII redaction, prompt registry, eval suite and an SA-tuned dashboard. You'll see common building blocks from those tools under the hood — we don't reinvent for the sake of reinventing.

What about latency? Does the gateway add overhead?

Median overhead is ~8ms per call (auth + dispatch). Cached responses return in ~10ms total. We deploy alongside your app in the same VPC/region, so network hops are negligible. A cold start on Lambda is ~80ms but most clients run on long-lived containers.

Can we keep using the OpenAI SDK directly?

Yes — the gateway supports OpenAI's API shape as a drop-in. Point your existing openai SDK at https://gw.your-domain.com/v1 and it'll route through us with no code change. You miss out on some advanced features (prompt registry, semantic cache) until you adopt the native SDK, but you get retries, cost caps and observability for free on day one.
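
In practice, "OpenAI's API shape" means a POST to a `/chat/completions` path under the base URL, with a bearer token and a `model` plus `messages` body. The sketch below builds such a request without any SDK. `buildChatRequest` is a hypothetical helper, and the gateway URL is the placeholder from the answer above.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Build an OpenAI-style chat completion request against any compatible base URL.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`, // tolerate a trailing slash
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}
```

The result can be passed straight to `fetch(url, init)`. With the official openai Node SDK, the equivalent is usually just setting the `baseURL` option on the client constructor, leaving existing call-sites untouched.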

POPIA — does customer data leave South Africa?

It depends on the provider you pick. Azure OpenAI (EU/SA regions), AWS Bedrock (af-south-1 / eu-west-1), and self-hosted Llama/Mistral keep data in-region. OpenAI, Anthropic and Google route via US. The gateway labels each provider with its data-residency and you can pin routes to SA-only providers if your DPA requires it. PII redaction strips identifiers before any cross-border call.
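
Pinning routes by data residency amounts to filtering the preference list before routing. A minimal sketch, assuming each provider carries a residency label: the `Residency` values, `ProviderInfo` and `pinToRegions` are hypothetical names, and the labels in any real deployment would come from the gateway's own provider metadata.

```typescript
// Hypothetical residency labels; real labels would come from provider metadata.
type Residency = "ZA" | "EU" | "US";
type ProviderInfo = { name: string; residency: Residency };

// Keep only providers whose data residency is allowed, preserving preference order.
function pinToRegions(preferred: ProviderInfo[], allowed: Residency[]): ProviderInfo[] {
  const eligible = preferred.filter((p) => allowed.includes(p.residency));
  if (eligible.length === 0) {
    // Failing closed: better to reject the call than to route it cross-border.
    throw new Error("no provider satisfies the residency policy");
  }
  return eligible;
}
```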

How do you handle prompt injection / jailbreaks?

Three layers: (1) Input sanitisation — strip suspicious patterns before the prompt goes out; (2) Output moderation — content filter on responses, with a configurable threshold; (3) Audit log — every request retained 90 days, so you can replay any incident. Plus a kill-switch to disable any route in 1 click if something goes wrong.

What does the run cost look like in production?

The gateway itself is essentially free to run — one ~80MB container, <1 vCPU and <512MB RAM for moderate traffic. The variable cost is the LLM tokens, billed directly by your provider. Most clients see net spend drop after migration because semantic cache + provider routing more than pay for the gateway compute.

What if a provider deprecates a model mid-project?

Common headache. With the gateway it's a 2-line config change to swap the underlying model — your call-sites don't change. We also run the eval suite automatically on every model swap so you catch quality regressions before customers do.

Get a free architecture review. You'll learn something, even if you don't hire us.
