060 279 5587 info@sitect.co.za 139 Davies Street, Doornfontein, Johannesburg, 2001 Gauteng, SA
Production-ready · Multi-provider · Live in 2–3 weeks

Plug LLMs into your stack. Without the regret.

A senior-built integration layer that sits between your app and the LLM providers — with provider routing, fallback chains, rate-limiting, prompt versioning, observability and hard cost caps. So your CTO can sleep, and your finance team doesn't get a R250k surprise on the 5th.

Live in 2–3 weeks · Provider-agnostic · SOC 2-style controls · HTTP 200 · From R7,000

Four ways teams ship LLM features and regret it on the 1st.

We've cleaned up enough of these that we built a generic fix. Sound familiar?

Naked SDK calls

Direct openai.chat.completions.create() calls sprinkled across services. No retries. No timeouts. No observability. When OpenAI hiccups, your whole feature is down.

openai.RateLimitError: ...

Surprise R-bills

One developer writes a recursive prompt. One Friday afternoon. By Monday, the team's R8k/mo budget hits R85k — and finance discovers it on the credit-card statement.

"What's this R 73,400 from OpenAI?"

Vendor lock-in

SDK calls baked into 40 files. New provider drops a better model? You quote 6 weeks to migrate. By the time you ship, that "better model" has been deprecated.

// TODO: refactor for Claude (someday)

Prompts in code

Your best prompt is a 280-line string literal. Edit needs a deploy. PM wants A/B. Three versions in git, none in production. Nobody trusts the analytics.

const PROMPT_V3_FINAL_FINAL = `...`;

Eight things a proper gateway gives you.

Sitect's gateway is opinionated. These eight capabilities ship with every integration. You turn each one on or off in config; you don't have to write any of them.

Provider routing

Per-route preference order with automatic fallback when the primary returns a 429 or a 500, or times out.

prefer: [...]
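
To make the fallback idea concrete, here is a minimal sketch of a preference-order router. It is an illustration only, not the gateway's actual code: `completeWithFallback`, the `providers` map and the error shape (`status`, `code`) are all assumptions made for this example.

```javascript
// Hypothetical sketch: names and error shapes are illustrative assumptions.
const RETRYABLE_STATUS = new Set([429, 500]);

// prefer: model names in preference order.
// providers: map of model name -> async function that performs the call.
async function completeWithFallback(prefer, providers, request) {
  const errors = [];
  for (const model of prefer) {
    const call = providers[model];
    if (!call) continue; // not configured, try the next preference
    try {
      return { model, text: await call(request) };
    } catch (err) {
      const retryable =
        RETRYABLE_STATUS.has(err.status) || err.code === "ETIMEDOUT";
      if (!retryable) throw err; // e.g. a 400 will not get better elsewhere
      errors.push(`${model}: ${err.status ?? err.code}`);
    }
  }
  throw new Error(`all providers failed (${errors.join(", ")})`);
}
```

Only errors that another provider could plausibly recover from move the request down the chain; anything else surfaces immediately.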

Hard cost caps

Per-call, per-user, per-team and per-month spend ceilings. Hit the cap, the gateway throws — not the bill.

budget: 0.05
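
As a rough illustration of "the gateway throws, not the bill", the sketch below checks a cost estimate against a per-call cap and against running totals per scope before anything is sent upstream. `BudgetGuard`, `BudgetExceededError` and the scope keys such as `"user:42"` are hypothetical names, and the amounts are unitless placeholders.

```javascript
// Hypothetical sketch of spend ceilings; class and scope names are illustrative.
class BudgetExceededError extends Error {}

class BudgetGuard {
  // perCall: ceiling for one call. ceilings: { "user:42": 0.10, "team:search": 50 }
  constructor({ perCall, ceilings }) {
    this.perCall = perCall;
    this.ceilings = ceilings;
    this.spent = new Map(); // scope -> running total
  }

  // Call before dispatching upstream. Throws instead of letting the spend happen.
  reserve(scopes, estimatedCost) {
    if (estimatedCost > this.perCall) {
      throw new BudgetExceededError(
        `estimate ${estimatedCost} exceeds per-call cap ${this.perCall}`
      );
    }
    for (const scope of scopes) {
      const ceiling = this.ceilings[scope];
      const used = this.spent.get(scope) ?? 0;
      if (ceiling !== undefined && used + estimatedCost > ceiling) {
        throw new BudgetExceededError(`${scope} would exceed its cap of ${ceiling}`);
      }
    }
    // Record the spend only once every scope has passed its check.
    for (const scope of scopes) {
      this.spent.set(scope, (this.spent.get(scope) ?? 0) + estimatedCost);
    }
  }
}
```

Checking every scope before recording anything means a rejected call never counts against a user's or team's total.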

Rate-limit shield

Token-bucket per-key + per-user limiting. Burst absorbed; sustained abuse politely 429'd. Never lets the upstream limit hit your customer.

rate: 60/min
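
For readers who have not met a token bucket before, a minimal version looks roughly like this. It is a sketch under assumptions: the class name, the constructor options and the injectable clock are illustrative, and a real deployment would keep one bucket per key and per user in shared storage.

```javascript
// Minimal token bucket sketch. A limit of 60/min maps to capacity 60, refill 1 per second.
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = Date.now }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now; // injectable clock, handy for tests
    this.tokens = capacity; // start full so an initial burst is absorbed
    this.last = now();
  }

  // Returns true if the request may proceed, false if it should get a 429.
  tryRemove(count = 1) {
    const t = this.now();
    const elapsedSeconds = (t - this.last) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.last = t;
    if (this.tokens < count) return false;
    this.tokens -= count;
    return true;
  }
}
```

The capacity sets how large a burst is tolerated, while the refill rate sets the sustained throughput.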

Smart cache

Semantic + exact-match caching for repeat queries. Cuts spend & latency on FAQ-style workloads by 60–80%.

cache: "semantic"
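
The difference between exact-match and semantic caching is easiest to see in code. The sketch below is hypothetical: `SemanticCache`, the 0.9 similarity threshold and the caller-supplied embedding vectors are assumptions, and a production cache would use a real embedding model and a vector index rather than a linear scan.

```javascript
// Hypothetical sketch: exact match on a normalised key, then cosine similarity.
function normalise(prompt) {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

class SemanticCache {
  constructor(threshold = 0.9) {
    this.threshold = threshold;
    this.entries = [];
  }

  set(prompt, embedding, value) {
    this.entries.push({ key: normalise(prompt), embedding, value });
  }

  // Returns a cached value, or undefined on a miss.
  get(prompt, embedding) {
    const key = normalise(prompt);
    let best;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      if (entry.key === key) return entry.value; // exact match wins outright
      const score = cosine(entry.embedding, embedding);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best ? best.value : undefined;
  }
}
```

The threshold is the main tuning knob: too low and users get answers to a different question, too high and the cache rarely hits.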

Prompt registry

Prompts versioned in Git or our UI. Edit a prompt, hot-reload without deploy. A/B test two versions. Roll back instantly.

prompt: "summary@v3"
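
A reference like `summary@v3` implies a lookup by name and version followed by template rendering. The sketch below shows one plausible shape. The in-memory `store`, the `{{text}}` placeholder syntax and the function name `resolvePrompt` are assumptions for illustration, not the registry's documented format.

```javascript
// Hypothetical registry contents; a real registry would be Git- or UI-backed.
const store = {
  summary: {
    v2: "Summarise this text:\n{{text}}",
    v3: "Summarise the following text in three sentences:\n{{text}}",
  },
};

// Resolves "name@version" and fills {{placeholders}} from inputs.
function resolvePrompt(ref, inputs, registry = store) {
  const [name, version] = ref.split("@");
  const template = registry[name]?.[version];
  if (template === undefined) throw new Error(`unknown prompt: ${ref}`);
  return template.replace(/\{\{(\w+)\}\}/g, (_, key) => {
    if (!(key in inputs)) throw new Error(`missing input "${key}" for ${ref}`);
    return String(inputs[key]);
  });
}
```

Because call-sites only hold the reference string, rolling back means pointing the route at `summary@v2` rather than redeploying code.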

Observability

Every call traced with tokens in, tokens out, cost, model, latency and user. Exports OpenTelemetry for Grafana / Datadog.

span.set(...)
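
As an illustration of what one traced call might carry, the sketch below derives cost and latency from raw counters and returns a flat attribute map. The attribute names, the `traceCall` function and the prices are all hypothetical; they are not official OpenTelemetry conventions or real provider prices.

```javascript
// Hypothetical per-million-token prices; real values come from your provider's price list.
const PRICE_PER_MILLION = { "example-model": { input: 2, output: 10 } };

// Turns raw call data into span-style attributes (names are illustrative).
function traceCall(call, prices = PRICE_PER_MILLION) {
  const price = prices[call.model];
  if (!price) throw new Error(`no price configured for ${call.model}`);
  const cost =
    (call.tokensIn * price.input + call.tokensOut * price.output) / 1e6;
  return {
    "llm.model": call.model,
    "llm.user": call.user,
    "llm.tokens_in": call.tokensIn,
    "llm.tokens_out": call.tokensOut,
    "llm.cost": cost,
    "llm.latency_ms": call.endedAt - call.startedAt,
  };
}
```

Computing cost at trace time is what makes per-user and per-route spend queries possible later without re-reading provider invoices.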

PII redaction

Names, emails, ID numbers, phone numbers stripped before the prompt leaves your tenant. POPIA Section 19 by default.

redact: ["pii"]
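
A pattern-based pass is the simplest form of redaction. The sketch below covers emails, 13-digit South African ID numbers and common South African phone formats. It is a deliberately simple assumption of how such a step might look: the patterns and placeholder labels are illustrative, names are not handled (they usually need entity recognition rather than regexes), and passing this check is not a compliance guarantee.

```javascript
// Hypothetical redaction rules, applied in order. Patterns are illustrative, not exhaustive.
const RULES = [
  { label: "[EMAIL]", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  // 13 consecutive digits, the length of a South African ID number.
  { label: "[ID_NUMBER]", pattern: /\b\d{13}\b/g },
  // Local (0xx xxx xxxx) or international (+27 xx xxx xxxx) formats.
  { label: "[PHONE]", pattern: /(?<![\d+])(?:\+27[\s-]?|0)\d{2}[\s-]?\d{3}[\s-]?\d{4}\b/g },
];

// Replaces every match with its placeholder before the prompt is sent out.
function redact(text) {
  return RULES.reduce(
    (out, { label, pattern }) => out.replace(pattern, label),
    text
  );
}
```

ID numbers are replaced before phone numbers so that a phone pattern can never match a fragment of a longer digit run.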

Eval suite

Regression tests for prompts. Catch quality drops before customers do. Plug into CI — fail the build on a model swap that hurts your golden set.

eval run --suite
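
A golden-set check can be as small as the sketch below. It assumes the model outputs have already been collected, and it scores each case by required substrings. `runGoldenSet`, the `mustInclude` field and the 0.9 pass-rate threshold are hypothetical; real suites often add graded or model-based scoring on top.

```javascript
// Hypothetical golden-set scorer: each case lists substrings its output must contain.
function runGoldenSet(cases, outputs, minPassRate = 0.9) {
  const failed = cases
    .filter((c) => {
      const text = (outputs[c.id] ?? "").toLowerCase();
      return !c.mustInclude.every((s) => text.includes(s.toLowerCase()));
    })
    .map((c) => c.id);
  const passRate =
    cases.length === 0 ? 1 : (cases.length - failed.length) / cases.length;
  return { passRate, failed, ok: passRate >= minPassRate };
}
```

In CI, a wrapper script would set a non-zero exit code when `ok` is false, which is what fails the build on a model swap that hurts the golden set.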

From 120 lines of boilerplate to one clean call.

Same feature. Same provider. Same guarantees. Two very different on-call experiences.

DIY · before Sitect · 120 LOC · 3 deps
// Direct OpenAI call with manual retries
async function summarise(text) {
  let attempt = 0;
  while (attempt < 3) {
    try {
      const r = await openai.chat.completions
        .create({ model: "gpt-4o", messages: [...] });

      // no cost tracking · no logging
      // no fallback to claude · no cache
      return r.choices[0].message.content;
    } catch (e) {
      if (e.status === 429) {
        await sleep(attempt * 1000);
        attempt++;
      } else throw e;   // dies on 500s
    }
  }
  throw new Error("out of retries");
}
// repeat ×40 across the codebase…

With Sitect gateway · 5 LOC · 1 dep
// All of the above — built in.
async function summarise(text) {
  return (await sitect.complete({
    prompt: "summary@v3",
    inputs: { text },
    prefer: ["claude-3.5-sonnet", "gpt-4o"],
    cache: "semantic",
    budget: 0.05,
  })).text;
}

// → Auto-retries · auto-fallback · cached · logged
// → Cost-capped · PII-redacted · OpenTelemetry-traced
// → A/B-able · versioned · eval-tested
// ✓ 200 OK · 380ms · R$0.0024
One line of import. One file changed per call-site. All eight gateway features come free.

One SDK. Twelve providers. Switch with a config change.

The gateway abstracts over every major LLM provider. Same API, different model — pick by cost, latency, accuracy or data residency.

  • OpenAI (Live): gpt-4o · o1 · gpt-4o-mini
  • Anthropic (Live): claude-3.5-sonnet · haiku
  • Google (Live): gemini-2.0 · gemini-flash
  • AWS Bedrock (Live): claude · titan · llama
  • Azure OpenAI (Live): gpt-4o (EU/SA regions)
  • Mistral (Live): mistral-large · codestral
  • Grok / xAI (Live): grok-2
  • Cohere (Live): command-r-plus · embed
  • Replicate (Beta): open-source LLMs
  • LLaMA / Ollama (Live): on-prem · self-hosted
  • Together AI (Live): deepseek · mixtral
  • Hugging Face (Beta): custom endpoints

How it sits in your stack

One gateway. Twelve providers. Zero lock-in.

The gateway is a thin service you run alongside your existing app: Docker image, K8s manifest, or our managed cloud. It sits between your app and the providers and takes its configuration as code.

Your app

Any language, any framework
Node · Laravel · Django · Go · Rust · .NET · Mobile · Edge function

⚡ Sitect gateway

One API, eight features

  • Auth + key vault: per-team API keys
  • Rate & budget: cap per call/user/team
  • Semantic cache: 60–80% spend cut
  • Provider router: prefer + fallback
  • PII redact: POPIA-compliant
  • Trace + log: OpenTelemetry

LLM providers

All major + self-hosted
OpenAI · Anthropic · Google · AWS · Azure · Mistral · Cohere · xAI

Deploys as one Docker image (~80MB), Kubernetes-ready, or use our SA-region cloud. Same SDK across JS/TS, Python, PHP, Go and .NET. OpenAI-compatible drop-in mode supported.

Live spend, latency & errors — in one pane.

Every call streams metrics into the dashboard your CTO checks before the standup. No more "is the model down?" pinging in Slack. No more end-of-month bill surprises.

  • Real-time cost — see today's, this week's, this month's spend per team, per route, per user.
  • p50 / p95 / p99 latency — broken down by provider and route, alertable on drift.
  • Error budget — SLO-style burn rate; pages on-call when you're chewing through it.
  • Spend forecast — projected month-end based on the last 7 days. Catch runaway prompts on Tuesday.
  • Audit log search — find any past call by user, route, model or substring.
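
The spend forecast in the list above comes down to simple arithmetic. One plausible reading, sketched here as an assumption rather than the dashboard's actual formula, is: spend so far, plus the average of the last 7 days multiplied by the days remaining. The function name and input shape are hypothetical.

```javascript
// Hypothetical forecast: month-to-date spend plus the recent daily average
// projected over the remaining days. dailySpend holds one number per elapsed day.
function forecastMonthEnd(dailySpend, daysInMonth) {
  if (dailySpend.length === 0) return 0;
  const sum = (xs) => xs.reduce((total, x) => total + x, 0);
  const recent = dailySpend.slice(-7); // last 7 days, or fewer early in the month
  const dailyAverage = sum(recent) / recent.length;
  const remainingDays = Math.max(0, daysInMonth - dailySpend.length);
  return sum(dailySpend) + dailyAverage * remainingDays;
}
```

Because only the last 7 days feed the average, a runaway prompt that starts mid-month pulls the projection up within a day or two instead of being diluted by earlier quiet weeks.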

Six artefacts. All yours, all in your repos.

No SaaS lock-in. The gateway, the SDK and the dashboard are all licensed to you under an open-source license, hosted on your infra, in your registry.

01 · Gateway service

Deployable Docker image

The core gateway. Stateless, horizontally scalable, ships with a Helm chart and Terraform module. Sub-100MB image.

ghcr.io/sitect/llm-gw

02 · Client SDK

One SDK, every language

JS/TS, Python, PHP, Go and .NET clients — typed, tested, with auto-retries and traces wired in. Drop-in OpenAI-compatible mode for zero-migration.

@sitect/llm

03 · Admin dashboard

Spend, traces & prompts

Web UI for the dashboard, prompt registry editor, A/B test config, eval runs and audit log search. Hosted on your domain.

Next.js · self-host

04 · Prompt library

Versioned, A/B-able, hot-reloaded

Prompts moved out of code and into a registry. Version control, draft/prod separation, A/B traffic split, instant rollback.

Git-backed

05 · Eval suite

Golden-set regression tests

30+ test cases tuned for your domain. Runs on every prompt change and every provider swap. CI-integrated.

npm run eval

06 · Runbook + on-call

What to do when things break

Markdown runbook for the top 12 incident types. Sitect on-call hand-off for the first 30 days, then your team owns it.

Notion + PagerDuty

2–3 weeks. 4 phases. One PR.

The final cut-over is usually a single PR your team reviews. Most of the engagement is in the audit and the architecture review — the wiring is the easy part.

01
Week 1 · 3 hrs

Audit

Map every existing LLM call-site. Identify retry-gaps, missing limits, runaway prompts, lock-in points.

02
Week 1–2 · 2 hrs

Architecture

Pick deployment shape (your K8s, your VM, our cloud). Define routes, budgets, redaction rules, prompts to migrate.

03
Week 2 · 1 hr

Wire & ship

Gateway deployed, SDK installed, call-sites migrated. Cut-over PR with feature flags so you roll in 10%/50%/100%.
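
A 10%/50%/100% roll-in is commonly done with deterministic bucketing, so the same user stays on the same path as the percentage grows. The sketch below is one way to do that, not a description of the flags used in the engagement: `bucketOf` and `useGateway` are hypothetical names, and the hash is a plain 32-bit FNV-1a.

```javascript
// Maps an id to a stable bucket in 0..99 using 32-bit FNV-1a.
function bucketOf(id) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % 100;
}

// True if this user should be routed through the gateway at the given rollout stage.
function useGateway(userId, rolloutPercent) {
  return bucketOf(userId) < rolloutPercent;
}
```

Since a user's bucket never changes, anyone included at 10% is still included at 50% and 100%, which keeps before-and-after comparisons clean.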

04
Week 3 · 1 hr

Tune & train

Eval suite written, your team trained on the dashboard. Sitect on-call hand-off for 30 days post-launch.

The numbers behind a gateway done right.

Aggregated across our deployed gateways. Your numbers will be yours — these are the order-of-magnitude bracket.

  • 67% spend reduction: median LLM bill drop after cache + routing tuning.
  • 99.95% effective uptime of the LLM feature, even when one provider is down.
  • 3.4× dev velocity: speed of shipping new LLM features after the gateway lands.
  • 0 budget surprises: bills always inside the cap. Never get the finance call again.

Pick the scope. We deliver in 2–3 weeks.

Build fee is fixed in writing. Once shipped, you own the code: no per-call platform fee, no SaaS subscription, no lock-in. LLM provider tokens are billed at provider cost, directly to your accounts.

Single-provider

Starter gateway

R7,000
/project · ex 15% VAT
  • One LLM provider (OpenAI or Anthropic)
  • Retries · rate-limit · cost cap
  • 1 SDK (TS or Python)
  • Basic dashboard (spend + errors)
  • 5 prompts migrated to registry
  • 30 days post-launch support
Choose Starter →
Enterprise

Platform engagement

R36,000+
scoped per engagement · ex VAT
  • All providers + on-prem LLMs
  • SSO · audit · DSAR exports
  • All 5 SDKs
  • SLA + dedicated PM
  • Custom evals on your golden set
  • Multi-tenant key vault
  • 90 days hyper-care + QBR
Talk to us →

Common questions from CTOs.

If yours isn't here, ping us — we'll answer with a code sample, not a sales call.

Is this a SaaS we have to pay forever, or do we own the code?
You own everything. Source code, Docker images, prompt registry and dashboard are all delivered into your repos under an MIT-style license. Run it on your infra. We have an optional managed cloud (SA-region) if you don't want to self-host, but it's an opt-in, not a lock-in.
How is this different from LiteLLM / OpenRouter / Helicone?
Those are excellent point tools. LiteLLM is provider routing. OpenRouter is a hosted multi-provider gateway. Helicone is observability. The Sitect gateway is opinionated and bundles all three plus PII redaction, prompt registry, eval suite and an SA-tuned dashboard. You'll see common building blocks from those tools under the hood — we don't reinvent for the sake of reinventing.
What about latency? Does the gateway add overhead?
Median overhead is ~8ms per call (auth + dispatch). Cached responses return in ~10ms total. We deploy alongside your app in the same VPC/region, so network hops are negligible. A cold start on Lambda is ~80ms but most clients run on long-lived containers.
Can we keep using the OpenAI SDK directly?
Yes — the gateway supports OpenAI's API shape as a drop-in. Point your existing openai SDK at https://gw.your-domain.com/v1 and it'll route through us with no code change. You miss out on some advanced features (prompt registry, semantic cache) until you adopt the native SDK, but you get retries, cost caps and observability for free on day one.
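
To show what "OpenAI's API shape" means at the HTTP level, the sketch below builds a chat-completions request against the gateway URL from the answer above. `buildChatRequest` and the `GATEWAY_KEY` variable are hypothetical, and how the gateway issues and checks keys is an assumption.

```javascript
// Builds an OpenAI-style chat completions request aimed at a custom base URL.
function buildChatRequest(baseUrl, apiKey, payload) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify(payload),
    },
  };
}

// Usage (needs network access, so it is left commented out):
// const { url, init } = buildChatRequest(
//   "https://gw.your-domain.com/v1",
//   process.env.GATEWAY_KEY, // hypothetical variable name
//   { model: "gpt-4o", messages: [{ role: "user", content: "Hello" }] }
// );
// const response = await fetch(url, init);
```

With the official openai Node SDK the same redirection is done by passing a `baseURL` option when constructing the client, so existing call-sites stay untouched.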
POPIA — does customer data leave South Africa?
It depends on the provider you pick. Azure OpenAI (EU/SA regions), AWS Bedrock (af-south-1 / eu-west-1), and self-hosted Llama/Mistral keep data in-region. OpenAI, Anthropic and Google route via US. The gateway labels each provider with its data-residency and you can pin routes to SA-only providers if your DPA requires it. PII redaction strips identifiers before any cross-border call.
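
Pinning a route to in-region providers amounts to filtering the preference list by a residency label. The sketch below mirrors the grouping in the answer above. The provider keys, the label values and the function name `pinRoute` are illustrative assumptions, and actual residency depends on how each provider account is configured.

```javascript
// Residency labels following the grouping described above; keys are illustrative.
const RESIDENCY = {
  "azure-openai": "in-region",
  "aws-bedrock": "in-region",
  "self-hosted": "in-region",
  openai: "us",
  anthropic: "us",
  google: "us",
};

// Keeps only providers whose label is allowed, preserving preference order.
function pinRoute(prefer, allowed, residency = RESIDENCY) {
  const pinned = prefer.filter((p) => allowed.includes(residency[p]));
  if (pinned.length === 0) {
    throw new Error("no provider in this route satisfies the residency rule");
  }
  return pinned;
}
```

Failing loudly when nothing qualifies is deliberate: silently falling back to an out-of-region provider is the outcome a residency rule exists to prevent.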
How do you handle prompt injection / jailbreaks?
Three layers: (1) Input sanitisation — strip suspicious patterns before the prompt goes out; (2) Output moderation — content filter on responses, with a configurable threshold; (3) Audit log — every request retained 90 days, so you can replay any incident. Plus a kill-switch to disable any route in 1 click if something goes wrong.
What does the run cost look like in production?
The gateway itself is essentially free to run — one ~80MB container, <1 vCPU and <512MB RAM for moderate traffic. The variable cost is the LLM tokens, billed directly by your provider. Most clients see net spend drop after migration because semantic cache + provider routing more than pay for the gateway compute.
What if a provider deprecates a model mid-project?
Common headache. With the gateway it's a 2-line config change to swap the underlying model — your call-sites don't change. We also run the eval suite automatically on every model swap so you catch quality regressions before customers do.

Get a free architecture review.
You'll learn something, even if you don't hire us.

Drop your repo (or a screenshare of your current LLM call-sites). One of our seniors will spend 30 minutes finding the 3 biggest risks in your setup — gaps in retries, cost-runaway prompts, missing redaction, vendor lock-in. Written follow-up within 48h. No sales pitch.

What you'll get in the review

  • Top 3 risks in your current setup
  • Indicative spend savings range
  • Suggested provider stack for SA
  • Written 1-page summary
