Platform Engineering
This page is for the platform lead deciding where Keel sits in the request path, which deployment shape to start with, and how to roll out across services already carrying meaningful AI traffic.
It assumes you already know the basic mechanism from How It Works and the permit concept in the docs. The question here is narrower: where the decision step lives, which topology should own it, and how to keep rollback under control as the rollout expands.
Where Keel Sits In The Request Path
Keel sits on the synchronous step between application intent and provider execution. Your service declares workload, tenant, model intent, and token estimates; Keel answers with a signed permit or a structured block; only then does the provider call run.
- The application sends request metadata to `keel.request()` before the provider SDK call.
- Keel evaluates policy, budget scope, allowed providers, and exception paths for that workload.
- The service either executes with the permit attached or stops with a structured block reason.
- Closeout records what actually ran so platform, finance, and security read the same request record later.
That placement keeps prompt construction, retries, and product logic in the application while moving the approval step into one shared runtime seam.
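The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions. `KeelStub`, `Permit`, `Block`, `governed_call`, and `echo_provider` are hypothetical names invented for illustration. The stub models only an allowed-model check and a tenant token budget. The real `keel.request()` signature is not documented on this page, so do not read the argument order here as the SDK's API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Union


@dataclass
class Permit:
    permit_id: str
    tenant: str
    model: str


@dataclass
class Block:
    reason: str


@dataclass
class KeelStub:
    """Hypothetical in-memory stand-in for the Keel client, not the real SDK."""
    allowed_models: dict  # workload -> set of model intents policy allows
    budget_tokens: dict   # tenant -> remaining token budget
    records: list = field(default_factory=list)

    def request(self, workload: str, tenant: str, model: str, est_tokens: int) -> Union[Permit, Block]:
        # Policy check: is this model intent allowed for the workload?
        if model not in self.allowed_models.get(workload, set()):
            return Block(f"model {model!r} not allowed for workload {workload!r}")
        # Budget check: does the tenant have room for the estimated tokens?
        if self.budget_tokens.get(tenant, 0) < est_tokens:
            return Block(f"token budget exhausted for tenant {tenant!r}")
        return Permit(str(uuid.uuid4()), tenant, model)

    def closeout(self, permit: Permit, status: str, actual_tokens: int) -> None:
        # Record what actually ran against the permit that authorized it.
        self.budget_tokens[permit.tenant] -= actual_tokens
        self.records.append({"permit_id": permit.permit_id, "status": status, "tokens": actual_tokens})


def governed_call(keel: KeelStub, workload: str, tenant: str, model: str,
                  prompt: str, provider: Callable[[str, str], str]) -> dict:
    est_tokens = len(prompt.split())  # crude placeholder token estimate
    decision = keel.request(workload, tenant, model, est_tokens)
    if isinstance(decision, Block):
        # Stop before any provider call and surface the structured reason.
        return {"blocked": decision.reason}
    try:
        output = provider(model, prompt)  # stand-in for the provider SDK call
    except Exception:
        # Provider failures still close out on the same permit_id.
        keel.closeout(decision, "provider_error", 0)
        raise
    keel.closeout(decision, "ok", est_tokens)
    return {"permit_id": decision.permit_id, "output": output}


def echo_provider(model: str, prompt: str) -> str:
    # Hypothetical provider that just echoes its input.
    return f"{model}:{prompt}"
```

The design point is ordering. The block path returns before `provider` is ever invoked. Both the success and failure paths close out against the same `permit_id`, which is what lets a later reader join the decision to the outcome.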
Deployment Topology Choice
The topology decision is about ownership and rollout cost, not marketing labels. All three shapes use the same permit model; the difference is where integration seams and payload handling live.
| Topology | Choose it when | Tradeoff you own |
|---|---|---|
| SDK | You control the service code, want metadata-only evaluation by default, and need prompt bodies to stay inside your infrastructure. | Each governed service adds its own request and closeout seams. Rollout is code-first, but the datapath stays narrow. |
| Gateway | You need governed coverage for workloads you cannot instrument yet or want one ingress point for provider calls. | Payloads transit Keel, so retention mode and residency questions move into procurement earlier. |
| Mixed | Some services can integrate directly now while others need proxy coverage first. | Platform owns two operating shapes for a while, but policy and permits still land in one record set. |
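The table reads as a per-service decision rule. The sketch below encodes it, assuming you keep a service inventory with the hypothetical boolean fields shown. `Service`, `service_topology`, `needs_payload_review`, and `fleet_topology` are illustrative names, not part of Keel.

```python
from dataclasses import dataclass
from enum import Enum


class Topology(Enum):
    SDK = "sdk"
    GATEWAY = "gateway"
    MIXED = "mixed"


@dataclass
class Service:
    name: str
    can_instrument: bool                # can the owning team change this service's code now?
    prompts_must_stay_local: bool       # must prompt bodies stay inside your infrastructure?
    wants_single_ingress: bool = False  # route provider calls through one ingress point?


def service_topology(svc: Service) -> Topology:
    # Services you cannot instrument yet, or that should share one ingress, use the gateway.
    if not svc.can_instrument or svc.wants_single_ingress:
        return Topology.GATEWAY
    # Otherwise SDK keeps evaluation metadata-only and prompt bodies in your infrastructure.
    return Topology.SDK


def needs_payload_review(svc: Service) -> bool:
    # Gateway payloads transit Keel, so services with local-only prompts need
    # retention and residency answers before they move.
    return service_topology(svc) is Topology.GATEWAY and svc.prompts_must_stay_local


def fleet_topology(services: list) -> Topology:
    if not services:
        raise ValueError("no services to place")
    shapes = {service_topology(s) for s in services}
    return shapes.pop() if len(shapes) == 1 else Topology.MIXED
```

A fleet that resolves to `MIXED` corresponds to the third row of the table: two operating shapes for platform, one permit model.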
Integration And Rollout Sequence
- Week 1 — choose one high-spend or high-scrutiny workload, define the metadata contract, and measure the added request-path latency. In-region permit evaluation runs about 8 ms p50 and 25 ms p99 in SDK mode; gateway mode usually adds another 10-15 ms before provider transit.
- Weeks 2-4 — add the remaining services on the same workload family, wire closeout, and make exception paths explicit so rollback does not depend on Slack coordination.
- Steady state — platform owns policy shape, topology defaults, and route constraints; service teams still own prompt logic, downstream retries, and application-specific failure handling.
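For the Week 1 latency measurement, here is a minimal Python sketch that times a request-path step and reports p50 and p99. You supply the `step` (for example, a wrapper around your own `keel.request()` call) and the budgets to compare against. The only figures taken from this page are the roughly 8 ms p50 and 25 ms p99 SDK-mode numbers. Percentiles do not add, so in gateway mode time the combined permit-plus-proxy path end to end instead of adding 10-15 ms to each percentile. `measure_ms`, `p50_p99`, and `within_budget` are hypothetical helper names.

```python
import statistics
import time
from typing import Callable


def measure_ms(step: Callable[[], object], runs: int = 200) -> list:
    """Time one request-path step repeatedly; returns wall-clock samples in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        step()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p50_p99(samples: list) -> tuple:
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples, n=100)
    return cuts[49], cuts[98]


def within_budget(samples: list, p50_budget_ms: float, p99_budget_ms: float) -> bool:
    # Both percentiles must fit; a good median does not excuse a slow tail.
    p50, p99 = p50_p99(samples)
    return p50 <= p50_budget_ms and p99 <= p99_budget_ms
```

Run it from the same region as the Keel edge you will use in production, because the figures on this page are for in-region evaluation.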
Seam count matters operationally: SDK mode distributes the integration work across the governed services, while gateway mode centralizes it at the proxy and shifts more of the runtime contract onto the platform team.
Failure Modes And Rollback Posture
| Failure mode | Default posture | Platform implication |
|---|---|---|
| Keel edge unavailable | Audit-evidence workloads default fail-closed; cost-only workloads can be configured fail-open with deferred reconciliation. | Choose the posture per workload before rollout. Do not leave it to incident improvisation. |
| Policy evaluation error | Fail-closed. | A bad rule deploy blocks with a structured reason instead of silently passing traffic through. |
| Provider error after permit issuance | The request records `provider_error` on the same `permit_id`. | Rollback can change provider route or disable a workload without losing the earlier decision record. |
| Kill switch needed | Block at the workload or exception-path level in policy. | The fastest rollback is a policy change, not patching each service independently. |
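The failure table can be sketched as a small decision function. Everything here is hypothetical: the exception classes, the `POSTURE` mapping, the workload names, and the dict shape returned by `request_permit` are illustrative stand-ins. Defaulting unlisted workloads to fail-closed is this sketch's own conservative choice, not documented Keel behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Posture(Enum):
    FAIL_CLOSED = "fail_closed"
    FAIL_OPEN = "fail_open"


class KeelUnavailable(Exception):
    """Hypothetical error: the Keel edge could not be reached."""


class PolicyEvaluationError(Exception):
    """Hypothetical error: a policy rule failed to evaluate."""


# Hypothetical per-workload posture, chosen before rollout rather than during an incident.
POSTURE = {
    "audit-evidence-export": Posture.FAIL_CLOSED,
    "internal-summarizer": Posture.FAIL_OPEN,
}


@dataclass
class Outcome:
    proceed: bool
    reason: str
    reconcile_later: bool = False


def decide(workload: str, request_permit: Callable[[str], dict], deferred: list) -> Outcome:
    try:
        decision = request_permit(workload)
    except PolicyEvaluationError as exc:
        # A bad rule deploy blocks with a reason; it never passes traffic through.
        return Outcome(False, f"policy_error: {exc}")
    except KeelUnavailable:
        # Unlisted workloads default to fail-closed (an assumption of this sketch).
        if POSTURE.get(workload, Posture.FAIL_CLOSED) is Posture.FAIL_OPEN:
            deferred.append(workload)  # reconcile spend once Keel is reachable again
            return Outcome(True, "keel_unavailable: fail_open", reconcile_later=True)
        return Outcome(False, "keel_unavailable: fail_closed")
    if "block" in decision:
        # Kill switches arrive as ordinary policy blocks; no service patch needed.
        return Outcome(False, decision["block"])
    return Outcome(True, f"permit_id={decision['permit_id']}")
```

Keeping the posture in one reviewed mapping is the point of the first row. The fail-open decision is made once per workload, not improvised during an incident.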
Who To Forward This To Next
Security & Compliance
Forward this when procurement wants direct answers on encryption, residency, retention, SSO/SCIM, and NDA artifact handling.
Finance & Ops
Forward this when finance needs reconciliation, chargeback, and close-cycle handling, not another topology explainer.
For the platform decision call, use the contact page and bring a description of your current request path plus an architecture diagram. The useful next step here is a technical walkthrough, not a generic product demo.