Intro to Doubleword Inference
Doubleword provides three styles of inference, each optimized for a different workload. Async and batch inference offer significant cost savings over realtime pricing by deferring requests for asynchronous execution instead of processing them synchronously.
All three styles use the same OpenAI-compatible API format and share the same model catalog.
| | Realtime | Async | Batch |
|---|---|---|---|
| How it works | Standard request-response | Autobatcher defers calls to async processing | Upload JSONL file, retrieve results later |
| Latency | Immediate | Minutes | Hours |
| SLA | None | 1 hour | 1 hour or 24 hours |
| Cost | Standard pricing | Reduced pricing | Lowest pricing (24h SLA) |
| API change | None — drop-in OpenAI replacement | Swap SDK import only | JSONL file preparation |
| Best for | Interactive chat, prototyping, prompt iteration | Agentic workflows, background pipelines, production workloads | Dataset processing, evaluations, bulk generation |
Realtime Inference
Realtime inference works exactly like the standard OpenAI API — send a request, get an immediate response. It's ideal for interactive use cases, development, and prototyping.
No cost savings, but no latency trade-off either.
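Because the API is OpenAI-compatible, a realtime request is just a standard chat-completions call. The sketch below builds one with the Python standard library; the base URL and model name are placeholders, not Doubleword's real values — substitute the ones from your account.

```python
import json
import urllib.request

# Placeholder endpoint and model name — substitute Doubleword's real values.
BASE_URL = "https://api.doubleword.example/v1"

payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Hello"}],
}

request = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <your-api-key>",
    },
)

# urllib.request.urlopen(request) would send it; the JSON response follows
# the OpenAI chat-completions schema (choices[0].message.content, etc.).
```

Any OpenAI-compatible client works the same way — point it at the realtime base URL and nothing else changes.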
Get started with Realtime Inference →
Async Inference
Async inference uses the Autobatcher to automatically convert your API calls into high-priority asynchronous requests. It's a drop-in replacement for the OpenAI SDK — your existing code works with a single import change.
Because requests are deferred from real-time to async processing, you get significant cost savings while keeping the same familiar API interface.
Best suited for:
- Multi-step agentic workflows where each call doesn't need an instant response
- Background content generation and classification pipelines
- Any application code that can tolerate short async delays
- Teams migrating from OpenAI who want immediate cost savings with zero refactoring
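Conceptually, the migration is a one-line diff. The import path below is illustrative only — check the Async Inference guide for the actual module name:

```python
from doubleword import OpenAI  # illustrative import path, not the real one
# was: from openai import OpenAI

client = OpenAI()  # same constructor and methods as the OpenAI SDK

response = client.chat.completions.create(
    model="example-model",  # placeholder model name
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
)
print(response.choices[0].message.content)
```

Call sites stay identical; the Autobatcher handles the deferral behind the same interface.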
Get started with Async Inference →
Batch Inference
Batch inference is designed for large-scale data processing workloads that run outside of your application code. You upload requests as JSONL files and retrieve results when processing is complete.
With a 24-hour SLA, batch inference offers the deepest cost savings — ideal for workloads where turnaround time is measured in hours, not seconds.
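Preparing a batch input file means writing one request per line of a JSONL file. The sketch below assumes a request schema modeled on the OpenAI Batch API's JSONL format; Doubleword's exact field names may differ, and the model name is a placeholder.

```python
import json

# Placeholder prompts and model name; schema modeled on the OpenAI
# Batch API's JSONL format — Doubleword's exact fields may differ.
prompts = ["Summarize document 1", "Summarize document 2"]

lines = []
for i, prompt in enumerate(prompts):
    lines.append(json.dumps({
        "custom_id": f"req-{i}",  # lets you match results back to requests
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "example-model",
            "messages": [{"role": "user", "content": prompt}],
        },
    }))

with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")
```

Upload the file, then poll for completion and download a results file keyed by `custom_id`.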
Best suited for:
- Large dataset processing and transformation
- Model evaluations and benchmarking
- Bulk content generation and classification
- Research workflows and data enrichment