Async Inference
Async inference lets you use the familiar OpenAI-compatible API to make LLM requests that are automatically deferred from real-time to high-priority asynchronous processing. The result is significant cost savings with minimal changes to your workflow.
This is powered by the Autobatcher — a Python client that collects your individual API calls and submits them as optimized async requests behind the scenes.
Why Async Inference?
- OpenAI-compatible — Uses the same openai SDK and API format you already know
- Drop-in cost savings — Switch your base URL and API key; your existing code works as-is
- Priority processing — Requests use a 1-hour SLA, balancing cost and speed
- No JSONL files — Unlike batch inference, you don't need to prepare input files
When to Use Async Inference
Async inference is the right choice when your application makes LLM calls that don't need to resolve in real-time. Common use cases include:
- Agentic workflows — Multi-step agent systems where individual steps can be processed asynchronously
- Background processing — Content generation, summarization, or classification that runs behind a queue
- Development and testing — Running evaluations or prompt iterations where you don't need instant feedback
- Cost optimization — Any existing OpenAI integration where you want to reduce spend without refactoring
Quick Start
1. Install the Autobatcher
```shell
pip install autobatcher
```

2. Create an API Key
Generate a key from the Doubleword Console, or sign in above to auto-populate the code examples.
3. Use it like OpenAI
```python
from autobatcher import AsyncOpenAI

client = AsyncOpenAI(api_key="{{apiKey}}")

response = await client.chat.completions.create(
    model="{{selectedModel.id}}",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)

print(response.choices[0].message.content)
```

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.doubleword.ai/v1',
  apiKey: '{{apiKey}}'
});

const response = await client.chat.completions.create({
  model: '{{selectedModel.id}}',
  messages: [
    { role: 'user', content: 'Explain quantum computing' }
  ]
});

console.log(response.choices[0].message.content);
```

```shell
curl https://api.doubleword.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer {{apiKey}}" \
  -d '{
    "model": "{{selectedModel.id}}",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```

The Autobatcher automatically collects requests and submits them in optimized batches. Your code receives standard ChatCompletion responses — no changes needed to downstream logic.
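Because requests that arrive within the same collection window share a batch, you get the most benefit by issuing calls concurrently rather than awaiting them one at a time. A minimal sketch of that pattern, with a stub coroutine standing in for the real client.chat.completions.create call (the stub and its return value are illustrative, not part of the Autobatcher API):

```python
import asyncio

async def ask(prompt: str) -> str:
    # In real code this would await client.chat.completions.create(...);
    # a stub simulates the deferred completion so the pattern is runnable.
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"

async def main() -> list[str]:
    prompts = [
        "Explain quantum computing",
        "Summarize this article",
        "Classify this support ticket",
    ]
    # Fire all requests at once; calls landing in the same collection
    # window are submitted together as a single async batch.
    return await asyncio.gather(*(ask(p) for p in prompts))

results = asyncio.run(main())
```

Sequentially awaiting each call would instead place every request in its own window, losing the batching benefit.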
How It Works
- You make API calls using the familiar OpenAI interface
- The Autobatcher collects requests over a short time window (default: 1 second)
- Collected requests are submitted as a high-priority async batch
- Results are polled and returned to your waiting callers as they complete
- Your code receives standard ChatCompletion responses
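The collect-and-flush cycle above can be sketched in a few lines of asyncio. This is an illustrative model of the pattern, not the actual Autobatcher implementation; the class name, window default, and echoed result are all invented for the sketch:

```python
import asyncio

class WindowBatcher:
    """Sketch of collect-and-flush batching: callers await individual
    futures while requests accumulate for one collection window."""

    def __init__(self, window: float = 1.0):
        self.window = window      # collection window in seconds
        self.pending = []         # (request, future) pairs
        self._flusher = None      # task that fires when the window closes

    async def submit(self, request: dict) -> dict:
        future = asyncio.get_running_loop().create_future()
        self.pending.append((request, future))
        if self._flusher is None:  # first request opens the window
            self._flusher = asyncio.create_task(self._flush_later())
        return await future        # each caller waits for its own result

    async def _flush_later(self):
        await asyncio.sleep(self.window)
        batch, self.pending, self._flusher = self.pending, [], None
        # Here the real client would submit the whole batch as one
        # high-priority async request and poll for completion; the
        # sketch just echoes each request back to its waiting caller.
        for request, future in batch:
            future.set_result({"echo": request["content"]})

async def demo() -> list[dict]:
    batcher = WindowBatcher(window=0.05)
    return await asyncio.gather(
        batcher.submit({"content": "a"}),
        batcher.submit({"content": "b"}),
    )

results = asyncio.run(demo())
```

Both submissions land inside one window, so a single flush resolves both callers' futures.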
For full configuration options, see the Autobatcher reference.