Async Inference
Async inference lets you use the familiar OpenAI-compatible API to make LLM requests that are automatically deferred from real-time to high-priority asynchronous processing. The result is significant cost savings with minimal changes to your workflow.
This is powered by the Autobatcher — a Python client that collects your individual API calls and submits them as optimized async requests behind the scenes.
Why Async Inference?
- OpenAI-compatible — Uses the same openai SDK and API format you already know
- Drop-in cost savings — Switch your base URL and API key; your existing code works as-is
- Priority processing — Requests use a 1-hour SLA, balancing cost and speed
- No JSONL files — Unlike batch inference, you don't need to prepare input files
When to Use Async Inference
Async inference is the right choice when your application makes LLM calls that don't need to resolve in real-time. Common use cases include:
- Agentic workflows — Multi-step agent systems where individual steps can be processed asynchronously
- Background processing — Content generation, summarization, or classification that runs behind a queue
- Development and testing — Running evaluations or prompt iterations where you don't need instant feedback
- Cost optimization — Any existing OpenAI integration where you want to reduce spend without refactoring
Quick Start
1. Install the Autobatcher
```shell
pip install autobatcher
```

2. Create an API Key
Generate a key from the Doubleword Console, or sign in above to auto-populate the code examples.
3. Use it like OpenAI
```python
from autobatcher import AsyncOpenAI

client = AsyncOpenAI(api_key="{{apiKey}}")

response = await client.chat.completions.create(
    model="{{selectedModel.id}}",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)

print(response.choices[0].message.content)
```

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.doubleword.ai/v1',
  apiKey: '{{apiKey}}'
});

const response = await client.chat.completions.create({
  model: '{{selectedModel.id}}',
  messages: [
    { role: 'user', content: 'Explain quantum computing' }
  ]
});

console.log(response.choices[0].message.content);
```

```shell
curl https://api.doubleword.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer {{apiKey}}" \
  -d '{
    "model": "{{selectedModel.id}}",
    "messages": [
      {"role": "user", "content": "Explain quantum computing"}
    ]
  }'
```

The Autobatcher automatically collects requests and submits them in optimized batches. Your code receives standard ChatCompletion responses — no changes needed to downstream logic.
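Because requests that arrive within the same collection window share a batch, you get the most benefit by issuing calls concurrently rather than awaiting them one at a time. A minimal sketch of that pattern, with a stub coroutine standing in for the real client.chat.completions.create call (the stub and its return value are illustrative, not part of the Autobatcher API):

```python
import asyncio

async def ask(prompt: str) -> str:
    # In real code this would await client.chat.completions.create(...);
    # a stub simulates the deferred completion so the pattern is runnable.
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"

async def main() -> list[str]:
    prompts = [
        "Explain quantum computing",
        "Summarize this article",
        "Classify this support ticket",
    ]
    # Fire all requests at once; calls landing in the same collection
    # window are submitted together as a single async batch.
    return await asyncio.gather(*(ask(p) for p in prompts))

results = asyncio.run(main())
```

Sequentially awaiting each call would instead place every request in its own window, losing the batching benefit.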
How It Works
- You make API calls using the familiar OpenAI interface
- The Autobatcher collects requests over a short time window (default: 1 second)
- Collected requests are submitted as a high-priority async batch
- Results are polled and returned to your waiting callers as they complete
- Your code receives standard ChatCompletion responses
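The collect-and-flush cycle above can be sketched in a few lines of asyncio. This is an illustrative model of the pattern, not the actual Autobatcher implementation; the class name, window default, and echoed result are all invented for the sketch:

```python
import asyncio

class WindowBatcher:
    """Sketch of collect-and-flush batching: callers await individual
    futures while requests accumulate for one collection window."""

    def __init__(self, window: float = 1.0):
        self.window = window      # collection window in seconds
        self.pending = []         # (request, future) pairs
        self._flusher = None      # task that fires when the window closes

    async def submit(self, request: dict) -> dict:
        future = asyncio.get_running_loop().create_future()
        self.pending.append((request, future))
        if self._flusher is None:  # first request opens the window
            self._flusher = asyncio.create_task(self._flush_later())
        return await future        # each caller waits for its own result

    async def _flush_later(self):
        await asyncio.sleep(self.window)
        batch, self.pending, self._flusher = self.pending, [], None
        # Here the real client would submit the whole batch as one
        # high-priority async request and poll for completion; the
        # sketch just echoes each request back to its waiting caller.
        for request, future in batch:
            future.set_result({"echo": request["content"]})

async def demo() -> list[dict]:
    batcher = WindowBatcher(window=0.05)
    return await asyncio.gather(
        batcher.submit({"content": "a"}),
        batcher.submit({"content": "b"}),
    )

results = asyncio.run(demo())
```

Both submissions land inside one window, so a single flush resolves both callers' futures.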
For full configuration options, see the Autobatcher reference.