I have been iterating on AI-assisted programming workflows for a while now. This post is the practical stack I would use today, with links to primary docs so the guidance stays grounded.
I am deliberately optimizing for three things:
- predictable outputs,
- safe tool execution,
- measurable quality over time.
1) Start with the Responses API as the default
OpenAI positions the Responses API as the primary interface for model outputs, tool use, and multi-step interactions. For new work, this is the simplest baseline.
Source:
Minimal JavaScript shape:
```js
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const response = await client.responses.create({
  model: "gpt-5-mini",
  input: "Review this diff and list only concrete risks.",
});

console.log(response.output_text);
```
2) Use Structured Outputs, not plain JSON mode
If my app expects machine-readable output, I use Structured Outputs with a strict JSON schema. The docs explicitly recommend Structured Outputs over JSON mode when possible.
Source:
Example:
```js
const result = await client.responses.create({
  model: "gpt-5-mini",
  input: "Classify this PR risk level and justify briefly.",
  text: {
    format: {
      type: "json_schema",
      name: "pr_risk",
      strict: true,
      schema: {
        type: "object",
        additionalProperties: false,
        properties: {
          risk: { type: "string", enum: ["low", "medium", "high"] },
          reason: { type: "string" },
        },
        required: ["risk", "reason"],
      },
    },
  },
});
```
3) Tool calling: keep schemas small and explicit
Function/tool calling is the bridge from text to action. In practice, the biggest reliability boost comes from tight tool schemas and a minimal tool set per request.
Source:
Operational rules I follow:
- keep tools focused and non-overlapping,
- include enums to prevent ambiguous arguments,
- execute tool calls server-side and return outputs back to the model,
- assume there can be multiple tool calls in one response.
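A minimal sketch of these rules, assuming a single strict tool; the run_tests tool and its package names are illustrative, not from any docs:

```js
// One small, strict tool; the enum prevents ambiguous arguments.
const tools = [
  {
    type: "function",
    name: "run_tests",
    description: "Run the test suite for exactly one package.",
    strict: true,
    parameters: {
      type: "object",
      additionalProperties: false,
      properties: {
        package: { type: "string", enum: ["api", "web", "cli"] },
      },
      required: ["package"],
    },
  },
];

// Collect every function call in the output; there may be more than one.
// Execute these server-side, then return outputs in a follow-up request.
function collectToolCalls(response) {
  return response.output
    .filter((item) => item.type === "function_call")
    .map((item) => ({ name: item.name, args: JSON.parse(item.arguments) }));
}
```

Keeping `tools` to one or two focused entries per request is what makes the enum constraint effective.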
4) Stream for UX, but design for moderation and traceability
Streaming improves perceived latency and makes coding assistants feel responsive in editors and CLIs. OpenAI documents the event flow and common stream events.
Source:
In production, I pair streaming with:
- event logging (request id, model, tool calls, latency),
- partial-output handling,
- moderation-aware delivery paths for user-facing surfaces.
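A sketch of that pairing, assuming the event names from the streaming docs (`response.output_text.delta`, `response.completed`); the pure reducer makes partial-output handling testable on its own:

```js
// Pure step: fold one stream event into accumulated state,
// logging trace fields when the response completes.
function applyEvent(state, event, log) {
  if (event.type === "response.output_text.delta") {
    return { ...state, text: state.text + event.delta }; // partial output
  }
  if (event.type === "response.completed") {
    log({ id: event.response.id, model: event.response.model });
  }
  return state;
}

// Driver: stream a response and render deltas incrementally.
async function streamWithTrace(client, input, log) {
  const stream = await client.responses.create({
    model: "gpt-5-mini",
    input,
    stream: true,
  });
  let state = { text: "" };
  for await (const event of stream) {
    state = applyEvent(state, event, log);
  }
  return state.text;
}
```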
5) Make cost and latency controls first-class
Two practical controls from the docs:
- Prompt Caching for repeated long prefixes,
- model-tier routing by task complexity.
Source:
As of February 12, 2026, the pricing page lists current token rates and built-in tool pricing. These values change, so I treat the pricing page as the source of truth at implementation time.
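Model-tier routing can be as simple as a lookup keyed by task kind; the task kinds and the cheaper/stronger model pairing below are my own placeholders, not a documented mapping:

```js
// Hypothetical router: cheap tier for mechanical tasks,
// stronger tier for anything open-ended.
const CHEAP_TASKS = new Set(["format", "rename", "summarize-diff"]);

function pickModel(task) {
  return CHEAP_TASKS.has(task.kind) ? "gpt-5-mini" : "gpt-5";
}

// For prompt caching, keep the long shared prefix byte-identical
// across requests; caching keys on repeated leading content.
const SHARED_PREFIX = "You are a code reviewer. House rules: ...";

function buildInput(task) {
  return `${SHARED_PREFIX}\n\nTask: ${task.prompt}`;
}
```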
6) Use embeddings when retrieval quality matters
For repo docs, runbooks, and internal standards, embeddings still matter. OpenAI provides text-embedding-3-small and text-embedding-3-large, with clear tradeoffs in cost and capability.
Source:
My default heuristic:
- start with text-embedding-3-small for cost-sensitive search,
- upgrade to text-embedding-3-large when recall/semantic precision becomes a bottleneck.
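A sketch of the retrieval side, assuming cosine similarity for ranking; the similarity helper is mine, while the embeddings call uses the SDK's `client.embeddings.create`:

```js
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Embed a batch of documents; swap the model per the heuristic above.
async function embed(client, texts) {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return res.data.map((d) => d.embedding);
}
```

Because both models share the same calling convention, upgrading is a one-line model swap (plus re-embedding the corpus).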
7) Measure quality with Evals before expanding scope
Prompt edits feel good, but evals catch regressions. OpenAI now exposes an Evals workflow so you can define criteria and run model comparisons consistently.
Source:
A practical setup for programming assistants:
- dataset of real code-review and bug-fix prompts,
- pass/fail graders for correctness and policy adherence,
- latency/cost tracking as first-class metrics.
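The pass/fail grader idea can be sketched locally before wiring it into the Evals workflow; this grader and its expected-answer shape are my own illustration, built on the pr_risk schema from section 2:

```js
// Hypothetical grader: pass only if the output is valid JSON
// and matches the expected risk label from the dataset.
function gradePrRisk(output, expected) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    return { pass: false, reason: "output is not valid JSON" };
  }
  const validRisk = ["low", "medium", "high"].includes(parsed.risk);
  const pass = validRisk && parsed.risk === expected.risk;
  return {
    pass,
    reason: pass ? "ok" : `expected ${expected.risk}, got ${parsed.risk}`,
  };
}
```

Graders this strict are what turn "the prompt feels better" into a regression signal you can track per commit.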
8) Account for rate limits and usage tiers early
This is an engineering constraint, not an ops afterthought. OpenAI documents organization/project-level limits and usage tiers.
Source:
If you build team-facing tooling, design queueing and backoff strategies from day one.
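A minimal backoff sketch, assuming the SDK surfaces an HTTP status on errors (429 for rate limits); the jitter strategy and helper names are my own:

```js
// Exponential backoff with full jitter: delay grows with each attempt
// but is randomized to avoid synchronized retries across workers.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

// Retry wrapper: retry only rate-limit errors, rethrow everything else.
async function withRetries(fn, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (err?.status !== 429 || attempt === maxAttempts - 1) throw err;
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```

For team-facing tools, put this behind a shared queue so one noisy user cannot exhaust the whole project's tier.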
My current “production minimum” checklist
If I were shipping a coding assistant this week, I would not ship without:
- Responses API baseline,
- Structured Outputs for machine-consumed responses,
- strict tool schemas and controlled tool execution,
- streaming with traceable event logs,
- prompt caching + model routing,
- evals on representative tasks,
- explicit rate-limit and retry strategy.
This stack is not the most novel. It is the most defensible.