LLMs in Production: What Nobody Tells You About Integrating AI

The demo always works.

You build a prototype, the output looks impressive, your investors or customers are excited, and you start thinking about shipping it. Then you hit production and things get complicated in ways that the prototype completely obscured.

I've shipped AI features for enough startups now to have a clear picture of where things break. This post is everything I wish someone had told me before I learned it the hard way.

Latency Is a Feature Decision

The first thing teams discover in production is that AI responses are slow. Not always, but often enough that latency becomes a product problem.

GPT-4o and Claude Sonnet are fast by large model standards, but "fast by large model standards" can still mean 3-8 seconds for a complex prompt. For some use cases, that's fine. For others, it destroys the user experience.

The mistake is treating latency as purely a technical problem to solve. It's a product decision. Before you write a line of code, ask: what's the acceptable latency for this feature? What will users be doing while they wait? Is this synchronous (user is waiting) or asynchronous (results delivered later)?

Streaming is the most important UX improvement you can make for synchronous AI features. Don't wait to have the complete response before displaying anything. Stream tokens as they arrive. A streaming response that starts immediately and completes in 6 seconds feels dramatically faster than a response that appears all at once after 4 seconds. Implement streaming from day one.
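To make the streaming idea concrete, here is a minimal sketch of forwarding tokens as they arrive while measuring time to first token. Both `fake_token_stream` and `render_stream` are hypothetical names; the fake stream stands in for whatever streaming iterator your provider's SDK returns, and `emit` stands in for your transport to the browser.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_token_stream(text: str, delay_s: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    for word in text.split(" "):
        time.sleep(delay_s)  # simulates per-token generation time
        yield word + " "


def render_stream(tokens: Iterable[str], emit: Callable[[str], None]) -> dict:
    """Forward each token to the UI as it arrives and record timing."""
    start = time.monotonic()
    first_token_at = None
    parts = []
    for token in tokens:
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        parts.append(token)
        emit(token)  # e.g. write to an SSE or WebSocket connection
    return {
        "text": "".join(parts).rstrip(),
        "time_to_first_token_s": first_token_at,
        "total_s": time.monotonic() - start,
    }
```

Calling `render_stream(fake_token_stream("some text", delay_s=0.05), emit=print)` shows the shape of it. Time to first token is the number users actually perceive, so it is worth tracking separately from total completion time.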

For longer operations, design for async from the start. Let users initiate a task, do something else, and come back to results. Fighting for sub-second latency on a genuinely complex AI operation is often the wrong battle.
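One way to sketch the async pattern is a job store: the user's request returns a job id immediately, and the UI polls for status. `JobStore` is a hypothetical, in-memory illustration built on the standard library's thread pool; a real service would more likely persist jobs in a database or a task queue.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class JobStore:
    """In-memory job tracker (illustrative only; jobs are lost on restart)."""

    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, task: Callable[[], str]) -> str:
        """Start the task in the background and return a job id immediately."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(task)
        return job_id

    def status(self, job_id: str) -> dict:
        """Report state without blocking, so a UI can poll."""
        future = self._jobs[job_id]
        if not future.done():
            return {"state": "running"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "result": future.result()}

    def wait(self, job_id: str, timeout_s: float = 30.0) -> dict:
        """Block until the job finishes or the timeout passes; handy in tests."""
        try:
            self._jobs[job_id].result(timeout=timeout_s)
        except Exception:
            pass  # status() reports failures and still-running jobs
        return self.status(job_id)
```

The point of the design is that `submit` never waits on the model, so the slow call stops being the user's problem.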

Prompt Engineering Is Engineering

The second misconception: that prompting is informal. You write some instructions in plain English, it works well enough, ship it.

This is how you end up with a feature that works 80% of the time — and that remaining 20% becomes your #1 support ticket category.

Good prompts are engineering artifacts. They should be:

Versioned. Your prompt is as important as your code. When it changes, you should know exactly what changed, why, and what the effect was. Use a git-tracked file or a prompt management system. Never change prompts directly in a dashboard without tracking.

Tested. You need an eval suite: a set of representative inputs with expected outputs that you run whenever you change a prompt. Yes, this is manual work to set up. Yes, it saves you from shipping regressions. Even 20-30 test cases are dramatically better than nothing.

Specific about failure modes. Don't just tell the model what to do — tell it what not to do. What should it say if it doesn't know? What should it refuse? What format should it use when data is missing? The defaults are often wrong.

Separated by concern. Long, monolithic prompts that try to do everything are hard to debug and improve. Break complex operations into stages. Each stage should have a clear, single purpose.
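The versioned and tested properties above can be sketched in a few lines. This is an illustration under stated assumptions, not a framework: the prompt is inlined here (in practice it would live in a git-tracked file), `stub_model` is a hypothetical stand-in for a real model call so the example runs offline, and exact-match scoring only suits tasks with a single correct label.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List

# Assumption: inlined for the sketch; normally loaded from a git-tracked file.
PROMPT_TEMPLATE = (
    "Classify the support ticket as one of: billing, bug, other.\n"
    "If the ticket is empty or unreadable, answer: other.\n"
    "Ticket: {ticket}\n"
    "Answer with the label only."
)


def prompt_version(template: str) -> str:
    """Short content hash so logs and eval reports can name the exact prompt."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:8]


@dataclass
class EvalCase:
    ticket: str
    expected: str


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> dict:
    """Run every case through the model and collect the failures."""
    failures = []
    for case in cases:
        output = model(PROMPT_TEMPLATE.format(ticket=case.ticket)).strip().lower()
        if output != case.expected:
            failures.append(
                {"ticket": case.ticket, "expected": case.expected, "got": output}
            )
    passed = len(cases) - len(failures)
    return {
        "prompt_version": prompt_version(PROMPT_TEMPLATE),
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    ticket = prompt.split("Ticket: ", 1)[1].split("\n", 1)[0].lower()
    if "charge" in ticket or "invoice" in ticket:
        return "billing"
    if "crash" in ticket or "error" in ticket:
        return "bug"
    return "other"
```

Run `run_evals` before and after every prompt change and compare the pass rate and the failure list. The version hash ties each report to the exact prompt text that produced it.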

The Consistency Problem

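LLMs are not deterministic in practice. The same input can produce different outputs, and lowering the temperature reduces that variation without reliably eliminating it. The outputs are usually similar, sometimes importantly different.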

This is not a bug — it's a fundamental property of how these models work. But it creates production problems that prototype developers never see:

Your structured output extraction (JSON, tables, categorizations) will occasionally return malformed output. You need validation and retry logic. Don't assume the model will return valid JSON just because you asked for it. Parse it, catch exceptions, and have a fallback.

Your carefully evaluated prompts will produce edge-case outputs that you didn't anticipate. You'll find them in production. You need logging — not just errors, but actual AI input/output pairs, so you can analyze failure modes and improve. Log everything. Storage is cheap.

Certain user inputs will systematically trip up your model. This is your product's long tail. Users with unusual names, unusual formats, edge-case data. You'll discover these in production. Build feedback mechanisms so you can capture failures and learn from them.
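The validation, retry, and logging advice above fits together in one small wrapper. This is a sketch under assumptions: `REQUIRED_KEYS` is a made-up schema, `call_model` is any function that takes a prompt and returns raw text, and the in-memory `log` list stands in for real log storage.

```python
import json
from typing import Callable, Optional

REQUIRED_KEYS = {"category", "confidence"}  # hypothetical schema for illustration


def parse_structured(raw: str) -> Optional[dict]:
    """Return the parsed object if it is a JSON object with the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data


def extract_with_retry(
    call_model: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    log: Optional[list] = None,
) -> dict:
    """Call the model until the output validates, then fall back to a safe default."""
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt)
        parsed = parse_structured(raw)
        if log is not None:
            # Keep the raw input/output pair, not just the error.
            log.append(
                {"attempt": attempt, "prompt": prompt, "raw": raw, "valid": parsed is not None}
            )
        if parsed is not None:
            return parsed
    # Explicit fallback so downstream code never sees malformed output.
    return {"category": "unknown", "confidence": 0.0, "fallback": True}
```

The fallback value is flagged so you can count how often it fires. That count, together with the logged raw outputs, is where the long-tail inputs show up.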

Context Window Math

Here's something that bites teams surprisingly often: they design a feature that works perfectly on typical inputs, then a power user with 3 years of chat history or a 500-page document breaks it in production.

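Context windows are finite. As of mid-2025, even large context models have limits, and hitting those limits produces failures: either the request is rejected outright, or your own truncation silently drops context the model needed. Plan for: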

What's the maximum realistic size of input this feature will receive?
What happens when input exceeds the context window?
How do you chunk, summarize, or truncate in a way that preserves correctness?

For document analysis, chunking strategy is critical and often underdesigned. Naive chunking (splitting every N characters) breaks semantic units. Semantic chunking (splitting on section boundaries, paragraphs, sentence groups) takes more work but produces dramatically better results.
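As a rough illustration of the difference, here is a sketch of paragraph-aware chunking. It assumes paragraphs are separated by blank lines and uses character count as a crude proxy for tokens; `chunk_by_paragraph` is a hypothetical helper, and real documents usually need format-specific handling for headings, tables, and code.

```python
import re
from typing import List


def chunk_by_paragraph(text: str, max_chars: int = 2000) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars is split on sentence boundaries. A single
    sentence longer than max_chars is kept whole rather than cut mid-sentence.
    Units inside a chunk are joined with a blank line.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    units: List[str] = []
    for para in paragraphs:
        if len(para) <= max_chars:
            units.append(para)
        else:
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", para) if s)

    chunks: List[str] = []
    current = ""
    for unit in units:
        candidate = f"{current}\n\n{unit}" if current else unit
        if len(candidate) <= max_chars or not current:
            current = candidate  # still fits, or this is the first unit of a chunk
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks
```

Compared with slicing every N characters, no chunk here ends in the middle of a sentence, which is the property that tends to matter most for retrieval and summarization quality.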

Cost Is a First-Class Concern

I've seen AI features that work great in production but aren't economically viable at scale. The math that looked fine on paper breaks when you account for actual usage patterns.

Get in the habit of modeling cost from the start:

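(Input tokens × input price per token + output tokens × output price per token) × calls per user per month × number of users = monthly AI spend.

Input and output tokens are usually priced differently, so cost them separately and add them per call rather than multiplying them together.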

Do that math for your pessimistic case. Then ask: is this sustainable? At what usage level does it become a problem?
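The cost model is small enough to keep as a script next to your code. A minimal sketch, pricing input and output tokens separately; every number in the usage note below is a placeholder, not a quote from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class UsageAssumptions:
    input_tokens_per_call: int
    output_tokens_per_call: int
    input_price_per_million: float   # USD per 1M input tokens
    output_price_per_million: float  # USD per 1M output tokens
    calls_per_user_per_month: int
    users: int


def cost_per_call(u: UsageAssumptions) -> float:
    """USD for one call: input and output tokens are priced separately."""
    return (
        u.input_tokens_per_call * u.input_price_per_million
        + u.output_tokens_per_call * u.output_price_per_million
    ) / 1_000_000


def monthly_spend(u: UsageAssumptions) -> float:
    """USD per month across all users."""
    return cost_per_call(u) * u.calls_per_user_per_month * u.users
```

For example, with 2,000 input tokens and 500 output tokens per call, placeholder prices of $1 and $4 per million tokens, 40 calls per user per month, and 10,000 users, each call costs $0.004 and the month comes to about $1,600. Rerun it with your pessimistic inputs: doubling calls per user or prompt size scales the bill with it.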

Mitigation strategies that work:

Caching. If the same question gets asked repeatedly (common in support and FAQ use cases), cache responses. Even a simple hash-based cache on common queries can cut costs 30-60%.

Model tiering. Not every query needs your most capable model. Simple classification tasks, short-form responses, structured extraction from well-formatted data — these often work just as well with faster, cheaper models. Route complexity appropriately.

Prompt optimization. Long system prompts cost money on every call. Audit your prompts for verbosity. Concise prompts that achieve the same results are cheaper and often faster.

User-level limits. Set sensible defaults on how much a single user can call AI features. Both for cost protection and to prevent misuse.
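Two of the strategies above, caching and model tiering, can be sketched together. Assumptions: `call_model(model, prompt)` is whatever function you already use to call a provider, the model names are made up, the cache is an in-process dict with no expiry, and lowercasing the prompt is only safe when case does not change the answer.

```python
import hashlib
from typing import Callable, Dict


def cache_key(model: str, prompt: str) -> str:
    """Hash the model name plus a whitespace- and case-normalized prompt."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()


class CachedClient:
    """Exact-match response cache in front of a model call."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, prompt: str) -> str:
        key = cache_key(model, prompt)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        response = self._call_model(model, prompt)
        self._cache[key] = response
        return response


def pick_model(task: str, prompt: str) -> str:
    """Toy routing rule: simple task types with short prompts use the cheaper tier."""
    simple_tasks = {"classification", "extraction"}
    if task in simple_tasks and len(prompt) < 2000:
        return "small-model"  # hypothetical model names
    return "large-model"
```

The hit and miss counters are there on purpose. Measure your real hit rate before counting on any savings, because it depends entirely on how repetitive your traffic is.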

Hallucination Is a Product Problem

LLMs hallucinate. They state incorrect things with confidence. For many applications, this is the #1 user trust problem.

The answer is not "use a more accurate model." All current models hallucinate to some degree. The answer is product design.

Ground your model. RAG (retrieval-augmented generation) is the most impactful technique for reducing hallucination in knowledge-intensive tasks. Instead of relying on the model's training data, retrieve relevant, accurate information and include it in the prompt. The model can then base its answer on text you control rather than on its parametric memory alone. It can still misread or ignore that text, and the answer is only as good as what you retrieve, so retrieval quality matters as much as the model.

Tell the model it's okay not to know. Explicitly. "If you don't know the answer, say 'I don't know' rather than guessing." Without explicit permission, models often confabulate rather than acknowledge uncertainty. With it, they are more likely to admit when they don't know.

Design for graceful failure. When the AI doesn't know, what does the user see? "I'm not sure about that, but here are some resources that might help" is better than a confidently wrong answer. Design the failure state explicitly.

Show your work. For factual claims, show the source. "Based on your documentation (page 12, Setup section)..." builds trust and allows users to verify. Cited responses feel more reliable and allow you to catch hallucinations more easily.
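The grounding, permission-to-not-know, and citation ideas above all end up in how the prompt is assembled. A minimal sketch: the `retrieve` function here is a toy keyword-overlap ranker used only so the example runs offline (a real system would more likely use embeddings or a search index), and `Passage` and `build_grounded_prompt` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    source: str  # e.g. "Setup guide, page 12"
    text: str


def retrieve(question: str, corpus: List[Passage], top_k: int = 2) -> List[Passage]:
    """Toy retrieval: rank passages by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(p.text.lower().split())), p) for p in corpus]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [p for score, p in ranked[:top_k] if score > 0]


def build_grounded_prompt(question: str, passages: List[Passage]) -> str:
    """Assemble a prompt that restricts the model to numbered, citable sources."""
    numbered = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the sources below.\n"
        "Cite the source number for every factual claim, like [1].\n"
        "If the sources do not contain the answer, say \"I don't know\" rather than guessing.\n\n"
        f"Sources:\n{numbered or '(none found)'}\n\n"
        f"Question: {question}"
    )
```

Because each source carries a number and a label, the UI can turn a `[1]` in the answer back into "Setup guide, page 12", and an answer with no citation becomes an easy signal to flag for review.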

What to Actually Do

If you're building an AI feature right now, here's the order of operations I recommend:

Define success metrics before you write a prompt. What does "working correctly" mean? How will you know if a change improved or degraded quality?
Build your eval suite with 20-30 representative test cases before shipping.
Implement streaming on day one for any synchronous, user-facing feature.
Log every AI call — inputs, outputs, latency, cost — from the start. You will need this data.
Design explicitly for when the AI fails or doesn't know. Don't let it be an afterthought.
Model your cost at scale before launch, not after.
Treat prompts as code — version them, test them, review changes.
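The logging step in the list above is the easiest one to put off, so here is a minimal sketch of it. It assumes your model wrapper returns the response text along with input and output token counts (most provider SDKs report usage in some form, but this tuple shape is an assumption), and it writes one JSON line per call to any file-like object. `logged_call` is a hypothetical name.

```python
import json
import time
from typing import Callable, TextIO, Tuple

# Assumption: the wrapped function returns (text, input_tokens, output_tokens).
ModelFn = Callable[[str], Tuple[str, int, int]]


def logged_call(
    call_model: ModelFn,
    prompt: str,
    log_file: TextIO,
    feature: str,
    input_price_per_million: float,
    output_price_per_million: float,
) -> str:
    """Call the model and append one JSON line with input, output, latency, and cost."""
    start = time.monotonic()
    text, input_tokens, output_tokens = call_model(prompt)
    latency_s = time.monotonic() - start
    cost_usd = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    record = {
        "ts": time.time(),
        "feature": feature,
        "prompt": prompt,
        "output": text,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_s": round(latency_s, 4),
        "cost_usd": cost_usd,
    }
    log_file.write(json.dumps(record) + "\n")
    return text
```

One JSON object per line keeps the log easy to grep, load into a dataframe, or replay through an eval suite later.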

The gap between "this works in a demo" and "this works in production at scale" is real. But it's navigable with the right approach. The teams that navigate it well ship AI features that users actually trust — which is the only kind worth building.