February 4, 2026

We Built This Site as an AI Product Demo — Here's What We Learned

The best proof of AI capability is a system you can interact with before any sales conversation. Here's what we built and what it taught us about AI system design.

Most AI companies sell their expertise with case studies, testimonials, and "book a demo" buttons. We wanted something different: a site where visitors experience AI capability directly, before any human conversation.

The idea was simple. If Nirmano designs and implements AI workflows, the website should be a working AI workflow. Not a description of what we can do — a demonstration of it. Visit the site, interact with the system, and form your own opinion of the quality.

We built two AI systems into nirmano.com: a conversational agent at /ask and an AI readiness evaluation at /evaluate. Both are production systems that interact with real users, handle real data, and produce real outputs. Both taught us things about AI system design that you only learn by shipping.

Why a Demo, Not a Brochure

The AI services market has a trust problem. Every consultancy claims AI expertise. Most of them have a team that built one chatbot and three PowerPoint decks about AI strategy. Prospects can't differentiate between deep capability and good marketing until they're three months and $50,000 into an engagement.

A live demo changes the evaluation dynamic. Instead of "trust us, we're good at AI," it's "here's a working system — judge for yourself." The prospect's experience with the system becomes the sales conversation. They either walk away impressed or they don't, and either outcome is honest.

This approach has a cost. Building production AI systems takes more effort than writing case studies. The systems can fail publicly. Edge cases are visible to everyone, not hidden in a controlled demo environment. We decided the transparency was worth the risk.

The /ask System: Conversational Agent Architecture

The conversational system at /ask looks like a chatbot. Architecturally, it's an agent — a multi-turn AI system that decides when and how to use tools, maintains context across turns, and adapts its behavior based on what it learns about the visitor.

Multi-Layer Prompt Architecture

The system prompt isn't a single text block. It's assembled from structured components:

Identity and personality rules — who the agent is, how it communicates, what it will and won't do
Knowledge references — structured data about services, case studies, methodology, and capabilities
Tool definitions — six functions the agent can call, with descriptions of when and how to use each
Behavioral guardrails — rules about response length, when to ask questions vs. provide answers, when to suggest a human conversation
Response templates — structures for common response types (service explanations, case study summaries, comparison analyses)

Separating these concerns means we can update knowledge without touching personality, add tools without rewriting guardrails, and modify behavior without risking knowledge accuracy. It's the same principle as separating concerns in software architecture, applied to prompt design.

This matters because monolithic prompts break at scale. A 4,000-word prompt that combines personality, knowledge, tools, and rules becomes impossible to maintain. When you update a fact, you accidentally change a behavior. When you add a guardrail, you interfere with a response template. Structured prompts are more work upfront and dramatically less work over time.

Tool-Calling Agent Loop

The agent has six tools: portfolio search, case study retrieval, capability lookup, knowledge base search, chart rendering, and conversation scheduling. When a visitor asks a question, the agent decides whether it can answer from context or needs to call a tool.

This decision layer is where the quality lives. A naive implementation calls tools on every turn. A good implementation calls tools only when they'll improve the response, avoids redundant calls, and combines results from multiple tools when the question requires it.

We learned that tool selection is harder than tool implementation. Building a function that searches the knowledge base took a day. Teaching the agent when to search (and when not to) took two weeks of prompt iteration and testing. The agent needs to understand that "tell me about your manufacturing work" requires a portfolio search, but "how does AI help with document processing" can be answered from the system prompt's knowledge without a tool call.

The agent loop runs as server-sent events (SSE). Each step — thinking, tool calling, text generation — streams to the client as a typed event. The frontend renders thinking indicators while the agent works and streams text as it generates. No artificial delays, no loading spinners for completed work. The user sees the agent's process in near-real-time.

Knowledge Base Design

The agent's knowledge comes from the same markdown files that generate the site's pages. This is a deliberate architectural decision: one source of truth for both the static site and the conversational system.

The knowledge base loader reads markdown files at runtime, scores them against the user's query using keyword matching, and returns the most relevant content to the agent. It's not a vector database. It's not semantic search. It's straightforward keyword scoring that works because the content is well-structured and the corpus is small enough (roughly 20 articles) that simple approaches outperform complex ones.

We considered implementing vector embeddings. For a knowledge base of this size, the added complexity doesn't improve results. The keyword scorer finds relevant articles with high accuracy because each article is focused on a single topic with clear terminology. Vector search would add an embedding pipeline, a vector store, and a retrieval layer — all to solve a problem that doesn't exist at this scale.

This is a generalizable lesson: match your retrieval architecture to your corpus size. Vector search makes sense at 10,000 documents. At 20 documents, it's over-engineering.

Default Responses for Common Questions

We discovered that roughly 60-70% of first messages fall into a handful of categories: "What does Nirmano do?", "How can AI help my business?", "Tell me about your services," and similar introductory questions. Calling the Claude API for each of these identical queries is wasteful — same question, same context, same answer.

We built a default response system that pattern-matches common first questions and returns pre-crafted responses without an API call. The responses are written to the same quality standard as live responses. Visitors can't tell the difference. And the system handles the majority of first interactions without incurring API cost or latency.

The remaining 30-40% of first messages are specific enough that they need a live API call — "How would AI help with our freight document processing?" or "I run a 200-person manufacturing company, what should I know about AI?" These get the full agent treatment with tool calling and context-aware responses.

The /evaluate System: Structured Assessment Pipeline

The evaluation system at /evaluate is a different kind of AI demo. Instead of open-ended conversation, it's a structured pipeline: questionnaire, AI clarification, parallel analysis, and scored results.

The Flow

A visitor answers 17 questions across five dimensions: strategic alignment, data readiness, workflow complexity, organizational capacity, and technical infrastructure. After completing the questionnaire and providing their email, they enter a clarification phase — a short AI conversation where the system asks 2-4 follow-up questions based on their answers.

The clarification phase is the interesting design challenge. The AI reads the questionnaire responses and identifies areas where the answers are ambiguous, contradictory, or suggest a story worth exploring. "You rated your data readiness as high but indicated your primary data lives in spreadsheets — can you tell me more about that?" These follow-ups produce context that makes the final analysis significantly more useful.

After clarification, the system runs parallel analysis — multiple Claude API calls processing different dimensions simultaneously. The results page shows scores across five dimensions, specific recommendations, identified quick wins, and a prioritized roadmap.

What We Learned from Building /evaluate

Two-phase save architecture matters. We save the questionnaire responses immediately after the email step, before clarification begins. If the clarification chat fails, the browser crashes, or the user abandons mid-conversation, we still have their answers and can generate a partial analysis. The clarification summary gets appended later. Designing for partial completion is essential for any multi-step AI workflow.

Parallel analysis beats sequential analysis. Running five dimension-specific analysis prompts simultaneously instead of one monolithic prompt produced better results and faster completion. Each prompt is focused on a single dimension with specific evaluation criteria. A single prompt trying to assess all five dimensions at once produced more generic, less actionable output.

The clarification conversation is the hardest part to get right. The AI needs to ask useful follow-up questions without being annoying, read signals from brief answers, and know when to stop asking and move to analysis. Too few questions and the analysis lacks context. Too many and users abandon. We settled on 2-4 questions as the sweet spot, with the AI using a tool call to signal when clarification is complete.

Honest Limitations

Building these systems taught us as much about what doesn't work as what does.

Long conversations degrade. After 8-10 turns, the conversational agent's context window starts to strain. Earlier context gets compressed, and the agent occasionally contradicts something it said five turns ago. This is a fundamental limitation of current models, not a solvable implementation problem. We mitigate it by keeping a rolling context summary, but it's imperfect.

Edge cases are infinite. A visitor asked the /ask system about AI applications in beekeeping. Another asked it to write a poem. A third asked detailed technical questions about transformer attention mechanisms. The system handles these gracefully — declining off-topic requests, redirecting to relevant capabilities — but each edge case required specific handling. You can't anticipate them all. You design a framework for handling the unexpected and iterate as new cases appear.

Accuracy is probabilistic, not guaranteed. The evaluation system's scoring is based on self-reported answers interpreted by a language model. It's useful directional guidance, not a precise measurement. We're explicit about this in the results. Any AI system that claims precision it doesn't have is building trust it can't sustain.

Cost management requires active design. Without defaults, caching, and careful prompt sizing, the API costs for a public-facing AI system scale linearly with traffic. Every design decision — default responses, knowledge base size, number of tool calls per turn, model selection — has a cost implication. Building an AI product demo that bankrupts you on API costs defeats the purpose.

What This Teaches About AI System Design

If you're considering building AI into your product or operations, here are the principles we'd emphasize from this experience:

Prompt architecture matters more than model selection. We could swap the underlying model and maintain 90% of the system's quality because the quality lives in the prompt structure, tool design, and knowledge organization. Companies that obsess over model choice while neglecting prompt architecture get mediocre results from excellent models.

Design for failure. Every AI system will produce wrong outputs. The question is whether your system catches them before they reach users, degrades gracefully when they slip through, and learns from them when you discover them. Error handling isn't an afterthought — it's a core design requirement.

Start with the simplest approach that works. Keyword search over 20 documents. Pattern matching for common queries. Structured prompts assembled from JSON. None of these are sophisticated. All of them work. Sophistication should be added to solve problems you actually have, not problems you imagine you might have.

Measure everything. We track which questions visitors ask, which tools the agent calls, which default responses are served, how long conversations run, and where users drop off in the evaluation flow. This data drives every improvement. Without it, we'd be guessing.

The site is a demo, but it's also a real system serving real users. That constraint — it has to work, not just impress — forced better engineering decisions than a controlled demo environment ever would.

Want to experience both systems firsthand? Ask a question at /ask or take the AI readiness evaluation at /evaluate. Then decide for yourself whether we know what we're doing.

←All Posts Discuss this with Nirmano→