AI products introduce a layer of complexity that standard discovery methods are not designed to handle. The system makes probabilistic decisions. It will be wrong sometimes. Users will have emotional reactions to those mistakes that they cannot accurately predict. And what users tell you they will do when confronted with a faulty output is rarely what they actually do.
This requires a different approach to research, starting at the very beginning of discovery.
Start by Watching, Not Asking

The instinct when building something new is to go ask users what they need. The problem is that people are not good at articulating what they need, especially for problems they have never imagined being solved differently. If you ask someone what you should build for them, you get back a wish list that reflects their current mental model, not an accurate picture of the actual problem.
What you need instead is a clear view of how your users work today. That means getting into their environment and observing directly. Watch what they do step by step. Pay attention to where they slow down, where they switch tools, where they repeat the same action. Ask them to walk you through their most common workflows and where their time goes. Ask where they consistently have to fix something downstream, or where they feel like they are doing work that should not require their judgment at all.
Observation reveals things that direct questions cannot. Users will tell you a process is fine while simultaneously spending twenty minutes on a workaround they have accepted as just part of the job. The gap between what people report and what you observe is often where the most valuable product opportunities sit.
You are also watching for the decisions users make along the way. Where are they applying judgment, and why? What information are they weighing? Understanding their decision logic matters enormously for an AI product, because those are the decisions you may eventually be asking a model to assist with or automate entirely. If you do not understand what inputs a good decision requires and what tradeoffs it involves, you cannot evaluate whether an AI can reliably support it.
Assess Feasibility with the Right Questions

Once you have a clear picture of the problem, the next question is whether an AI solution would actually be usable, not just technically possible. Technical feasibility is necessary but not sufficient. A model that can perform a task in the lab and a feature that users will actually rely on in production are different things.
AI systems produce errors. The model you ship will hallucinate, misclassify, or miss context. The rate at which that happens is something you can work on, but it will never reach zero. The relevant question for your research is not whether users can tolerate imperfection in the abstract. It is whether the specific imperfection of your specific system, applied to this specific task, is tolerable enough to be useful.
Frame the error rate explicitly in your interviews. Ask something like: if this worked correctly 80% of the time, would that be genuinely useful to you, or would the errors create more work than the tool saves? This kind of question forces users out of the hypothetical and into a real evaluation. The answer also varies significantly by context. An 80% success rate on a low-stakes categorization task might be perfectly acceptable. On a task where a wrong output triggers a downstream problem, it might disqualify the feature entirely.
Understand the Emotional Response to Errors

Error rate tolerance is not a purely rational calculation. How users respond emotionally when the system fails is just as much a feasibility input as whether the error rate is low enough to be useful.
Research consistently shows that people respond to AI errors differently than they respond to human errors. Dietvorst, Simmons, and Massey found in a series of studies that people lose confidence in an algorithm more quickly than in a human forecaster after seeing them make the same mistake. Users hold AI to a higher standard of consistency, and a single visible error can be disproportionately damaging to trust. A follow-up study published in Management Science found that giving users even limited ability to adjust or override an algorithm’s output significantly reduced this aversion, suggesting that control and perceived agency matter as much as raw accuracy.
You need to understand not just whether a user finds an error acceptable in principle, but what their actual reaction looks like. Does it undermine confidence in the whole system, or do they treat it as a one-off? Do they immediately stop using the feature, or do they move on? Will they check the output every time, or will they start trusting it without verification once they build some experience with it?
You cannot fully answer those questions through an interview, but the answers give you a starting picture of what conditions users need in place before they will rely on the system.
Ask your users what it would take to trust a machine to do this task. Not whether they trust it today, but what conditions would need to be true for them to rely on it. Some users will describe an accuracy threshold. Others will talk about needing to understand how the system made a decision. Some will say they would always need to check the output no matter what. That last answer matters: if a user will always verify the output regardless of performance, then your accuracy floor is defined by how long verification takes, not by how often the system is right.
The stakes attached to an error matter as much as the frequency. Ask your users: how would you react if the system made a mistake on this? Some will say they would catch it before it mattered. Others will describe consequences that are serious enough that the feature cannot be designed without appropriate guardrails. The emotional weight behind the answer tells you whether you are building a convenience feature or a system that requires careful error handling and clear escalation paths from the start.
Build a Prototype and Ask Again

By this point you have a real picture of the problem, an initial read on error tolerance, and some understanding of how users respond when the system fails and what it would take for them to trust it. What you do not yet have is accurate data on how users will actually behave when the system fails in front of them. That is what the prototype is for.
People answer research questions from a place of imagination. They are picturing a system that works, and their answers reflect that picture. When the system fails in front of them, even once, they respond differently than they predicted they would. Research on trust recovery in AI systems shows that early errors, specifically errors that occur before a user has built up positive experience with a system, are particularly damaging to trust and to willingness to rely on the system going forward.
Build a prototype early enough that you can confront users with failure before you have committed significant development resources to a specific approach. It does not need to be functional. It needs to be real enough that users engage with it as if it were a product, not a mockup. Then show them what happens when it goes wrong. Present a hallucinated output. Show a confident result that is factually incorrect. Watch what they do. Do they immediately dismiss it and stop trusting the tool? Do they look for a way to correct it and move on? Do they escalate?
Then go back to your earlier questions. Ask again whether 80% accuracy would be useful. Ask again whether they would check the outputs. The answers shift, sometimes significantly. Users who said they would simply verify and move on may express more anxiety than expected. Users who said errors would be fine sometimes discover that seeing one in practice feels different.
The prototype surfaces the gap between what users say and what users do. That gap tells you where your guardrails need to go, what controls users need access to, and what level of transparency the system has to provide about how it reached a conclusion. You cannot design for human behavior around AI failure without seeing that behavior.
