In my last post, I said that lawyers and law firms can best unlock AI’s potential by getting curious about general-purpose AI models, and that, crucially, understanding where the general-purpose models struggle can help build better expectations of what domain-specific legal AI models need to do. The key is to adopt a curious, open-minded approach to exploring these tools’ capabilities.
And now I’m here to say: well, hold up a minute. I’m not advocating for immediately integrating general-purpose models into your practice without careful consideration. In fact, the American Bar Association has some very clear guidelines about the caution needed when using generative AI in your practice. Instead, I want to talk about some of the ways that LLMs struggle, and what we can learn about how to integrate AI into our own practices by being mindful of those struggles.
The well-known risks of general-purpose AI models
Some of these risks are so well-known that they don’t bear much repetition. For example, relying on ChatGPT’s happily hallucinated output can lead to serious consequences, of which the Morgan & Morgan debacle (lawyers receiving court sanctions for submitting fake, AI-generated cases) is just the latest example. There are also open questions about whether a conversation with ChatGPT is privileged or confidential: consider the UK government minister who recently had his chat history publicly disclosed, and imagine the risks of a similar disclosure in a legal context.
These are examples of areas where reliance on LLMs has failed in (sometimes spectacular) ways, but where the failure points are well understood. For practitioners exploring how to integrate AI into their practice, a little common sense and good judgment go a long way.
The more subtle risks
There are other failure modes that I think are more interesting and less well-known. These are areas where LLMs fail to meet expectations in more subtle ways.
- Not knowing when to stop digging. Imagine you’re drafting a summary judgment motion and realize partway through that a key factual issue must first be established through discovery. A human litigator would stop drafting and pivot to requesting discovery. Current LLMs aren’t good at recognizing these types of task-based dependencies, and will often continue generating legal arguments based on assumptions about unresolved facts.
- Jumping right to solutions. Unless they’re explicitly prompted otherwise, current LLMs will fill in the blanks in your instructions using the data in their training sets; if they don’t know what your requirements are, they’ll jump straight to a solution. For example, you might ask an LLM to draft a strategy memo for settlement negotiations. But instead of first clarifying your goals, budget, or client’s priorities, the LLM immediately generates a detailed settlement proposal without considering whether settlement is even desirable from your client’s perspective.
- Being a blank slate every time. The LLMs you’re working with have little to no working memory (though ChatGPT does offer a limited memory capability). That means that each time you ask an LLM to perform a task, you have to supply enough context for it to do the job; every new interaction starts from scratch.
- Not knowing its own limits. Current LLMs aren’t great at knowing what they don’t know. As we’ve seen many times, they will confidently state things that sound reasonable but have no basis in fact. This can manifest as obvious hallucinations, such as fabricated cases, or more subtly, as a misstatement of a client’s testimony.
Interestingly, three out of four of these common deficiencies are also found in humans. And just like working with humans, it’s your job as a user of these tools to know what the LLM doesn’t know and work around these limitations.
Working around the failure modes
I’ve developed what I think is a good approach to integrating LLMs into my practice. Initially, my interactions with generative AI were frustrating. Requests like, “Write a summary of this legal issue” often produced outputs that looked superficially acceptable but lacked depth or introduced subtle inaccuracies.
Part of the problem was treating the AI like a human, and expecting it to produce output that was as good as what I’d write myself.
Then I flipped the script. Rather than asking AI to produce finished legal documents, I asked it to ask me questions. I’d present my notes and explicitly instruct the AI: “Don’t write anything yet. Ask me good questions first.”
For me, the results were pretty groundbreaking. The questions the AI produced were good enough to clarify and refine my thinking. More importantly, my responses to those questions gave the LLM enough context and direction to produce drafts grounded directly in my own best thinking.
Practical Guidance: Maintaining Thoughtful Curiosity
Here’s practical advice for maintaining active, thoughtful engagement when experimenting with generative AI:
- Actively Shape AI Interactions. Instead of passively requesting finished documents (“Draft this brief”), explicitly instruct AI to interrogate your own thinking first. For example, say, “Ask me detailed questions about this legal argument,” then build responses from your reflections.
- Critically Interrogate AI Outputs. Look carefully at every AI-generated idea, checking for subtly incorrect reasoning or overly simplistic analysis. Don’t settle for something that simply looks plausible; push for genuine nuance and originality.
- Perform Comparative Checks. Always cross-verify AI-generated ideas against your intuition and traditional research methods. Treat AI as a collaborator whose suggestions require rigorous evaluation, not as a reliable shortcut.
Applying the lessons
To take things back to where we started: general-purpose LLMs are enormously capable, but come with a range of obvious and subtle failure modes that users need to learn to work with. Domain-specific LLMs do their best to cover those failure modes: in their specific areas of expertise, they need to do better than the general-purpose models. So, as the audience for legal products that leverage LLMs, we can much more easily assess whether those products are worth the cost and effort to adopt if we have our own “ground truth” to compare them against.
That means working intensely with general-purpose LLMs, becoming familiar with their range of capabilities and flaws. And by getting better at using general-purpose LLMs, we’ll develop a set of transferable skills that we can leverage when using the legal LLMs that we do adopt.