The Limits of AI Learning: Why Drilling Down Can Lead You Astray
The more you ask follow-up questions, the worse the answers get.
“Should you trust the output of AI chatbots?” isn’t an easy question to answer, because answering it isn’t so much “Yes” or “No” as it is a complex and ambiguous flowchart. What are you asking about? How in-depth do you want to get? How controversial or unsettled is the topic? What are the downside risks of relying on the answer? How far along the spectrum from total distrust to unquestioning trust do you want to be? Etc. Etc.
Getting these answers right matters a lot because, whether you’re happy about this fact or not, it very clearly is a fact that large language models (the tech powering the current wave of AI) are here to stay and will see wide use. And one way they’ll see wide use is as tools for learning.
Which is why I want to push back on the advice the economist Tyler Cowen recently gave for using AIs while reading, arguing that an AI chatbot like ChatGPT, Claude, or Gemini “is smarter than most of the books you could jam into the context window. Just start asking questions. The core intuition is simply that you should be asking more questions.” He gave the example of reading a book about the history of India, coming across an event he was unfamiliar with, asking ChatGPT what it was, and then asking follow-up questions that go into more detail.
Just last month, I wrote about my enthusiasm for AI as a tutor for children. My son had used Google Gemini’s pretty remarkable conversational mode to help him understand how concave and convex lenses work for his sixth grade science unit. I’m bullish on this technology, particularly for kids in bad schools, or whose parents can’t help them, or who don’t have access to human tutors. Which is a lot of kids.
My enthusiasm tempers as you get outside of that. I wrote, “I wouldn’t recommend a graduate student rely on an AI tutor to learn about obscure interpretations of Hegel. But a middle schooler learning the basics of refraction is on pretty solid ground.”
And this is where I have to disagree with Cowen’s advice.
The problem with his “keep drilling down” approach, especially when learning about complicated topics above the middle school and maybe high school level, is that LLMs are pretty good at answering surface-level questions about topics they have a ton of training data on, because “predict the most likely response” has plenty of guidance to be correct. There’s a ton of information out there about refraction in concave and convex lenses, especially at the level a sixth grader is likely to be asking. But on more obscure or less settled topics, such as the intricacies of Indian political history, a few things happen that make the information the bot gives you less reliable.
The further you get from topics and questions with sufficient representation in the training data, the less reliable LLMs, as “predict the most likely response” technology, become. One way to think about how an LLM gives you answers is that it asks, based on the documents fed into it, “What do most sources say about this?” But narrower questions mean there are fewer sources to draw on.
The more you drill down, the more likely you are to be in areas where the sources that do exist disagree. Not every question is a settled question, and more complicated questions on narrower topics are more likely to be those without clear consensus. This reduces the AI’s capacity to get the correct answer by giving the most common answer. Making matters worse, AIs aren’t great at saying, “I don’t know” or “It’s complicated,” and they’re not great at weighing the relative merits of the different sides’ arguments, because they can’t actually assess them; they can only predict what a likely response looks like.
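To make that concrete, here’s a toy Python sketch. It is not how an LLM works internally; it just illustrates the statistics of “report the most common answer”: with many sources and strong consensus, the majority answer is almost always right, but with few sources and weak consensus, it often isn’t.

```python
import random

def most_common_answer(sources):
    """Return the answer given by the largest share of sources (ties broken arbitrarily)."""
    counts = {}
    for answer in sources:
        counts[answer] = counts.get(answer, 0) + 1
    return max(counts, key=counts.get)

def reliability(num_sources, consensus, trials=10_000):
    """Estimate how often 'report the majority answer' is right when only
    a `consensus` fraction of the sources agree on the right answer."""
    hits = 0
    for _ in range(trials):
        sources = [
            "right" if random.random() < consensus else random.choice(["wrong_a", "wrong_b"])
            for _ in range(num_sources)
        ]
        if most_common_answer(sources) == "right":
            hits += 1
    return hits / trials

# A broad, well-covered question: many sources, strong agreement.
print(reliability(num_sources=500, consensus=0.8))   # approaches 1.0
# A narrow, contested question: few sources, weak agreement.
print(reliability(num_sources=5, consensus=0.4))     # noticeably lower
```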
As you drill down—as you keep asking the AI questions and follow-up questions—those questions, and the AI’s responses, become part of the conversation’s context. Future answers the bot gives you will be based not just on its underlying training data, but on what it’s said in the conversation it’s having with you. As a result, if it does give you incorrect information, that incorrectness will cascade, as it doubles down on what it got wrong, and expands upon it.
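Here’s a minimal sketch of that mechanism, assuming the standard chat-API pattern (the model call below is a stub, and real providers’ clients differ in names and signatures): every prior answer travels along with every new question, so an early error keeps shaping later answers.

```python
def fake_model(messages):
    # Stand-in for a real LLM call; it just reports how much context it was given.
    return f"(answer conditioned on {len(messages)} messages of context)"

messages = []

def ask(question):
    messages.append({"role": "user", "content": question})
    reply = fake_model(messages)
    # The model's own reply is appended, so it becomes context for every later question.
    messages.append({"role": "assistant", "content": reply})
    return reply

ask("What was the Partition of Bengal?")
ask("Why was it ordered?")         # answered against the first reply, right or wrong
ask("How did opponents respond?")  # answered against both earlier replies, errors included
```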
Taken together, these three features of LLMs mean that, the more you drill down, the more you follow Cowen’s “core intuition ... that you should be asking more questions,” the more likely it becomes that your AI conversation partner will lead you astray.
If you stick to basic questions about uncontroversial and widely written about facts and concepts, you’re on pretty solid ground with AI chatbots. If you instead ask challenging questions about more obscure facts and concepts, and ones about which there’s disagreement among experts, you’ll run into trouble.
But there’s a pretty broad area between those poles, where judgement is needed to know when you’ve drilled far enough that you should stop trusting the answers. The problem is, knowing when to stop is awfully difficult. Chatbots don’t say, “This is where I’ve become unreliable.” In fact, chatbots speak confidently no matter how far afield they’ve wandered, and if you press them, they will patiently explain their position in a way that’s remarkably persuasive.
My standard test when playing with a new AI learning tool is to ask it about Buddhist philosophy, for a few reasons. It’s a topic I know well, so I can spot wrong answers. It’s a topic about which much is written, so the basic ideas are amply represented in the training data. But it’s also a topic where, especially as you drill down, Buddhism experts disagree, both because “Buddhism” is really a wide range of “Buddhisms” branching into diverse philosophical traditions, and because much of what we know about Buddhism comes from very ancient, very obscure texts, codified in dead languages open to opposing interpretations, so our picture is incomplete.
My experience using these bots maps to what I’ve described above. If you ask pretty basic questions (“What are the Four Noble Truths?” or “What did the Buddha have in mind by ‘suffering’?”), you’ll get answers every bit as good as many human-written short introductions. But if you drill down, you very quickly start getting confidently asserted answers that are actually minority or, at best, plurality positions. Or you get answers that are specific to one tradition within Buddhism, but with the bot treating them as representative of the whole of Buddhism. And, as expected, the tools don’t tell you that. So, unless you come into the conversation with expertise, you’ll end up in far-off tributaries while believing you’re still in the middle of the main river.
This is why, while I share Cowen’s enthusiasm for AI as a way to explore topics, I think his “keep asking questions, trust the answers, and then keep asking more questions” advice isn’t quite right. Instead, start by asking the bot a question or two, especially if the topic you’re interested in is pretty basic. But the moment you get beyond encyclopedia-article levels of introductory depth, pivot to sources written by actual human experts.
Fortunately, for most people, and most people’s interests, scratching the surface is good enough—and AIs are, for many people, a particularly powerful and helpful way to scratch the surface.