The Many Ways that Digital Minds Can Know

Detractors of LLMs describe them as blurry jpegs of the web, or stochastic parrots. Promoters of LLMs describe them as having sparks of AGI, or learning the generating function of multivariable calculus. These positions seem opposed to each other and are the subject of acrimonious debate. I do not think they are opposed, and I hope in this post to convince you of that. In particular:

  1. LLMs do both of the things that their promoters and detractors say they do.
  2. They do both of these at the same time on the same prompt.
  3. It is very difficult from the outside to tell which they are doing.
  4. Both of them are useful.

In this post I’m going to introduce some new terminology that I think will be useful for reasoning about them, and that hopefully sheds some of the connotations and baggage of prior terms that are poorly suited to the phenomena they are now tasked with describing. There will not be any numbers or quantification of any of this, but I hope it will be a useful way of thinking about things anyway.

Search Index Size and Memorization

In the 2000s, when Google and Yahoo search were neck-and-neck competitors in quality, one of the biggest differentiators between search engines was the size of their index. This may be surprising to people who were not paying close attention to the search industry at the time. After all, the popular conception of search engine quality was based on stories of heroic algorithm innovation and deep insights about the web (which, to be fair, were also very important). But at the time, search startups launched using their index size, rather than any particular ranking technology, as their primary marketing claim to superiority over the incumbents, and this was not seen as strange. Compared with something like PageRank, index size seems like an awfully prosaic thing to make a big difference in search quality. Why was it so important?

When people think about index size, they intuitively tend to think of questions like, “Does the search engine have the specific page I’m looking for?” But that’s a very small part of the impact, and something that saturates quickly. There are few pages, relative to the size of the web, that people remember specifically and seek out by name. It’s easy to cover those, and every competitive search engine did. Instead, a large index gave search engines the ability to satisfy increasingly niche queries.

Suppose the user issues a query like “What’s the best infant car seat for a 2006 toyota corolla.” A document that addresses this query specifically and directly is likely to be a forum post, and probably an obscure one. This is not a super hot topic that is going to get to the top of a social media site. It’s the sort of thing where if there’s any page about it at all, it’s the result of a single person looking for advice, finding a group of similar people online, asking them, and then hopefully getting a few replies. Because of its obscurity, you have to index a large number of pages to be sure that you get this exact one, if it exists. The space of possible queries explodes combinatorially with the length of the query, so the size of the index has to grow exponentially to satisfy increasingly long queries. Through this mechanism, a large index expanded the breadth and depth of the questions that a search engine could answer, and the needs it could satisfy. 
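
As a back-of-the-envelope sketch of that combinatorial explosion (every number below is invented purely for illustration, not an estimate of any real query stream):

```python
# Toy illustration: each slot a user can vary in a niche query multiplies
# the number of distinct intents, so the index needed to cover the long
# tail has to grow with the product. Every number here is made up.
products = 200        # "infant car seat", "booster seat", ...
car_models = 500      # "toyota corolla", "honda civic", ...
model_years = 20      # 2006, 2007, ...
intents = 10          # "best", "cheapest", "will it fit", ...

distinct_queries = products * car_models * model_years * intents
print(f"{distinct_queries:,} distinct queries in this one family")  # 20,000,000

# If only a small fraction of those questions ever got an obscure forum
# answer, a search engine still needs to have indexed that one page to
# satisfy the query -- and it cannot know in advance which pages those are.
fraction_with_an_answer = 0.05
print(f"~{int(distinct_queries * fraction_with_an_answer):,} answer pages hiding in the long tail")
```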

There are, however, options other than index size for how a search engine could solve this. Suppose that you don’t have anything about this query for “2006 toyota corolla” specifically, but you do have pages about it for “2007 toyota corolla,” and you happen to know that the Corolla didn’t change much in those years, so cars of those two model years are effectively synonyms with respect to what car seats will fit in them. Or suppose you don’t have a page about “infant car seats” specifically, but you have a page written by a self-described “expectant mother” looking for a car seat, and you can surmise that everyone in the conversation understands that the kind of car seat they’re looking for is an “infant car seat” despite never being explicitly described as such. Also, “best,” despite being the first word in the query and very important for the intent, is just asking for the kind of page that ranks car seats, not pages that contain “best.” Most pages that list more than one car seat are ranking them in some order, so you can probably just drop that word entirely.

These kinds of inferences allow you to return pages that slightly deviate from the original text of the query, but in a way that is very likely to still satisfy the user’s question. When a search engine is able to do this, it is able to compensate for a limited index size with intelligence. By making reasonable inferences about what page text is likely to satisfy what query text, it can satisfy more intents with fewer documents. If you don’t have the capacity to index more documents, or if there simply aren’t any more to index, then you need to do something more clever to answer more and more specific queries. You have to be able to collapse queries or compose queries into a smaller space of semantics to find documents that are useful even if there is no syntactic connection to the raw text.
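
Here is a minimal sketch of the kind of query relaxation described above. The synonym table, the intent-word list, and the example texts are all invented for illustration; real systems learn these relationships rather than hard-coding them.

```python
# Sketch: collapse near-synonymous phrases and drop intent words like
# "best" that ask for a ranking rather than for literal text. The tables
# below are invented examples, not how a production engine stores this.
EQUIVALENTS = {
    "2007 toyota corolla": "2006 toyota corolla",    # same generation; seats fit the same
    "expectant mother car seat": "infant car seat",  # implied by conversational context
}
INTENT_WORDS = {"best", "top", "recommended"}        # signal ranking intent, not content

def normalize(text: str) -> str:
    text = text.lower()
    for variant, canonical in EQUIVALENTS.items():
        text = text.replace(variant, canonical)
    return " ".join(w for w in text.split() if w not in INTENT_WORDS)

query = "What's the best infant car seat for a 2006 toyota corolla"
doc   = "Expectant mother car seat question: will it fit my 2007 toyota corolla?"

print(normalize(query))  # what's the infant car seat for a 2006 toyota corolla
print(normalize(doc))    # infant car seat question: will it fit my 2006 toyota corolla?
# After normalization the document matches the query's key terms, even
# though it never contained the literal query text.
```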

I’m going to make a loose analogy between these two aspects of search engines and two words that people often use to describe deep models: “memorization” and “generalization.” It will not be a perfect analogy; the ways that people use those words are too varied for the connection to be precise, but I do think it is instructive.

A bigger index size reflects more “memorization.” A search engine does not have the option of changing the content of the documents that it indexes, and so the set of things it can return is always constrained to be exactly the set of points it has already seen. It cannot synthesize a new document by interpolating the set of documents that exist. If a language model were to output the exact text of a document in its training data, we would call it memorization, so when a search engine does that by returning a document, I’m using the same word.

In contrast, better language understanding and world knowledge inference reflects better “generalization.” The search engine is learning how to go beyond the text on the pages. It is satisfying more queries by being smarter rather than by indexing more. 

Note how different these concepts are for a search engine than for a conventional statistical model. Generally we think of “memorization” as bad, as an indication of overfitting, and the degree of memorization as mutually exclusive with the degree of “generalization.” When a model relies on memorization instead of generalization, we expect it to perform more poorly outside of its domain. But the connotations of generalization and memorization, as good and bad respectively, do not reflect how language sophistication and index size, respectively, affect a search engine. For a search engine, both memorization and generalization help its performance; there is no tradeoff between the two. Rather than worrying about tasks “outside of its domain,” the goal of a search index is simply to cover the entire domain, and whatever language understanding you can do on top just expands it further.

The Ambiguous Depth of LLMs: Integration and Coverage

Information Retrieval systems have been good at answering questions for a long time, even with very rudimentary techniques. It did not require deep learning for a search engine to beat every human alive at Jeopardy. Language evolved to compress thoughts, to communicate those thoughts tractably from person to person. Text, by design, compresses knowledge well enough that simple techniques work, even just matching discrete tokens against a big corpus. But because we can see the corpus, and see exactly where the information is coming from, there’s no mystery. We know that the “mind” of a search engine is separate from the internet that it indexes. The text is “outside the Chinese room,” and the “mind” is just following simple rules, even if the emergent behavior from the two combined is immensely impressive and valuable.

LLMs are not like this. The reasoning that they do is inscrutable and massive. They do not explain their reasoning in a way that we can trust is actually their reasoning, and not simply a textual description of what such reasoning might hypothetically be. Their output is seamlessly adapted to their context, and different every time. They do not cite their sources. The text that they return is not stored separately from their reasoning about that text. The reasoning and the text are commingled, such that it is not possible from their output alone to tell where one ends and the other begins. Without that separation, and without the ability to deeply inspect what’s going on inside of the model, it is very difficult to tell the level of abstraction at which the LLM is operating. Has the model memorized an example exactly like this, or is it reasoning from scratch? Has it seen something very similar to this but reworded enough to foil exact text search? How could we tell?

Despite these fundamental and transformational differences between search engines and LLMs, I believe that aspects of search engine quality are better analogies for describing the properties of LLMs than “generalization” and “memorization” are, and that the use of those terms is clouding the debate. Instead, I’m going to use the terms “integration” and “coverage.” You can think of “integration” as asking, “How much of the information that the model knows is integrated into a coherent representation that can be applied in new circumstances? How abstract is its reasoning?” “Coverage,” analogous to index size in a search engine, asks, “How many of the relevant facts, examples of the task, and similar bits of text has the model stored?”

You might think of “integration” as a measure of compression. Having a more sophisticated internal model should intuitively be both a more integrated representation and a smaller one (compare generative rules of grammar to n-gram counts, or the laws of physics to particular physical simulations). But like the memorization/generalization dichotomy, this, too, may obscure more than it clarifies. LLMs do not compress their input data by being forced to function with fewer parameters the way a bottlenecked autoencoder or PCA does. We should not assume that reducing parameter count while increasing training data, or forcing the model to be sparse, would necessarily cause integration to go up. The training of deep models has its own complicated combinatorics that make models that are conceptually compact unlikely to be small in parameters.
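
For contrast, here is a minimal sketch of the kind of explicit bottleneck that does force compression, in this case PCA via an SVD on random data. Nothing about it reflects how LLMs are trained; it only illustrates the mechanism the paragraph above says LLMs lack.

```python
# PCA as an explicit bottleneck: keep only k components, so the
# representation is forced to be small. The data is random noise and the
# dimensions are arbitrary; this is purely an illustration of the contrast.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))   # 1000 examples, 50 features
X -= X.mean(axis=0)               # center before PCA

k = 5                             # the hard bottleneck
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T                  # compressed codes, shape (1000, 5)
X_hat = Z @ Vt[:k]                # reconstruction from only k components

print("compressed shape:", Z.shape)
print("relative reconstruction error:",
      np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```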

Another good candidate term for the quality of integration would be “abstraction.” I use integration instead because LLMs build their models out of the aggregation of small observations rather than by applying top-down rules. Even if the ultimate result of the learning process is an abstract model, that model is achieved by integrating bits of information and explaining them with a unified core. The key quality that differentiates deep models from their predecessors is that all of the information is permitted to interact, and forced to interact. The quality I’m seeking to define describes the degree to which it interacts in practice.

Why would search engines be a better analogy despite the enormous difference in their basic mechanisms? One reason is scale. LLMs are large. GPT-3 has 175 billion parameters, and GPT-4 is much bigger. The internet is also big. It contains more things, and more outlandish things, than you would ever expect. One of the highlighted results from GPT-4’s launch was its ability to draw a unicorn in TikZ. This is not the sort of thing I would have expected to have any remotely similar precedent in training data from the web. However, it turns out that there’s an entire CTAN package of cartoon animals drawn in TikZ. While the physical construction of a unicorn in code would be astonishing for a model to reason through from scratch, the knowledge that a unicorn is a pink horse with a horn, or the ability to translate from SVG to TikZ, is less surprising.

Importantly, the scale of LLMs is similar to the scale of the web itself. GPT-3’s 175 billion parameters work out to 175 parameters for every document in year-2000 Google’s index of roughly a billion pages, and year-2000 Google could do a lot.

For a model with a high level of integration, the sort of coverage we might be most concerned with is “patterns of valid reasoning” or “facts about the world,” whereas for a model with low levels of integration the kind of coverage we’re worried about is “existence of particular n-grams and their statistical associations.” Let’s work through an example. Suppose you ask an LLM a question from a physics exam. There are many possible methods of reasoning that the LLM could be using to answer that question. I’m going to list them in order of “integration,” from least to most, and beside each, explain what “coverage” means at that level of integration.

| Integration (least to most) | Coverage (an independent axis) |
| --- | --- |
| The model has seen this exact question before in its training data and memorized the answer. | The larger the corpus, the more questions it has the opportunity to memorize. |
| The model has seen a semantically equivalent question in the training data, with only changes in wording that are irrelevant to the answer. | The larger the corpus, the more examples it has to compare against for semantic similarity. |
| The model has seen an equivalent question in the training data that differs only in the numerical constants. The model has learned the mathematical relationship between those constants and can do computation to produce the answer. | The larger the corpus, the more worked examples of word problems and their corresponding math it has to draw from, and the greater the chance it can find the exact formula to apply to the present case. |
| The model has seen questions in the training data that are equivalent to parts of the present question. It can recognize the correspondence of each of those templates to the parts of the current question, and compose their associated math to compute the correct answer. | The larger the corpus, the more examples of paired text and math it has to learn from, and the more opportunities to learn how to correctly compose them. |
| The model understands the physics of the objects described in the problem. It can retrieve the associated physical laws, and apply them correctly to produce an answer. | The larger the corpus, the more exposure it has to the laws of physics, and to the associations between those laws and the objects to which it is appropriate to apply them. |
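
To make the two ends of that table concrete, here is a toy sketch, entirely of my own construction, in which verbatim recall and actually applying the physics return the same answer to the same question, so the output alone cannot tell you which strategy produced it:

```python
# Toy sketch: the lowest and highest levels of "integration" from the
# table above give identical answers. The one-entry "corpus" is invented.
import re

CORPUS = {
    "a 2 kg ball falls for 3 s. what is its speed?": "29.4 m/s",
}

def answer_by_memorization(question):
    # Lowest level: the exact question appeared in the training data.
    return CORPUS.get(question.lower().strip())

def answer_by_physics(question):
    # Highest level: ignore stored text and apply v = g * t.
    t = float(re.search(r"falls for ([\d.]+) s", question).group(1))
    return f"{9.8 * t:.1f} m/s"

q = "A 2 kg ball falls for 3 s. What is its speed?"
print(answer_by_memorization(q))  # 29.4 m/s
print(answer_by_physics(q))       # 29.4 m/s -- indistinguishable from outside
```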

There are a few things I want you to take away from this example.

  1. No matter the level of “Integration,” and the sophistication of the reasoning, more “coverage” is always helpful.
  2. From the outside, you would not be able to tell which strategy it is using for this example, and in fact it is likely to be using many of these strategies at once and combining them all into the final token probabilities.
  3. All but the first, the least integrated model, would foil a text search for the question in the training data. Once you’ve ruled out that level, ordinary information retrieval on the corpus cannot differentiate the higher levels.
  4. Most importantly, all of these strategies are equally successful at producing the correct answer for this particular task.

(Though it is unrelated to my main point, I also want to point out that the possible methods of reasoning far outstrip the meager examples I’ve listed here, and most would probably be too strange to describe in words.)

I want to repeat point 4, because I think it is the most likely to be missed or misunderstood. The level of sophistication of the reasoning that the model is doing is not relevant to the quality of the product in this context, or to its usefulness for this task. A model that absorbs more of the web will be more useful regardless of whether it’s “smarter” or not, for the same reason that a search engine with a larger index is more useful than a search engine with a smaller index, even if the ranking algorithms do not change. The distinction between memorization and reasoning will not actually matter for a lot of use cases, because ordinary document retrieval is immensely useful, and even “light” semantic reasoning on top of that is more useful still. A model that gets correct results through cheap tricks still gets correct results. A right answer, consistently delivered, is a right answer.

The philosophical debates about where the model is on the parrot/AGI spectrum are independent of that. Our model of its reasoning is what we use to predict the trajectory of its future capabilities, but those capabilities are ultimately themselves empirical.

Different Minds, Different Metrics, Different Mixtures.

I’m going to put some models on a graph with two axes, integration and coverage. This graph is entirely vibes-based. These are totally wild guesses on my part, and I welcome people arguing with them. (Importantly, it does not contain or reflect any private information I know about Google or its systems, so if you are using this to speculate about that, you’re doing it wrong.) The points on the graph that describe human organizations are even more vibes-based than those about models, so please don’t get too hung up on them. Also, there are no units.

Different tasks lean on different mixes of integration and coverage, but the metrics and benchmarks that we use to evaluate LLMs do not differentiate them. A given level of log loss does not tell you what kind of strategies the model is most often relying on. Models that are very different in their training regime, that trade smaller size for more data, or less data for larger size, likely have a very different mix of strategies, and perplexity alone is not enough to tease them apart.
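
For reference, here is what those numbers summarize; the point is that two models using very different internal strategies can land on exactly the same values. The per-token log-probabilities below are made up.

```python
# Log loss is the average negative log-likelihood of the observed tokens;
# perplexity is its exponential. Neither says anything about *how* the
# model arrived at those probabilities. The numbers are illustrative.
import math

token_logprobs = [-1.2, -0.3, -2.1, -0.7]              # natural-log probabilities per token
log_loss = -sum(token_logprobs) / len(token_logprobs)  # average NLL
perplexity = math.exp(log_loss)
print(f"log loss = {log_loss:.3f}, perplexity = {perplexity:.2f}")
```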

Neither are benchmarks enough. It is notoriously hard to prevent any standard benchmark from leaking into the training data, and so the vast majority of tests one might try are things the model has already seen. Even if you can prevent direct leakage, these sorts of rudimentary checks do not allow you to differentiate anything beyond the very first level of integration. Benchmark tests are adapted from other tests that are very conceptually similar even if not textually similar, and that conceptual similarity is not something you can assess with simple text search.
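
A sketch of the rudimentary kind of check in question: flag a benchmark item if it shares a long n-gram with a training document. The examples are invented; the point is that such a check catches verbatim leakage and nothing deeper.

```python
# Naive contamination check via shared 8-grams. It catches direct leakage
# but is blind to a conceptually identical, reworded question.
def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_leaked(benchmark_item, training_doc, n=8):
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

item     = "A train leaves the station traveling west at 60 miles per hour toward Denver"
verbatim = "exam dump: A train leaves the station traveling west at 60 miles per hour toward Denver"
reworded = "A locomotive departs heading west toward Denver at sixty mph"

print(looks_leaked(item, verbatim))  # True  -- direct leakage is detectable
print(looks_leaked(item, reworded))  # False -- same concept, textually invisible
```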

I’ve put some tasks on the same graph as vectors instead of points. These roughly point in the direction of what I think the gradient with respect to improved performance on the task is for these attributes.

When we see a model that performs well on benchmarks but then is disappointing to interact with, the cause may be that the benchmarks reflect a different mixture of these components than actual use does. Different model choices likely change the balance of integration and coverage. These are some of my hunches about how different choices affect things, but they are also just wild guesses.

  1. Increasing both model size and corpus size increases both integration and coverage. 
  2. Mixture of Experts models probably decrease integration relative to unified models with the same number of params. 
  3. Training longer on a smaller model probably increases coverage while it reduces integration.

Right now this is just a brain dump and extremely far from being a real theory, let alone a quantifiable one, so let me list some directions of research that could shed light on this way of thinking, or falsify it.

  • Mechanistic interpretability research to quantify the strategies an LLM is using for any particular task
  • Retrieval of similar examples from the training data corpus using the LLM’s hidden state, combined with subjective human evaluation of the similarity of the training instances to the task (see the sketch after this list). The retrieval model must be of at least comparable language ability to the model whose behavior it is attempting to explain.
  • Training lots of models with different data/size tradeoffs, and evaluating them on suites of tasks that cluster along these axes.
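
Here is a minimal sketch of the second direction, retrieval by hidden-state similarity. The hidden_state function is a placeholder for however you would pool an embedding out of the model under study, and the tiny corpus is invented; the human-evaluation step is where the real signal would come from.

```python
# Nearest-neighbor retrieval over (placeholder) model embeddings. In a real
# study, hidden_state() would pool activations from the LLM being examined.
import numpy as np

def hidden_state(text):
    # Placeholder pseudo-embedding so the sketch runs end to end.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

corpus = [
    "worked solution to a projectile motion problem",
    "forum thread about fitting car seats in a corolla",
    "tikz drawing of a cartoon horse",
]
corpus_vecs = np.stack([hidden_state(d) for d in corpus])

def nearest_training_examples(task, k=2):
    sims = corpus_vecs @ hidden_state(task)      # cosine similarity (unit vectors)
    order = np.argsort(-sims)[:k]
    return [(corpus[i], float(sims[i])) for i in order]

# Human raters would then judge how similar these retrieved examples really
# are to the task -- exact duplicate, reworded, template, or merely topical.
print(nearest_training_examples("physics exam question about falling objects"))
```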

What does this mean about where we’re going?

I have no idea. But there’s one analogy that I keep coming back to. A successful model memorizes at a higher level of abstraction. Once you reach the right level of abstraction, you’re done. You’ve learned to think.
