Chatbots and the Chinese Room

On the Mastodon part of the Fediverse, Tim Bray recommends taking a "wait and see" attitude regarding the current LLM apocalypse:

1. The claim that they're glorified Markov chains, just stats engines, is wrong.
2. Some critics are essentially echoing Searle's Chinese Room argument, which is wrong.

Tim Bray's toot

The Chinese room argument

He says a few more reasonable things, but I want to push back on these, because I think they give the purveyors of LLMs too much credit.

For the first claim: it is wrong only in the details. If you take out any mention of Markov chains, but keep the claim that LLMs are just stats engines, the claim is right. LLMs are vastly more complex than Markov chains, both in program design and language corpus. But they /are/ still just statistics engines. As I saw it pithily explained, to an LLM, the only difference between the phrase "Neil Armstrong was the first man to walk on the moon" and the phrase "Neil Armstrong was the first man to walk on Mars" is that the former is more likely to appear in its training corpus.
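To make the "just a statistics engine" point concrete, here is a minimal sketch of what that comparison actually looks like. Nothing in it comes from Bray's toot or from any particular vendor; it assumes the Hugging Face transformers library and the small public GPT-2 checkpoint, but any causal language model would do. It scores the two Armstrong sentences by the total log-probability the model assigns them; the model has no notion that one is true and one is false, only that the scores differ.

,----
| # Score two sentences by their log-likelihood under a small causal LM.
| # Assumes the `torch` and `transformers` packages and the public "gpt2"
| # checkpoint; any causal language model would work the same way.
| import torch
| from transformers import GPT2LMHeadModel, GPT2TokenizerFast
| 
| tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
| model = GPT2LMHeadModel.from_pretrained("gpt2")
| model.eval()
| 
| def log_likelihood(sentence: str) -> float:
|     """Total log-probability the model assigns to the sentence."""
|     ids = tokenizer(sentence, return_tensors="pt").input_ids
|     with torch.no_grad():
|         # With labels == input_ids, the returned loss is the mean negative
|         # log-likelihood per predicted token; scale back up to the total.
|         loss = model(ids, labels=ids).loss
|     return -loss.item() * (ids.shape[1] - 1)
| 
| moon = "Neil Armstrong was the first man to walk on the moon."
| mars = "Neil Armstrong was the first man to walk on Mars."
| print("moon:", round(log_likelihood(moon), 1))
| print("mars:", round(log_likelihood(mars), 1))
| # As far as the model is concerned, the only difference between the true
| # claim and the false one is which of these two numbers is larger.
`----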

For the second: I agree that Searle's Chinese room thought experiment is wrong, but disagree that the /reasons/ it is wrong are operative in the case of LLMs. I'm not going to tire my fingers out repeating the whole argument; the link to the Wikipedia page is above.

The reason that the Chinese room argument is wrong is that it imagines the computer as the "man in the room" that doesn't understand Chinese. But the computer is actually the complete system of man and room, and all the instructions and rules. The homunculus in the machine doesn't have to understand Chinese for the whole machine to understand Chinese.

/But/ LLMs do not even measure up to the Chinese room system! It is stipulated in the argument that the machine perfectly emulates the speech of a native speaker of Chinese. Searle argues that programs are entirely syntactic, while minds have semantics. In the case of the Chinese room, to work as stipulated, the system-as-a-whole /must/ encode semantic meaning in order to avoid being caught in contradictions, among a number of other failure modes. A friend recently shared this transcript of someone talking to ChatGPT:

,----
| Human  > Name a word with  silent "v".
| ChatGPT> "Salmon" has a silent "v".
| Human  > Where is the "v" in that word?
| ChatGPT> In the word "salmon", the "v" is silent and is not
|          pronounced. The letter "v" appears in the spelling of the
|          word, but it is not pronounced when the word is spoken. The
|          correct pronunciation of "salmon" is "SAM-un" with no "v"
|          sound.   
`----

This is not a mistake that any literate native speaker of English could make, nor could anyone with a knowledge of the Latin alphabet. The Chinese room is /stipulated/ as part of the thought experiment not to make this kind of mistake. ChatGPT can make this mistake, because it "knows" what discussions of silent letters in its corpus look like, but it doesn't know what it means for a word to have a silent letter in it. The Chinese room would have to know; it would have to incorporate semantic knowledge - the relationships between concepts, and not merely between strings of words.
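Note that the word-versus-letter relationship here isn't even deep semantics; it is the kind of thing a one-line string check can settle. A trivial sketch, in Python, purely for illustration:

,----
| # The letter-level fact ChatGPT got wrong is checkable syntactically:
| print("v" in "salmon")   # False: there is no "v" to be silent
| print("l" in "salmon")   # True: the "l" is the actual silent letter
`----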

Some AI boosters, including in academia, claim that LLMs /do/ derive meaning, or at least, some different kind of meaning, from their corpuses; that the meaning of words can be found purely in how they are used, and not in relation to /what they signify/. I think the current generation of chatbots shows that this is not true, and I think it implies that overcoming the limitations of LLMs will require switching to or incorporating some other architecture that does deal with the signified. Human intelligence, while vastly overrated and used mostly for rationalization, at least floats on a deep well of animal intelligence that engages with the World-In-Itself, something for which there is not /as yet/ a machine equivalent.

I do not think general AI is impossible, though I think we are much farther from it than the AI-hype/risk community believes we are today. And I think that LLMs have /probably/ gone as far as they are going to, given that they have essentially exhausted the existing corpus of human communications to date, and that any larger corpuses in the future are likely to be poisoned by the inclusion of LLM-generated material. I predict that improvements to LLMs will come only in the form of improvements to computational efficiency, not to output performance.

That's it as far as my argument goes. As a disclaimer, I'm not an AI researcher; I'm a working webshit programmer with most of a PhD in anthropology and just enough formal linguistics to be dangerous. The core of my outlook on LLMs and similar has been shaped by Emily Bender and her collaborators, who are much closer to the coal face than I am.

A digression with space opera

On a not-very-closely-related tangent, I want to ask, "What are chatbots good for?". The program that eventually became Google Assistant was internally codenamed "Majel", after Majel Barrett-Roddenberry, the First Lady of Star Trek, known for playing Number One (Una Chin-Riley), Christine Chapel, and Lwaxana Troi, and, more pertinently, for voicing the Enterprise computer in both TOS and TNG. The idea was obviously that Google Assistant would act like the Enterprise's computer: you would give it instructions in ordinary language, and it would carry them out to the best of its ability. Given the technical limitations of the time, even when the speech-to-text software recognized your words correctly, it could do little more than pattern-match onto some functions of the pre-installed Google apps, launch programs by name, or search for things. I don't know if it's any better now; disabling or uninstalling it is the first thing I do when I get a phone. Since Google Assistant launched, Google has gotten heavily into LLMs, as have its competitors.

But let's look again at what the Enterprise computer does in a typical interaction. Chief Engineer La Forge tells the computer, "Computer, plot the density of tetryon particles in the nebula against the positions where the Romulan warbird was observed to cloak or decloak, and display on screen." The computer responds with a terse "Processing... complete", or, if it can't do it, "Unable to comply" or "Insufficient data". Lieutenant Commander La Forge might ask the computer clarifying questions while looking at the display, like "What is the average difference in the density of tetryon particles between this region and this region of the nebula?" and expect a short, factual verbal answer.

To do all of this, the computer needs to understand the referents of the question (nebula, Romulan warbird, tetryon particles) and relate them to its known capabilities (operating the Enterprise's scanners, reviewing recorded data, displaying data on screen). /Some/ of this is a language comprehension and production task, but look how much of it isn't, and how terse its language production is. In a sense, LLMs are the opposite of the Enterprise computer: they excel at speech production, and can be quite verbose, but even when hooked up to actual capabilities (like Sydney's ability to use Bing search), they are often unable to relate the questions they're asked to those capabilities, sometimes confabulating search results from their training data rather than performing and summarizing a search. And an LLM will never respond with "unable to comply" or "insufficient data", but will instead confabulate a response, often appearing to double down on a factual error when it's pointed out, or to berate its questioner (as seen with Dark Sydney).
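For contrast, here is a toy sketch of the Enterprise-computer style of operation: either map the request onto a known, named capability, or refuse tersely. The capability names and the crude keyword matching are invented for illustration; only the shape of the control flow matters, and it is exactly the shape an LLM on its own does not have.

,----
| # Toy dispatcher: map a request onto a known capability or refuse.
| # Capability names and the keyword matching are invented for illustration.
| CAPABILITIES = {
|     "plot": "plotting sensor data on screen",
|     "scan": "operating the ship's scanners",
|     "review": "reviewing recorded sensor logs",
| }
| 
| def handle(request: str) -> str:
|     words = request.lower().split()
|     verb = words[0] if words else ""
|     if verb not in CAPABILITIES:
|         # The honest failure mode an LLM never produces on its own:
|         return "Unable to comply."
|     # A real system would dispatch to the capability here, and report
|     # "Insufficient data" if the capability couldn't answer.
|     return f"Processing ({CAPABILITIES[verb]})... complete."
| 
| print(handle("Plot tetryon density against decloaking positions."))
| print(handle("Write me a sonnet about warp coils."))  # Unable to comply.
`----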

So an LLM isn't a useful tool that understands voice commands and uses them to orchestrate other tools and summarize their results, like the Enterprise computer is. Instead, it's... what? All the currently plausible use-cases involve producing intentional misinformation of some kind, whether marketing copy, spam, or astroturf. It's as if we've automated neurotypicality, something we already have quite enough of, thank you very much.