4 upvotes, 2 direct replies (showing 2)
View submission: Propositional Interpretability in Artificial Intelligence
At least 90% of the interpretability problem (wrt LLMs, at least) comes from propositions being a lossy summary very loosely related to actual facts or behavior. Back in the days of GOFAI, many, many people wrongly thought you could give a computer human commonsense knowledge by teaching it enough words, when the real problem is that the computer only really "sees" something like <noun37> <verb82> <noun25>.
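To make that concrete, here's a minimal sketch of what the model actually receives, assuming the tiktoken package and its GPT-2 encoding: opaque integer ids, not words with meanings attached.

```python
# Minimal sketch: what an LLM "sees" is a sequence of integer ids, not words.
# Assumes the tiktoken package (pip install tiktoken) and its GPT-2 encoding.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

text = "The cat chased the mouse"
ids = enc.encode(text)

print(ids)              # a list of plain integers - the model's entire view of the sentence
print(enc.decode(ids))  # round-trips back to the text, but the ids carry no meaning by themselves
```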
Modern LLMs are like aiming trillions of flops of brute force at the problem. It may appear solved to us - the output speech acts look goal-directed and may even be useful sometimes - but the disconnect is still there. The summary happens *after* whatever solution process runs, involving whatever unknowable computations in a different latent space. Why believe such a summary is accurate? How does the summarization happen? Answering such (important!) questions is mechanistic interpretability, and propositional interpretability by definition can't answer them.
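For a rough picture of the gap between the latent computation and the summary, here's a minimal sketch, assuming PyTorch and the Hugging Face transformers library with the public GPT-2 checkpoint. It puts the two objects side by side: the stack of high-dimensional activations that mechanistic interpretability digs into, and the piece of text that propositional interpretability has to work from.

```python
# Minimal sketch, assuming torch + transformers with the public GPT-2 checkpoint.
# Contrasts the model's internal activations (what mechanistic interpretability studies)
# with its text output (the kind of thing propositional interpretability works from).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# The latent side: one activation tensor per layer, shape (batch, seq_len, 768) for GPT-2 small.
hidden = out.hidden_states
print(len(hidden), hidden[-1].shape)

# The text side: the next-token prediction, a single readable word.
next_id = int(out.logits[0, -1].argmax())
print(tok.decode([next_id]))
```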
Comment by ArtArtArt123456 at 02/02/2025 at 00:46 UTC
6 upvotes, 1 direct replies
> when the real problem is that the computer only really "sees" something like <noun37> <verb82> <noun25>
...as opposed to what? *real* words with *real* meaning?
Comment by Idrialite at 01/02/2025 at 22:57 UTC
2 upvotes, 0 direct replies
When you get a response from an LLM, there are a lot of tokens involved, and each one marks a point where the internal activations are thrown away - only the chosen token is carried forward. Over the course of its thinking and then answering, the model has to use the tokens themselves to continue its cognition. So that's somewhat in favor of a close relation between the model's internal... thinking, whatever, and the actual text it outputs.
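A minimal sketch of that loop, again assuming PyTorch and transformers with GPT-2: at each step the final activation gets collapsed to a single token id, and only that id is appended to the input that drives the next step.

```python
# Minimal sketch of greedy autoregressive decoding, assuming torch + transformers (GPT-2).
# Each step reduces the last position's activation to one token id, and only that id
# is appended to the input for the next step - the bottleneck described above.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):
        logits = model(ids).logits                        # recomputed from token ids alone
        next_id = logits[0, -1].argmax()                  # collapse the last activation to one id
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1) # only the id is carried forward

print(tok.decode(ids[0]))
```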