Propositional Interpretability in Artificial Intelligence

https://arxiv.org/pdf/2501.15740

created by F0urLeafCl0ver on 01/02/2025 at 11:52 UTC

18 upvotes, 2 top-level comments (showing 2)

Comments

Comment by AutoModerator at 01/02/2025 at 11:52 UTC

1 upvote, 0 direct replies

Welcome to /r/philosophy! **Please read our updated rules and guidelines[1] before commenting**.

1: https://reddit.com/r/philosophy/comments/14pn2k9/welcome_to_rphilosophy_check_out_our_rules_and/?

/r/philosophy is a subreddit dedicated to discussing philosophy and philosophical issues. To that end, please keep in mind our commenting rules:

CR1: Read/Listen/Watch the Posted Content Before You Reply

Read/watch/listen to the posted content, understand and identify the philosophical arguments given, and respond to these substantively. If you have unrelated thoughts or don't wish to read the content, please post your own thread or simply refrain from commenting. Comments which are clearly not in direct response to the posted content may be removed.

CR2: Argue Your Position

Opinions are not valuable here, arguments are! Comments that solely express musings, opinions, beliefs, or assertions without argument may be removed.

CR3: Be Respectful

Comments which consist of personal attacks will be removed. Users with a history of such comments may be banned. Slurs, racism, and bigotry are absolutely not permitted.

Please note that as of July 1, 2023, Reddit has made it substantially more difficult to moderate subreddits. If you see posts or comments which violate our subreddit rules and guidelines[2], please report them using the report function. For more significant issues, please contact the moderators via modmail[3] (not via private message or chat).

2: https://reddit.com/r/philosophy/comments/14pn2k9/welcome_to_rphilosophy_check_out_our_rules_and/?

3: https://reddit.com/message/compose/?to=/r/philosophy


Comment by bildramer at 01/02/2025 at 15:21 UTC

3 upvotes, 2 direct replies

At least 90% of the interpretability problem (at least wrt LLMs) comes from propositions being a lossy summary only loosely related to actual facts or behavior. Back in the days of GOFAI, many people wrongly thought you could give a computer human commonsense knowledge by teaching it enough words, when the real problem is that the computer only ever "sees" something like <noun37> <verb82> <noun25>.
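To make that last point concrete, here is a minimal Python sketch of what the machine actually receives. The vocabulary and integer IDs are hypothetical, chosen to echo the placeholders above; a real tokenizer is far larger but no less opaque:

```python
# Toy illustration: the model never receives "dog", "bites", "man" --
# only arbitrary integer IDs whose meaning, if any, must be inferred
# from statistical co-occurrence. Vocabulary and IDs are hypothetical.
vocab = {"dog": 37, "bites": 82, "man": 25}

def tokenize(sentence: str) -> list[int]:
    """Map each word to an opaque integer ID; everything else is lost."""
    return [vocab[word] for word in sentence.split()]

print(tokenize("dog bites man"))  # [37, 82, 25] -- all the model "sees"
```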

Modern LLMs amount to aiming trillions of FLOPs of brute force at the problem. It may appear solved to us, since the output speech acts appear goal-directed and are sometimes even useful, but the disconnect is still there. The summary is produced *after* whatever solution process occurs, involving unknowable computations in a different latent space. Why believe such a summary is accurate? How does the summarization happen? Answering such (important!) questions is the job of mechanistic interpretability, and propositional interpretability by definition can't answer them.
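To illustrate the gap between latent computation and after-the-fact summary, here is a toy NumPy sketch (not any actual LLM; all weights are hypothetical stand-ins for a learned model). A linear probe, a standard mechanistic-interpretability tool, recovers information from the latent state that the model's output summary discards entirely:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: inputs are 8 numbers; the latent state is a random nonlinear
# encoding of them; the "summary" output reports only the sign of their sum.
N, D = 1000, 32
W_in = rng.normal(size=(D, 8)) / np.sqrt(8)

X = rng.normal(size=(N, 8))
H = np.tanh(X @ W_in.T)          # latent computation (opaque to us)
true_sum = X.sum(axis=1)
summary = np.sign(true_sum)      # the propositional-level "speech act"

# A least-squares linear probe on the latent state recovers the full sum,
# which the verbal summary never mentions: the latent space carries far
# more than the proposition reports.
w, *_ = np.linalg.lstsq(H, true_sum, rcond=None)
print("probe/target correlation:", np.corrcoef(H @ w, true_sum)[0, 1])
```

The probe is the mechanistic move here: it interrogates the latent state directly rather than trusting the summary produced downstream of it.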