https://arxiv.org/pdf/2501.15740
created by F0urLeafCl0ver on 01/02/2025 at 11:52 UTC
18 upvotes, 2 top-level comments (showing 2)
Comment by AutoModerator at 01/02/2025 at 11:52 UTC
1 upvotes, 0 direct replies
Welcome to /r/philosophy! **Please read our updated rules and guidelines[1] before commenting**.
1: https://reddit.com/r/philosophy/comments/14pn2k9/welcome_to_rphilosophy_check_out_our_rules_and/?
/r/philosophy is a subreddit dedicated to discussing philosophy and philosophical issues. To that end, please keep in mind our commenting rules:
Read/watch/listen to the posted content, understand and identify the philosophical arguments given, and respond to these substantively. If you have unrelated thoughts or don't wish to read the content, please post your own thread or simply refrain from commenting. Comments which are clearly not in direct response to the posted content may be removed.
Opinions are not valuable here, arguments are! Comments that solely express musings, opinions, beliefs, or assertions without argument may be removed.
Comments which consist of personal attacks will be removed. Users with a history of such comments may be banned. Slurs, racism, and bigotry are absolutely not permitted.
Please note that as of July 1 2023, reddit has made it substantially more difficult to moderate subreddits. If you see posts or comments which violate our subreddit rules and guidelines[2], please report them using the report function. For more significant issues, please contact the moderators via modmail[3] (not via private message or chat).
2: https://reddit.com/r/philosophy/comments/14pn2k9/welcome_to_rphilosophy_check_out_our_rules_and/?
3: https://reddit.com/message/compose/?to=/r/philosophy
Comment by bildramer at 01/02/2025 at 15:21 UTC
3 upvotes, 2 direct replies
At least 90% of the interpretability problem (at least wrt LLMs) comes from propositions being a lossy summary only loosely related to the actual facts or behavior. Back in the days of GOFAI, many, many people wrongly thought you could give a computer human commonsense knowledge by teaching it enough words, when the real problem is that the computer only ever "sees" something like <noun37> <verb82> <noun25>.
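To make that grounding point concrete, here's a minimal toy sketch (the tokens, relations, and helper functions are all invented for illustration, not taken from the paper): a GOFAI-style knowledge base is just structure over opaque symbols, so renaming every symbol leaves all of its inferences intact.

```python
# Toy GOFAI-style knowledge base: "knowledge" is just relations over opaque tokens.
facts = {
    ("noun37", "verb82", "noun25"),   # intended by the programmer as "cat chases mouse"
    ("noun25", "verb11", "noun90"),   # intended as "mouse eats cheese"
}

def infer_chains(kb):
    """Derive (a, c) pairs whenever some middle token b links a to c."""
    return {(a, c)
            for (a, _, b) in kb
            for (b2, _, c) in kb
            if b == b2}

def rename(kb, mapping):
    """Systematically rename tokens; the KB's structure is untouched."""
    f = lambda t: mapping.get(t, t)
    return {(f(a), f(r), f(b)) for (a, r, b) in kb}

scrambled = rename(facts, {"noun37": "x1", "noun25": "x2", "noun90": "x3"})

print(infer_chains(facts))      # {('noun37', 'noun90')}
print(infer_chains(scrambled))  # {('x1', 'x3')}
# Same inferences up to renaming: nothing inside the system distinguishes
# "cat chases mouse" from an arbitrary relabelling of meaningless symbols.
```

That's all the "understanding" the words buy you, no matter how many of them you add.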
Modern LLMs amount to aiming trillions of FLOPs of brute force at the problem. It may appear solved to us, since the output speech acts look goal-directed and can even be useful sometimes, but the disconnect is still there. The summary happens *after* whatever solution process takes place, involving whatever unknowable computations in a different latent space. Why believe such a summary is accurate? How does the summarization happen? Answering such (important!) questions is mechanistic interpretability, and propositional interpretability by definition can't answer them.
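A toy sketch of that worry (purely illustrative; the weights and the `propositional_summary` routine are made up, not anything from the paper): the answer falls out of an opaque vector computation, and the verbal rationale is generated afterwards by a routine that never consults the latent state, so nothing forces the two to agree.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4))   # invented toy weights, standing in for a latent space
w2 = rng.standard_normal(8)

def latent_solve(x):
    """The actual decision process: an unlabelled vector computation."""
    h = np.tanh(W1 @ x)            # latent state with no propositional content
    return float(w2 @ h), h

def propositional_summary(x, answer):
    """Post-hoc verbal rationale, produced without ever reading the latent state."""
    return f"I weighed feature {int(np.argmax(x))} most heavily and concluded {answer:+.2f}."

x = rng.standard_normal(4)
answer, latent = latent_solve(x)
print(propositional_summary(x, answer))
# The summary sounds goal-directed, but nothing above guarantees it describes
# what latent_solve actually did with `latent`; checking that is the
# mechanistic question, and reading the summary alone can't settle it.
```

Propositional interpretability, on this picture, only ever gets the second function's output.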