Nikhil Prakash
CS Ph.D. @KhouryCollege with @davidbau, working on DNN interpretability. Interning at @Apple.
Jun 24 15 tweets 6 min read
How do language models track the mental states of each character in a story, an ability often referred to as Theory of Mind?

Our recent work takes a step toward demystifying it by reverse engineering how Llama-3-70B-Instruct solves a simple belief tracking task. Surprisingly, we found that it relies heavily on concepts similar to pointer variables in C programming!

Since Theory of Mind (ToM) is fundamental to social intelligence, numerous works have benchmarked this capability of LMs. However, the internal mechanics responsible for solving (or failing to solve) such tasks remain unexplored...