Tomer Ashuach

Technion – Israel Institute of Technology

I’m a PhD student at the Technion, advised by Prof. Yonatan Belinkov. My research focuses on the interpretability of language models, with particular emphasis on uncovering their internal mechanisms and understanding how knowledge is acquired and how it can be unlearned.

Research Interests

Interpretability in LLMs
Knowledge and Unlearning in LLMs
AI Safety and Alignment

Publications

2026

ACL 2026

Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Tomer Ashuach, Liat Ein-Dor, Shai Gretz, and 2 more authors

In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics: ACL 2026, Dec 2026

ArXiv Project Website

We explore whether large language models possess privileged internal knowledge about their own correctness, analogous to human introspection. By evaluating models on conflicting predictions, we discover that LLMs exhibit a domain-specific intuition for factual knowledge tasks that external observers cannot access.

ACL 2026

CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Tomer Ashuach, Dana Arad, Aaron Mueller, and 2 more authors

In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics: ACL 2026, Jan 2026

ArXiv Code Project Website

We introduce CRISP, a parameter-efficient method leveraging sparse autoencoders to achieve persistent concept unlearning in LLMs. By automatically identifying and suppressing salient features, CRISP successfully removes harmful knowledge while maintaining model utility.

2025

ACL Findings 2025

REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space

Tomer Ashuach, Martin Tutek, and Yonatan Belinkov

In Findings of the Association for Computational Linguistics: ACL 2025, May 2025

ArXiv Code Project Website

We propose REVS, a non-gradient-based method that unlearns sensitive information from language models by identifying and modifying a small subset of relevant neurons. REVS achieves robust unlearning and resists extraction attacks while preserving the model’s underlying integrity.