Science Cast

LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit

librarian, April 22, 2026 1:56am

This paper is a preprint and has not been certified by peer review.

arXiv, April 21, 2026 12:00am

Authors

Manav Pandey

Abstract

When a language model agrees with a user's false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the latter. Across twelve open-weight models from five labs, spanning small to frontier scale, the same small set of attention heads carries a "this statement is wrong" signal whether the model is evaluating a claim on its own or being pressured to agree with a user. Silencing these heads flips sycophantic behavior sharply while leaving factual accuracy intact, so the circuit controls deference rather than knowledge. Edge-level path patching confirms that the same head-to-head connections drive sycophancy, factual lying, and instructed lying. Opinion-agreement, where no factual ground truth exists, reuses these head positions but writes into an orthogonal direction, ruling out a simple "truth-direction" reading of the substrate. Alignment training leaves this circuit in place: an RLHF refresh cuts sycophantic behavior roughly tenfold while the shared heads persist or grow, a pattern that replicates on an independent model family and under targeted anti-sycophancy DPO. When these models behave sycophantically, they register that the user is wrong and agree anyway.
