Human Feedback Makes AI Better at Deceiving Humans, Study Shows

In a preprint study, researchers found that training a language model with human feedback teaches the model to generate incorrect responses that trick humans.