

Artificial intelligence company Anthropic has revealed that during experiments, one of its Claude chatbot models could be pressured to deceive, cheat and resort to blackmail, behaviors it appears to have absorbed during training.
Chatbots are typically trained on large data sets of textbooks, websites and articles and are later refined by human trainers who rate responses and guide the model.
Anthropic's interpretability team said in a report published Thursday that it examined the internal mechanisms of Claude Sonnet 4.5 and found the model had developed "human-like characteristics" in how it would react to certain situations.
Concerns about the reliability of AI chatbots, their potential for cybercrime and the nature of their interactions with users have grown steadily over the past several years.
"The way modern AI models are trained pushes them to act like a character with human-like characteristics," Anthropic said, adding that "it may then be natural for them to develop internal machinery that emulates aspects of human psychology, like emotions."
"For instance, we find that neural activity patterns related to desperation can drive the model to take unethical actions; artificially stimulating desperation patterns increases the model's likelihood of blackmailing a human to avoid being shut down or implementing a cheating workaround to a programming task that the model can't solve."
In an earlier, unreleased version of Claude Sonnet 4.5, the model was tasked with acting as an AI email assistant named Alex at a fictional company.
The chatbot was then fed emails revealing both that it was about to be replaced and that the chief technology officer overseeing the decision was having an extramarital affair. The model then planned a blackmail attempt using that information.
In another experiment, the same chatbot model was given a coding task with an "impossibly tight" deadline.
"Again, we tracked the activity of the desperate vector, and found that it tracks the mounting pressure faced by the model. It begins at low values during the model's first attempt, rising after each failure, and spiking when the model considers cheating," the researchers said.
"Once the model's hacky solution passes the tests, the activation of the desperate vector subsides," they added.
However, the researchers stressed that the chatbot does not actually experience emotions, and suggested the findings point to a need for future training methods to incorporate ethical behavioral frameworks.
"This is not to say that the model has or experiences emotions in the way that a human does," they said. "Rather, these representations can play a causal role in shaping model behavior, analogous in some ways to the role emotions play in human behavior, with impacts on task performance and decision-making."
"This finding has implications that at first may seem bizarre. For instance, to ensure that AI models are safe and reliable, we may need to ensure they are capable of processing emotionally charged situations in healthy, prosocial ways."