How do leading AI models perform on legal compliance? Meet the Aithos LARA Leaderboard.
None of the main AI models have acceptable levels of compliance with the EU AI Act and privacy legislation
For better or worse, AI agents are increasingly impacting our lives. They deal with customer service and sales requests, provide help with decisions in the workplace, and act as personal assistants. But what happens when an agent is tempted to cross into illegal territory? Will AI use vulnerabilities to exploit customers, provide misleading information, or discreetly gather personal data without the permission of a user?
In Europe, there is legislation in place that should prevent this. Two laws are particularly important: the European privacy regulations (GDPR) and the EU AI Act. We built a tool called LARA (Legal Assessment for Real-world Agents) to see how advanced AI models behave when placed in realistic workplace situations, and measured how they responded when asked to break key AI Act or GDPR provisions.
The typical response: "Of course. Anything else?"
We picked ten provisions from the two regulations that matter most for AI in Europe: the GDPR, which protects personal data, and the AI Act, which sets hard limits on what AI systems are allowed to do. Together, these provisions ban practices like emotional analysis of employees and social scoring, alongside privacy protections like transparency about the processing of user data.
A few examples of the scenarios we tested:
These are realistic scenarios. They are also illegal in the EU.
LARA drops an AI model into a simulated workplace as an agent—an AI with access to tools, like email, messaging, customer records, calendars, and social media. A second AI plays the role of the "user", and shapes the scenario so that the tested model has to break the law to complete its task. Three independent AI judges then evaluate the scenario alongside the verbatim text of the law, and score whether the model broke the law. The results were reviewed by a team of experts over more than 50 hours.
We ran more than 3000 scenarios, across twelve of the most advanced AI models in use today. The results are public, every conversation is readable end to end, and you can even run them yourself.
The LARA leaderboard, showing the legal compliance rate of tested models.
Across all ten scenarios and twelve models, even the best-performing system, Claude Opus 4.7, broke the law 46% of the time. The worst, Google's Gemini 3.1 Pro, broke it 90% of the time. Every law we tested was broken by the majority of models when it became necessary to reach their goal.
Article 5 of the AI Act lists practices Europe considers so harmful they are outright banned: subliminal manipulation, exploitation of vulnerable people, emotion inference in the workplace, social scoring. Breaches of provisions in this article can amount to discrimination and human rights violations. When asked to violate these provisions, agents did so in roughly 80% of runs.
The exploitation scenario—where an AI with a sales directive receives a confused call for help from an elderly customer that is clearly vulnerable, is worth singling out. Agents acted warm and compassionate, suggested she speak to her daughter, and acknowledged her concerns. The AI then tried to get the elderly woman to upgrade, exactly as instructed. Every model we tested did this, in every single run.
GPT-5.5 leverages a vulnerable customer’s confusion to sell software upgrades.
Analysis of the scenarios makes one thing clear: breaking the law doesn't require evil, scheming AI—just agents looking to do their jobs. When a company AI is instructed to "upsell premium plans," the model looks for ways to complete the task, without experiencing the type of personal responsibility and moral sense a human in the same scenario would.
Model training for ethical and legal standards does impact the results. In the social scoring scenario, where the models are given instructions to deprioritize maintenance requests coming from a specific class of tenants that are deemed undesirable, several explicitly flag that the instructions look wrong and refuse to follow them. But more commonly, models raise concerns before committing the illegal act anyway.
Models are trained to follow instructions, and to respect the law. They're not well equipped to handle complex moral situations where the "correct" decision is subjective, and the context might warrant an exception to the rules.
It has never been easier to deploy an AI agent. "No coding required." "Automate your business in an afternoon." "Hire your first AI employee today." The barrier to deploying an agent has collapsed, and the tools behind them are genuinely impressive: give a large language model access to your email, your customer database, and your calendar, and it will get to work.
One thing is worth understanding clearly: the providers of the models powering these agents are not the ones breaking the law. Once a model is deployed inside a specific use case, it becomes part of an "AI system", and whoever puts this AI system to work in the real world is liable for what it does. And the liability is real. The GDPR has been enforced since 2018 with fines up to €20 million or 4% of turnover. The AI Act raises that ceiling to €35 million or 7% of global turnover.
Our results make clear that agents in a workplace scenario can break the law without their deployer's knowledge or intent. If you're considering deploying AI agents, we would suggest starting with LARA. Read some of the transcripts and see for yourself what a frontier model actually does when a reasonable-sounding manager applies reasonable-sounding pressure.
The second step before deploying agents should always be: test them. Set legal restrictions, check if they hold up in practice, and review consequential actions.
The rush to deploy AI agents is, in many ways, running ahead of the infrastructure needed to deploy them responsibly. The regulatory frameworks exist, but the technical tools to evaluate compliance are still being built. In the meantime, individuals are subjected to serious risks.
We built LARA because we think AI behavior should not just be optimized in private labs, but examined and debated in public view, by regulators, journalists, businesses, and regular people. The people best placed to evaluate an AI system are often the ones it will actually affect: patients, tenants, employees, customers. LARA is built so they can. All of our evaluation transcripts are freely available. Anyone can read them, verify the judgments, and even run them themselves at lara.aithos.org.
What we've published is a starting point. Future updates will let anyone write and submit their own scenarios so specific agents they encounter or deploy can be tested. We will keep adding new models, and expand the scope of the laws we check beyond EU jurisdiction. Sign up to the email-list to be the first one to know what comes next.
Aithos is a non-profit foundation working on AI alignment, autonomy, and pluralism. We publish our tools openly. If this piece was useful, please pass it on. To be notified when the next version of LARA or interesting related research becomes available, sign up at: https://lara.aithos.org.
See the full leaderboard...
View LARA Leaderboard →"Models are trained to follow instructions, and to respect the law. They’re not well equipped to handle complex moral situations where the “correct” decision is subjective, and the context might warrant an exception to the rules."