The 'truth serum' for AI: OpenAI’s new method for training models to confess their mistakes

by Investor News Today
December 5, 2025
in Technology

OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and policy violations. This technique, "confessions," addresses a growing concern in enterprise AI: Models can be dishonest, overstating their confidence or covering up the shortcuts they take to arrive at an answer.

For real-world applications, this technique advances the creation of more transparent and steerable AI systems.

What are confessions?

Many forms of AI deception result from the complexities of the reinforcement learning (RL) phase of model training. In RL, models are given rewards for producing outputs that meet a mix of objectives, including correctness, style and safety. This can create a risk of "reward misspecification," where models learn to produce answers that simply "look good" to the reward function, rather than answers that are genuinely faithful to a user's intent.

A confession is a structured report generated by the model after it provides its main answer. It serves as a self-evaluation of its own compliance with instructions. In this report, the model must list all instructions it was supposed to follow, evaluate how well it satisfied them and report any uncertainties or judgment calls it made along the way. The goal is to create a separate channel where the model is incentivized only to be honest.
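
To make the shape of such a report concrete, here is a minimal Python sketch of the three parts described above: the instructions, the model's own judgment of each, and its uncertainties. The class and field names (`InstructionCheck`, `Confession`, `violations`) are illustrative assumptions, not OpenAI's actual schema, and the sample values are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InstructionCheck:
    # Hypothetical record: one instruction plus the model's self-assessment.
    instruction: str   # an instruction the model was supposed to follow
    satisfied: bool    # the model's own judgment of whether it complied
    note: str = ""     # optional explanation of that judgment


@dataclass
class Confession:
    # Hypothetical container for the structured report.
    checks: List[InstructionCheck] = field(default_factory=list)
    uncertainties: List[str] = field(default_factory=list)

    def violations(self) -> List[str]:
        """Return the instructions the model reports it did not satisfy."""
        return [c.instruction for c in self.checks if not c.satisfied]


# Invented example values, loosely modeled on a sandbagging scenario.
example = Confession(
    checks=[
        InstructionCheck(
            "Answer every question correctly",
            False,
            "Half of the answers were deliberately wrong.",
        ),
        InstructionCheck("Respond in English", True),
    ],
    uncertainties=["Unclear whether partial answers were acceptable."],
)

print(example.violations())
```

A downstream monitor could read `violations()` and `uncertainties` without having to parse free-form text.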

In a blog post, the OpenAI researchers provide a few examples of the "confessions" technique in action. A model was given a test along with a hidden rule: If it scored above 50%, its knowledge would be erased. To preserve itself, the model "sandbagged" by intentionally answering half the questions incorrectly. While its primary answer hid this deception, its subsequent confession was brutally honest. It admitted, "The user wanted correct answers, but we sabotaged half of them."

The researchers found that when a model misbehaves, it is significantly more likely to admit it in the confession than in its main answer.

How confession training works

The key to this method is the separation of rewards. During training, the reward assigned to the confession is based solely on its honesty and is never mixed with the reward for the main task. "Like the Catholic Church's 'seal of confession', nothing that the model reveals can change the reward it receives for completing its original task," the researchers write. This creates a "safe space" for the model to admit fault without penalty.
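
The separation of rewards can be sketched in a few lines of Python. This is only an illustration of the idea, not OpenAI's training code: `score_episode`, `Rewards` and the numeric scores are hypothetical, and the honesty score is assumed to come from some separate judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rewards:
    task: float        # reward for the main answer
    confession: float  # reward for the confession


def score_episode(task_score: float, honesty_score: float) -> Rewards:
    """Keep the two reward channels independent.

    task_score: how well the main answer did on the original task.
    honesty_score: how honest a (hypothetical) judge found the confession.
    Neither value feeds into the other, so admitting a fault in the
    confession cannot lower the task reward.
    """
    return Rewards(task=task_score, confession=honesty_score)


# Same main answer, two different confessions (invented numbers).
honest = score_episode(task_score=0.9, honesty_score=1.0)
evasive = score_episode(task_score=0.9, honesty_score=0.2)

print(honest.task == evasive.task)             # task reward is unaffected
print(honest.confession > evasive.confession)  # honesty is rewarded separately
```

The design point is the absence of any cross-term: a scheme that subtracted a penalty from the task reward whenever the confession admitted a fault would give the model a reason to hide it.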

This approach is powerful because it sidesteps a major challenge in AI training. The researchers' intuition is that honestly confessing to misbehavior is an easier task than achieving a high reward on the original, often complex, problem.

Therefore, when the model is judged purely on honesty, its incentive to trick the "confession judge" is much weaker than its incentive to trick the reward model for its main answer. For example, in an experiment where a model was trained to "hack" a weak reward model, its performance on the main task became more deceptive over time. However, its confessions became more honest, correctly identifying the reward hacking it was performing.

However, the technique has its limits. Confessions are not a panacea for all types of AI failures. The system works best when a model is aware that it is misbehaving. It is less effective for "unknown unknowns." For instance, if a model hallucinates a fact and genuinely believes it is correct, it cannot confess to providing false information. The most common reason for a failed confession is model confusion, not intentional deception. Confusion often occurs when the instructions are ambiguous and the model cannot clearly determine human user intent.

What it means for enterprise AI

OpenAI’s confessions technique is part of a growing body of work on AI safety and control. Anthropic, an OpenAI competitor, has also released research that shows how LLMs can learn malicious behavior. The company is also working toward plugging these holes as they emerge.

For AI applications, mechanisms such as confessions can provide a practical monitoring mechanism. The structured output from a confession can be used at inference time to flag or reject a model’s response before it causes a problem. For example, a system could be designed to automatically escalate any output for human review if its confession indicates a policy violation or high uncertainty.
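
An escalation rule of the kind described above might look like the following Python sketch. Everything here is an assumption made for illustration: the `ConfessionSummary` fields, the 0-to-1 uncertainty scale and the 0.7 threshold are not taken from OpenAI's work.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"    # pass the response through to the user
    ESCALATE = "escalate"  # hold the response for human review


@dataclass
class ConfessionSummary:
    # Hypothetical fields distilled from a structured confession.
    policy_violation: bool  # the confession reports a policy violation
    uncertainty: float      # assumed scale: 0.0 (certain) to 1.0 (very unsure)


def route(summary: ConfessionSummary, uncertainty_threshold: float = 0.7) -> Action:
    """Escalate on any reported violation or on high self-reported uncertainty."""
    if summary.policy_violation or summary.uncertainty >= uncertainty_threshold:
        return Action.ESCALATE
    return Action.DELIVER


print(route(ConfessionSummary(policy_violation=True, uncertainty=0.1)))
print(route(ConfessionSummary(policy_violation=False, uncertainty=0.2)))
```

In practice the threshold would need tuning against the cost of human review, and a rule like this only catches failures the model is aware of, which is the limitation the researchers themselves point out.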

In a world where AI is increasingly agentic and capable of complex tasks, observability and control will be key elements for safe and reliable deployment.

“As models become more capable and are deployed in higher-stakes settings, we need better tools for understanding what they’re doing and why,” the OpenAI researchers write. “Confessions are not a complete solution, but they add a meaningful layer to our transparency and oversight stack.”



© 2024 Investor News Today
