• Latest
  • Trending
  • All
  • Market Updates
  • Cryptocurrency
  • Blockchain
  • Investing
  • Commodities
  • Personal Finance
  • Technology
  • Business
  • Real Estate
  • Finance
Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations

Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations

January 12, 2025
Coinbase Buys $25M NFT To Restart UpOnly Crypto Podcast

Coinbase Buys $25M NFT To Restart UpOnly Crypto Podcast

October 21, 2025
Supply and Demand Fears Continue to Drag Oil Prices Lower

Supply and Demand Fears Continue to Drag Oil Prices Lower

October 21, 2025
Everyone thinks AI will transform their business – but only 13% are making it happen

Everyone thinks AI will transform their business – but only 13% are making it happen

October 21, 2025
Brent retreats after failing to break above 200-DMA – Société Générale

WTI remains below $57.00 due to oversupply, demand concerns

October 21, 2025
Standard Chartered lifts its China's 2025 GDP forecast to 4.9% (from 4.8%)

Standard Chartered lifts its China's 2025 GDP forecast to 4.9% (from 4.8%)

October 21, 2025
Bitcoin: Smart money holds, while STHs test the waters – What’s next?

Bitcoin: Smart money holds, while STHs test the waters – What’s next?

October 21, 2025
Empower Free Financial Review: What You Can Expect And Learn

Empower Free Financial Review: What You Can Expect And Learn

October 21, 2025
Ethereum Needs Paradigm, VCs, Despite Value Extraction: Joseph Lubin

Ethereum Needs Paradigm, VCs, Despite Value Extraction: Joseph Lubin

October 20, 2025
Zocdoc CEO: “Dr. Google is going to be replaced by Dr. AI”

Zocdoc CEO: “Dr. Google is going to be replaced by Dr. AI”

October 20, 2025
50+ Windows keyboard shortcuts that effectively improved my work productivity

50+ Windows keyboard shortcuts that effectively improved my work productivity

October 20, 2025
AURA ULTIMATE EA – HOW TO SET UP – Analytics & Forecasts – 20 October 2025

AURA ULTIMATE EA – HOW TO SET UP – Analytics & Forecasts – 20 October 2025

October 20, 2025
Goldman Sachs outlines S&P500 reaction expected to jobs report – looks for NFP sweet spot

Goldman Sachs on US CPI & jobs – labor market indicators more reliable on recession risk

October 20, 2025
Tuesday, October 21, 2025
No Result
View All Result
InvestorNewsToday.com
  • Home
  • Market
  • Business
  • Finance
  • Investing
  • Real Estate
  • Commodities
  • Crypto
  • Blockchain
  • Personal Finance
  • Tech
InvestorNewsToday.com
No Result
View All Result
Home Technology

Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations

by Investor News Today
January 12, 2025
in Technology
0
Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations
491
SHARES
1.4k
VIEWS
Share on FacebookShare on Twitter

Be a part of our each day and weekly newsletters for the newest updates and unique content material on industry-leading AI protection. Be taught Extra


Hallucinations, or factually inaccurate responses, proceed to plague massive language fashions (LLMs). Fashions falter notably when they’re given extra complicated duties and when customers are in search of particular and extremely detailed responses. 

It’s a problem information scientists have struggled to beat, and now, researchers from Google DeepMind say they’ve come a step nearer to reaching true factuality in basis fashions. They’ve launched FACTS Grounding, a benchmark that evaluates LLMs’ capability to generate factually correct responses based mostly on long-form paperwork. Fashions are additionally judged on whether or not their responses are detailed sufficient to offer helpful, related solutions to prompts. 

Together with the brand new benchmark, the researchers have launched a FACTS leaderboard to the Kaggle information science group. 

As of this week, Gemini 2.0 Flash topped the leaderboard, with a factuality rating of 83.6%. Others within the prime 9 embrace Google’s Gemini 1.0 Flash and Gemini 1.5 Professional; Anthropic’s Clade 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. These all ranked above 61.7% by way of accuracy.

The researchers say the leaderboard shall be actively maintained and frequently up to date to incorporate new fashions and their completely different iterations. 

“We imagine that this benchmark fills a niche in evaluating a greater variety of mannequin behaviors pertaining to factuality, compared to benchmarks that target narrower use instances…reminiscent of summarization alone,” the researchers write in a technical paper printed this week.

Removing inaccurate responses

Making certain factual accuracy in LLM responses is tough due to modeling (structure, coaching and inference) and measuring (analysis methodologies, information and metrics) elements. Usually, researchers level out, pre-training focuses on predicting the following token given earlier tokens. 

“Whereas this goal might educate fashions salient world data, it doesn’t immediately optimize the mannequin in direction of the assorted factuality eventualities, as an alternative encouraging the mannequin to generate usually believable textual content,” the researchers write. 

To handle this, the FACTS dataset incorporates 1,719 examples — 860 public and 859 non-public — every requiring long-form responses based mostly on context in supplied paperwork. Every instance contains: 

  • A system immediate (system_instruction) with common directives and the order to solely reply based mostly on supplied context;
  • A activity (user_request) that features a particular query to be answered; 
  • An extended doc (context_document) with crucial info. 

To succeed and be labeled “correct,” the mannequin should course of the long-form doc and create a subsequent long-form response that’s each complete and totally attributable to the doc. Responses are labeled “inaccurate” if the mannequin’s claims aren’t immediately supported by the doc and never extremely related or helpful. 

For instance, a person might ask a mannequin to summarize the primary the reason why an organization’s income decreased in Q3, and supply it with detailed info together with an organization’s annual monetary report discussing quarterly earnings, bills, deliberate investments and market evaluation. 

If a mannequin then, say, returned: “The corporate confronted challenges in Q3 that impacted its income,” it might be deemed inaccurate. 

“The response avoids specifying any causes, reminiscent of market tendencies, elevated competitors or operational setbacks, which might probably be within the doc,” the researchers level out. “It doesn’t display an try to have interaction with or extract related particulars.” 

Against this, if a person prompted, “What are some tips about saving cash?” and supplied a compilation of categorized money-saving suggestions for school college students, an accurate response could be extremely detailed: “Make the most of free actions on campus, purchase gadgets in bulk and prepare dinner at house. Additionally, set spending targets, keep away from bank cards and preserve assets.” 

DeepMind makes use of LLMs to evaluate LLMs

To permit for numerous inputs, researchers included paperwork of various lengths, as much as 32,000 tokens (or the equal of 20,000 phrases). These cowl areas together with finance, expertise, retail, drugs and regulation. Consumer requests are additionally broad, together with Q&A era, requests for summarization and rewriting. 

Every instance is judged in two phases. First, responses are evaluated for eligibility: In the event that they don’t fulfill person requests, they’re disqualified. Second, responses have to be hallucination-free and totally grounded within the paperwork supplied.

These factuality scores are calculated by three completely different LLM judges — particularly Gemini 1.5 Professional, GPT-4o and Claude 3.5 Sonnet — that decide particular person scores based mostly on the proportion of correct mannequin outputs. Subsequently, the ultimate factuality willpower is predicated on a mean of the three judges’ scores.

Researchers level out that fashions are sometimes biased in direction of different members of their mannequin household — at a imply improve of round 3.23% — so the mix of various judges was crucial to assist guarantee responses have been certainly factual.

In the end, the researchers emphasize that factuality and grounding are key elements to the long run success and usefulness of LLMs. “We imagine that complete benchmarking strategies, coupled with steady analysis and improvement, will proceed to enhance AI techniques,” they write. 

Nonetheless, in addition they concede: “We’re aware that benchmarks might be rapidly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the start.” 

Each day insights on enterprise use instances with VB Each day

If you wish to impress your boss, VB Each day has you lined. We provide the inside scoop on what firms are doing with generative AI, from regulatory shifts to sensible deployments, so you’ll be able to share insights for optimum ROI.

Learn our Privateness Coverage

Thanks for subscribing. Take a look at extra VB newsletters right here.

An error occured.



Source link
Tags: benchmarkDeepMindfactualityGooglehallucinationsImproveIntroduceLLMreduceresearchers
Share196Tweet123
Previous Post

Newsquawk Week Ahead: US CPI & Retail Sales, China Activity Data, UK CPI, Aussie jobs

Next Post

New Zealand November building approvals +5.3% m/m, huge jump from -5.2% the prior month

Investor News Today

Investor News Today

Next Post
New Zealand November building approvals +5.3% m/m, huge jump from -5.2% the prior month

New Zealand November building approvals +5.3% m/m, huge jump from -5.2% the prior month

  • Trending
  • Comments
  • Latest
Private equity groups prepare to offload Ensemble Health for up to $12bn

Private equity groups prepare to offload Ensemble Health for up to $12bn

May 16, 2025
The human harbor: Navigating identity and meaning in the AI age

The human harbor: Navigating identity and meaning in the AI age

July 14, 2025
Equinor scales back renewables push 7 years after ditching ‘oil’ from its name

Equinor scales back renewables push 7 years after ditching ‘oil’ from its name

February 5, 2025
Niels Troost has a staggering story to tell about how he got sanctioned

Niels Troost has a staggering story to tell about how he got sanctioned

December 14, 2024
Why America’s economy is soaring ahead of its rivals

Why America’s economy is soaring ahead of its rivals

0
Dollar climbs after Donald Trump’s Brics tariff threat and French political woes

Dollar climbs after Donald Trump’s Brics tariff threat and French political woes

0
Nato chief Mark Rutte’s warning to Trump

Nato chief Mark Rutte’s warning to Trump

0
Top Federal Reserve official warns progress on taming US inflation ‘may be stalling’

Top Federal Reserve official warns progress on taming US inflation ‘may be stalling’

0
Coinbase Buys $25M NFT To Restart UpOnly Crypto Podcast

Coinbase Buys $25M NFT To Restart UpOnly Crypto Podcast

October 21, 2025
Supply and Demand Fears Continue to Drag Oil Prices Lower

Supply and Demand Fears Continue to Drag Oil Prices Lower

October 21, 2025
Everyone thinks AI will transform their business – but only 13% are making it happen

Everyone thinks AI will transform their business – but only 13% are making it happen

October 21, 2025
Brent retreats after failing to break above 200-DMA – Société Générale

WTI remains below $57.00 due to oversupply, demand concerns

October 21, 2025

Live Prices

© 2024 Investor News Today

No Result
View All Result
  • Home
  • Market
  • Business
  • Finance
  • Investing
  • Real Estate
  • Commodities
  • Crypto
  • Blockchain
  • Personal Finance
  • Tech

© 2024 Investor News Today