Introducing IndFin-Bench: A Pioneering Benchmark Grounded in Indian Financial Filings
The first open benchmark designed to stress-test LLMs on India-specific financial intelligence.
A set of 100 carefully curated questions sourced from the corporate filings of Indian listed companies.
Dataset: 100 Questions
Companies: 100+ Indian Listed Firms
Coverage: FY23–FY26
License: CC-BY-NC 4.0
🤗 Dataset on Hugging Face | View Dataset Details
Why We Built This
At CompoundingAI, we live inside Indian corporate filings - reading them, parsing them, and building systems that can reason over them. Every time we tried to assess how well an LLM understood Indian finance - whether it could reliably extract figures from an NTPC annual report, cross-reference CWIP (capital work-in-progress) schedules, or parse India-specific accounting disclosures under Ind AS (Indian Accounting Standards) - we ran into the same wall: there was no benchmark to test against.
Existing financial benchmarks like FinBen and FinQA are excellent, but they are overwhelmingly built on US market data - SEC filings, 10-K reports, and earnings calls from large American corporations. They capture almost nothing about the Indian listed equity universe, which has its own regulatory framework (Ind AS, SEBI disclosures), its own document formats (BSE/NSE filings and IPO documents such as DRHPs), and its own complexity landscape.
The gap we saw: To our knowledge, no publicly available benchmark evaluated LLMs on questions grounded in Indian annual reports - not a single dataset covered the Ind AS accounting landscape, India-specific balance sheet structures, or the kinds of cross-company reasoning that BSE-listed equity analysis demands.
We built IndFin-Bench to fill that gap and to give it back to the community. Whether you’re fine-tuning an LLM for equity research, evaluating a RAG pipeline against domestic filings, or simply trying to measure how well a model understands Indian finance - there’s now a publicly available benchmark to test against.
The Dataset at a Glance
Schema
Each benchmark record is a structured evaluation unit with the following fields:
Year Coverage
The dataset spans filings across four fiscal years - FY23 to FY26. This temporal range tests not only whether a model can retrieve information, but whether it retrieves the right information from the right period - a distinction that matters enormously when year-over-year comparisons are involved.
Complexity Levels
Though IndFin-Bench has four complexity tiers, it is intentionally a factual benchmark at its core - every answer is a number or statement that exists in a filing. No subjective interpretation, no open-ended analysis. The tiers simply reflect how many companies and data points a question involves:
Sample Questions
Ground Truth: Atomic Facts, Manually Verified
Each question in IndFin-Bench has a corresponding atomic fact - a precise, minimal statement of the correct answer derived directly from official company filings. The term “atomic” is deliberate: the ground truth is stripped down to the single verifiable claim the question is asking for, without narrative padding or interpretive overlay.
Atomic facts in this dataset have been manually verified against primary filings. We did not use model-generated answers as ground truth, and there is no automated labelling. Each answer was cross-checked against primary sources - annual reports, investor presentations, BSE/NSE filings - before being included.
The Benchmark in Action
With verified ground truth in place, we had a reliable baseline to measure against. The natural next question: how do today’s leading LLMs actually perform when tested on these questions without any retrieval support?
This matters because in practice, most analysts reach for general-purpose models first - ChatGPT, Claude, Gemini - before investing in custom pipelines. Understanding how these models perform on India-specific financial questions, without any retrieval augmentation or domain fine-tuning, gives a clear picture of what works out of the box and where the gaps begin.
How General LLMs Performed
We evaluated three widely used general-purpose language models against IndFin-Bench: GPT-5.4, Claude Sonnet 4.6, and Gemini 3 (all models in thinking mode). Each model was asked every question as-is, without any special prompting, retrieval augmentation, or domain fine-tuning.
Evaluation Methodology
Each model response was classified into one of three categories:
The distinction between Hallucinate and Wrong is deliberate and important. A wrong answer reveals an information retrieval or contextual error; a hallucination reveals the model fabricating financial data entirely - a far more dangerous failure mode in practice.
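As a rough illustration of how such a three-way classification turns into headline rates, here is a minimal Python sketch. It is not the authors' evaluation code: the label name "Correct" is an assumption (the text names only Hallucinate and Wrong), the function name `tally` is hypothetical, and the example labels are made up rather than taken from the benchmark.

```python
from collections import Counter

# "Wrong" and "Hallucinate" are named in the text; "Correct" is an assumed
# name for the third category.
LABELS = ("Correct", "Wrong", "Hallucinate")


def tally(labels):
    """Return the share of responses in each category as a fraction of the total."""
    if not labels:
        raise ValueError("no labels to tally")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    # Counter returns 0 for categories that never occur.
    return {name: counts[name] / total for name in LABELS}


# Hypothetical labels for five responses (not benchmark results).
example = ["Correct", "Wrong", "Hallucinate", "Correct", "Hallucinate"]
shares = tally(example)
print(shares)
```

Keeping Wrong and Hallucinate as separate counters, rather than a single "incorrect" bucket, is what lets the two failure modes be reported independently.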
The best-performing general model answered correctly on just 57 out of 100 questions. Two of the three models could not answer more than one-third of the benchmark correctly. Gemini hallucinated - meaning it returned a fabricated answer with no basis in any filing - on more than half of all questions evaluated. These are not edge-case queries about obscure companies. They are the kind of standard questions that show up in every equity research workflow.
Performance by Complexity
We grouped the four complexity tiers into three difficulty levels - Easy (single-company questions), Medium (multi-company, single fact), and Hard (multi-company, multi-fact) - to understand how model performance varies across question complexity.
The grouped results reveal a consistent pattern. ChatGPT (GPT-5.4), the strongest performer overall, gets 66% right on easy single-company questions but only 44% on medium and 30% on hard. Accuracy more than halves from the simplest to the most complex level. Claude and Gemini never get above 35% even on easy questions, and both fall to 20% on hard multi-company queries. What’s notable is the convergence: on the hardest questions, the gap between the best model (ChatGPT at 30%) and the worst (Claude and Gemini at 20%) is just ten percentage points. The difficulty doesn’t just lower accuracy - it levels the field.
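The grouping described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the four tier identifiers are hypothetical (the text says only that single-company questions are Easy, multi-company single-fact questions are Medium, and multi-company multi-fact questions are Hard), the function name is ours, and the sample outcomes are invented, not benchmark data.

```python
from collections import defaultdict

# Hypothetical tier identifiers; only the Easy/Medium/Hard definitions
# come from the text.
TIER_TO_DIFFICULTY = {
    "single_company_single_fact": "Easy",
    "single_company_multi_fact": "Easy",
    "multi_company_single_fact": "Medium",
    "multi_company_multi_fact": "Hard",
}


def accuracy_by_difficulty(results):
    """Map (tier, is_correct) pairs to {difficulty: fraction answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for tier, is_correct in results:
        level = TIER_TO_DIFFICULTY[tier]
        total[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / total[level] for level in total}


# Made-up outcomes for four questions (not benchmark results).
sample = [
    ("single_company_single_fact", True),
    ("single_company_multi_fact", False),
    ("multi_company_single_fact", True),
    ("multi_company_multi_fact", False),
]
grouped = accuracy_by_difficulty(sample)
print(grouped)
```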
What This Means for Indian Financial Research
The lesson from IndFin-Bench is not that AI is unfit for financial research. It is that general-purpose AI - untethered from actual Indian financial documents - is unfit for Indian financial research.
The gap between a 57% accuracy rate and what a research analyst actually needs is not a small one. When the inputs to an investment decision are 30% hallucinated or frequently wrong, the tool cannot be trusted. And in markets where data quality and primary-source rigour matter - where the difference between consolidated and standalone, between current and non-current, between one fiscal year and the next carries real weight - this matters enormously.
Error Compounds
Every question in IndFin-Bench is independent - whether it’s a single lookup or a multi-company comparison, it stands on its own. But real analyst work doesn’t stop at one question. A research note might chain five or ten such questions together: revenue, margins, debt, capex, peer comparison. A sector screen across ten companies could involve dozens - each one feeding into the next.
Each step depends on the previous ones being correct. If step one returns a hallucinated revenue figure, every ratio, comparison, and conclusion built on it is wrong - silently. The error doesn’t announce itself.
The model is multiplicative, not additive. If each question has an independent probability p of being correct, then a report with n questions has a probability of p^n of being fully correct - not n × p.
Here’s what that looks like using the accuracy rates from our benchmark:
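The arithmetic can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' own calculation: the rates 57%, 66%, and 30% are the best-model figures quoted earlier in this post, while the function name `prob_all_correct` and the chain lengths in `STEPS` are assumptions chosen for illustration. It also assumes, as stated above, that each question is answered correctly independently of the others.

```python
def prob_all_correct(p, n):
    """Probability that n independent questions, each correct with probability p, are all correct."""
    return p ** n


# Best-model accuracy figures quoted in the text.
RATES = {
    "overall (57%)": 0.57,
    "easy questions (66%)": 0.66,
    "hard questions (30%)": 0.30,
}

# Illustrative chain lengths; not taken from the benchmark.
STEPS = (1, 3, 5, 10)

for label, p in RATES.items():
    row = ", ".join(f"n={n}: {prob_all_correct(p, n):.1%}" for n in STEPS)
    print(f"{label}: {row}")
```

At 57% per-question accuracy, for example, a five-question chain comes out fully correct only about 6% of the time (0.57^5 ≈ 0.060).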
That’s precisely what we’re building at CompoundingAI - not a general-purpose assistant with a finance skin, but research workflows grounded in actual Indian filings. Every number traceable to a source document and page. Abstention when the evidence isn’t there. Built for the kind of repeatable, multi-step analysis that institutional research demands, from earnings sprints to forensic deep dives.
Access the Dataset
IndFin-Bench exists to make that standard measurable, open, and community-owned. If you’re building in this space, we hope it’s useful. If you run your own evaluations against it, we’d love to see the results.