ChemRAG Leaderboard
An open leaderboard for evaluating retrieval-augmented generation on chemistry tasks.
Submit your system to be ranked alongside existing baselines.
ChemRAG is a benchmark introduced by Zhong et al. (2025) designed specifically to evaluate retrieval-augmented generation on chemistry tasks. It defines the four benchmarks below and provides baseline results for LLM-only and ChemRAG-retrieval settings across several models.
We built this leaderboard as a community resource for tracking and comparing RAG systems, including our own Kurious retrieval system, against the published ChemRAG baselines. Anyone can submit new results and see how different retrieval strategies perform on chemistry tasks.
Benchmarks
| Benchmark | Metric | # Questions | Description |
|---|---|---|---|
| ChemBench4K | Accuracy | 100 per task (8 tasks) | Molecule captioning, retrosynthesis, reaction prediction, and more |
| MMLU-Chem | Accuracy | 303 | Multiple-choice chemistry questions |
| SciBench-Chem | SciBench Score | 34–107 per subset | Numerical chemistry problems (atkins, chemmc, matter, quantum) |
| Mol-Instructions | Exact Match (EM) | 100 per task (4 tasks) | Forward reaction, description-guided design, retrosynthesis, reagent prediction |
Benchmark datasets are available on the ChemRAG HuggingFace page.
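As a quick sketch, a benchmark subset can be pulled with the Hugging Face `datasets` library. The repo ID and split below are illustrative placeholders, not confirmed identifiers; check the ChemRAG HuggingFace page for the actual dataset names.

```python
# Hypothetical example of loading one benchmark subset with the `datasets`
# library. "chemrag/mmlu-chem" is a placeholder repo ID, not a confirmed name.
from datasets import load_dataset

ds = load_dataset("chemrag/mmlu-chem", split="test")  # placeholder repo ID
print(len(ds))   # expected: 303 questions for MMLU-Chem
print(ds[0])     # inspect one multiple-choice question
```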
Sources
- ChemRAG: baseline (no retrieval) and ChemRAG retrieval results from Zhong et al. (2025)
Summary scores across all benchmarks
| Rank | System | Model | Avg | ChemBench4K | MMLU-Chem | SciBench-Chem | Mol-Instructions | Submitted by |
|---|---|---|---|---|---|---|---|---|
| #10 | DeepSeek-R1-Llama-8B + ChemRAG | Llama-3.1-70B-Instruct | 58.1% | 58.4% | 85.5% | 43.6% | 45.0% | AIntropy |
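The Avg column appears to be the unweighted mean of the four benchmark scores: for the row above, (58.4 + 85.5 + 43.6 + 45.0) / 4 = 58.125, displayed as 58.1%. A minimal sketch of that aggregation, assuming equal weighting and with illustrative names:

```python
# Sketch of the summary-score aggregation. Assumption: Avg is the unweighted
# mean of the four per-benchmark scores, which matches the row shown above.
BENCHMARKS = ["ChemBench4K", "MMLU-Chem", "SciBench-Chem", "Mol-Instructions"]

def summary_score(scores: dict[str, float]) -> float:
    """Average the per-benchmark scores (in %) into one leaderboard number."""
    return sum(scores[b] for b in BENCHMARKS) / len(BENCHMARKS)

row = {"ChemBench4K": 58.4, "MMLU-Chem": 85.5,
       "SciBench-Chem": 43.6, "Mol-Instructions": 45.0}
print(summary_score(row))  # 58.125, shown as 58.1% on the leaderboard
```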