ChemRAG Leaderboard

An open leaderboard for evaluating retrieval-augmented generation on chemistry tasks.
Submit your system to be ranked alongside existing baselines.

ChemRAG is a benchmark suite introduced by Ye et al. (2025) for evaluating retrieval-augmented generation on chemistry tasks. It defines the four benchmarks below and provides baseline results for LLM-only and ChemRAG-retrieval settings across several models.

We built this leaderboard as a community resource for tracking and comparing RAG systems, including our own Kurious retrieval system, against the published ChemRAG baselines. Anyone can submit new results and see how different retrieval strategies perform on chemistry tasks.
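In practice, a leaderboard entry boils down to one score per benchmark for your system. The record below is purely illustrative, since no submission format is defined here; the field names are hypothetical, but the scores map onto the four benchmarks listed in the table below.

```python
# Purely illustrative: the shape of a leaderboard entry. The field names and
# submission mechanism are hypothetical, not a defined ChemRAG schema.
submission = {
    "system": "my-rag-system",          # your retriever + generator stack
    "scores": {
        "ChemBench4K": 0.0,             # Accuracy
        "MMLU-Chem": 0.0,               # Accuracy
        "SciBench-Chem": 0.0,           # SciBench Score
        "Mol-Instructions": 0.0,        # Exact Match (EM)
    },
}
```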

Benchmarks

| Benchmark | Metric | # Questions | Description |
|---|---|---|---|
| ChemBench4K | Accuracy | 100 per task (8 tasks) | Molecule captioning, retrosynthesis, reaction prediction, and more |
| MMLU-Chem | Accuracy | 303 | Multiple-choice chemistry questions |
| SciBench-Chem | SciBench Score | 34–107 per subset | Numerical chemistry problems (atkins, chemmc, matter, quantum) |
| Mol-Instructions | Exact Match (EM) | 100 per task (4 tasks) | Forward reaction, description-guided design, retrosynthesis, reagent prediction |
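
The two string-matching metrics are simple to score. The sketch below shows one plausible implementation; the normalization choices (case-folding the multiple-choice answers, stripping whitespace for exact match) are our assumptions rather than the official ChemRAG evaluation code, and the SciBench Score is computed per SciBench's own grading protocol.

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions matching the reference answer, e.g. the
    multiple-choice letter for ChemBench4K and MMLU-Chem."""
    assert len(predictions) == len(references)
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return correct / len(references)


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions identical to the reference string after
    whitespace normalization, as for the Mol-Instructions tasks."""
    assert len(predictions) == len(references)
    correct = sum("".join(p.split()) == "".join(r.split())
                  for p, r in zip(predictions, references))
    return correct / len(references)


print(accuracy(["b", "C"], ["B", "C"]))   # 1.0 after case-folding
print(exact_match(["C C O"], ["CCO"]))    # 1.0 after whitespace removal
```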

Benchmark datasets are available on the ChemRAG HuggingFace page.
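
For example, a split can be pulled with the HuggingFace `datasets` library. The repository id below is a placeholder; substitute the actual dataset name listed on the ChemRAG HuggingFace page.

```python
from datasets import load_dataset

# Placeholder repo id: replace with the real name from the ChemRAG
# HuggingFace page.
dataset = load_dataset("chemrag/mmlu-chem", split="test")

# Inspect a few questions before running your RAG system over the split.
for example in dataset.select(range(3)):
    print(example)
```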

Sources

Baseline numbers come from the ChemRAG paper (Ye et al., 2025); all other entries are community submissions to this leaderboard.

Summary scores across all benchmarks