ChemRAG Leaderboard
An open leaderboard for evaluating retrieval-augmented generation on chemistry tasks.
Submit your system to be ranked alongside existing baselines.
ChemRAG is a benchmark introduced by Zhong et al. (2025) designed specifically to evaluate retrieval-augmented generation on chemistry tasks. It defines the four benchmarks below and provides baseline results for LLM-only and ChemRAG-retrieval settings across several models.
We built this leaderboard as a community resource for tracking and comparing RAG systems, including our own Kurious retrieval system, against the published ChemRAG baselines. Anyone can submit new results and see how different retrieval strategies perform on chemistry tasks.
Benchmarks
| Benchmark | Metric | # Questions | Description |
|---|---|---|---|
| ChemBench4K | Accuracy | 100 per task (8 tasks) | Molecule captioning, retrosynthesis, reaction prediction, and more |
| MMLU-Chem | Accuracy | 303 | Multiple-choice chemistry questions |
| SciBench-Chem | SciBench Score | 34–107 per subset | Numerical chemistry problems (atkins, chemmc, matter, quantum) |
| Mol-Instructions | Exact Match (EM) | 100 per task (4 tasks) | Forward reaction, description-guided design, retrosynthesis, reagent prediction |
Benchmark datasets are available on the ChemRAG HuggingFace page.
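As a quick sketch, a benchmark subset can be pulled with the Hugging Face `datasets` library. The repo ID and split below are illustrative placeholders, not confirmed identifiers; check the ChemRAG HuggingFace page for the actual dataset names.

```python
# Hypothetical example of loading one benchmark subset with the `datasets`
# library. "chemrag/mmlu-chem" is a placeholder repo ID, not a confirmed name.
from datasets import load_dataset

ds = load_dataset("chemrag/mmlu-chem", split="test")  # placeholder repo ID
print(len(ds))   # expected: 303 questions for MMLU-Chem
print(ds[0])     # inspect one multiple-choice question
```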
Sources
- ChemRAG: baseline (no retrieval) and ChemRAG retrieval results from Zhong et al. (2025)
Summary scores across all benchmarks
| Rank | System | Model | Avg | ChemBench4K | MMLU-Chem | SciBench-Chem | Mol-Instructions | Submitted by |
|---|---|---|---|---|---|---|---|---|
| #10 | DeepSeek-R1-Llama-8B + ChemRAG | Llama-3.1-70B-Instruct | 58.1% | 58.4% | 85.5% | 43.6% | 45.0% | AIntropy |
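The Avg column appears to be the unweighted mean of the four benchmark scores: for the row above, (58.4 + 85.5 + 43.6 + 45.0) / 4 = 58.125, displayed as 58.1%. A minimal sketch of that aggregation, assuming equal weighting and with illustrative names:

```python
# Sketch of the summary-score aggregation. Assumption: Avg is the unweighted
# mean of the four per-benchmark scores, which matches the row shown above.
BENCHMARKS = ["ChemBench4K", "MMLU-Chem", "SciBench-Chem", "Mol-Instructions"]

def summary_score(scores: dict[str, float]) -> float:
    """Average the per-benchmark scores (in %) into one leaderboard number."""
    return sum(scores[b] for b in BENCHMARKS) / len(BENCHMARKS)

row = {"ChemBench4K": 58.4, "MMLU-Chem": 85.5,
       "SciBench-Chem": 43.6, "Mol-Instructions": 45.0}
print(summary_score(row))  # 58.125, shown as 58.1% on the leaderboard
```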