While whole-proof models dominate benchmarks, they tend to have less efficient inference, produce less natural proofs with deeply nested `have` statements, face challenges with advanced tactic usage, and offer limited interactivity. Step-level provers generate one tactic at a time, enabling fast, tactically expressive, and interactive theorem proving in Lean.
Watch BFS-Prover-V2 in action on a personal laptop with LLMLean integration in VS Code
We introduce BFS-Prover-V2, the state-of-the-art open-source step-level theorem proving system for Lean4, designed to address the dual challenges of scaling both training and inference in neural theorem proving. Building upon the foundation of BFS-Prover-V1, our approach introduces significant improvements that enable sustained learning over extended training periods and efficient exploration of complex proof spaces.
BFS-Prover-V2 addresses these challenges with two innovations: (1) Training-time scaling: a multi-stage expert iteration framework with adaptive tactic-level data filtering and periodic retraining, which surmounts the performance plateaus that typically curtail long-term post-training; (2) Inference-time scaling: a planner-enhanced multi-agent tree search system that enables hierarchical reasoning and scales performance at inference time.
BFS-Prover-V2 achieves 95.08% and 41.4% on the miniF2F and ProofNet test sets, respectively, setting a new state-of-the-art for step-level provers.
A central challenge in scaling expert iteration is managing vast quantities of synthetic data while avoiding diminishing returns and mode collapse. We introduce two key innovations:
Instead of coarse problem-level filtering, we adopt fine-grained tactic-level filtering guided by the perplexity distribution of sampled tactics. We filter out both low-perplexity tactics (overly simple, no learning signal) and high-perplexity tactics (likely noisy or suboptimal). This acts as automated curriculum learning, using the model's own uncertainty to identify valuable training data.
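To make the filtering rule concrete, here is a minimal Python sketch of tactic-level perplexity filtering; the record format and the threshold values are illustrative assumptions, not the exact settings used in training.

```python
import math

def tactic_perplexity(token_logprobs):
    """Perplexity of one generated tactic under the current prover."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_tactics(tactic_records, low=1.05, high=4.0):
    """Keep (proof_state, tactic) pairs whose perplexity falls in the
    informative middle band; the thresholds here are placeholders."""
    kept = []
    for state, tactic, logprobs in tactic_records:
        ppl = tactic_perplexity(logprobs)
        if ppl <= low:       # trivially predictable: no learning signal
            continue
        if ppl >= high:      # likely noisy or suboptimal: risks harming training
            continue
        kept.append((state, tactic))
    return kept
```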
When the model plateaus after several expert iterations, we perform a "soft reset": (1) Use the current best prover to re-synthesize and de-noise all past proofs; (2) Retrain a fresh model from scratch on the refined dataset. This temporarily drops performance but increases exploration capacity, allowing the model to surpass previous peaks.
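The overall schedule can be sketched as follows; the primitive operations (train, search, collect/filter tactics, resynthesize) are passed in as callables because their real implementations are not part of this post, and the stage and iteration counts are placeholders.

```python
def expert_iteration(base_ckpt, theorems, ops, stages=3, iters_per_stage=8):
    """ops provides: train(ckpt, data), search(ckpt, theorems),
    collect_tactics(proofs), filter_tactics(tactics), resynthesize(ckpt, proofs)."""
    proof_bank = []                                   # all verified proofs found so far
    model = base_ckpt
    for _ in range(stages):
        for _ in range(iters_per_stage):
            proof_bank += ops.search(model, theorems)                  # new Lean proofs via tree search
            data = ops.filter_tactics(ops.collect_tactics(proof_bank))
            model = ops.train(model, data)                             # continue fine-tuning
        # "Soft reset": de-noise the accumulated proofs with the current best
        # prover, then retrain a fresh model from scratch on the refined data.
        proof_bank = ops.resynthesize(model, proof_bank)
        model = ops.train(base_ckpt, ops.filter_tactics(ops.collect_tactics(proof_bank)))
    return model
```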
Our hierarchical inference architecture mirrors how human mathematicians work: sketch a high-level plan, then fill in details. It consists of two key components:
The planner (a general-purpose reasoning LLM) decomposes the main theorem into a sequence of intermediate subgoals. The prover (our specialized model) solves subgoals sequentially via tree search. If a subgoal fails, the planner is re-queried to generate a revised plan that corrects the strategy.
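In code, the interaction looks roughly like this; `planner.decompose`, `planner.revise`, `prover.tree_search`, and `prover.assemble` are hypothetical interfaces used for illustration, not the actual API of the released system.

```python
def prove_with_planner(theorem, planner, prover, max_replans=3):
    proven = {}                                   # subgoal -> proof, reused as known facts
    plan = planner.decompose(theorem)             # ordered list of intermediate Lean subgoals
    for _ in range(max_replans):
        failed = None
        for subgoal in plan:
            if subgoal in proven:
                continue
            proof = prover.tree_search(subgoal, facts=list(proven))
            if proof is None:
                failed = subgoal
                break
            proven[subgoal] = proof
        if failed is None:                        # every subgoal closed
            return prover.assemble(theorem, proven)
        # Re-query the planner for a revised plan that corrects the strategy.
        plan = planner.revise(theorem, failed=failed, proven=list(proven))
    return None
```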
Multiple prover agents work in parallel on a single subgoal at a time. Once the first agent finds a proof, all others immediately terminate to prevent redundant computation. Proven subgoals are cached and used as known facts in subsequent proofs.
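A minimal sketch of the first-success termination and subgoal caching, using Python threads; the agent interface is a hypothetical stand-in, and cancellation here is best-effort rather than the immediate termination used in the real system.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

proof_cache = {}  # proven subgoal -> proof, reused as a known fact later

def prove_subgoal_parallel(subgoal, facts, agents):
    """Run several prover agents on one subgoal; return the first proof found."""
    if subgoal in proof_cache:                         # reuse cached subgoals directly
        return proof_cache[subgoal]
    proof = None
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        pending = {pool.submit(a.tree_search, subgoal, facts=facts) for a in agents}
        while pending and proof is None:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                result = fut.result()
                if result is not None:
                    proof = result
                    break
        for fut in pending:                            # best-effort cancellation of the rest
            fut.cancel()
    if proof is not None:
        proof_cache[subgoal] = proof
    return proof
```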
Robust and generalizable performance across domains
| Model | miniF2F-test | miniF2F-valid | ProofNet-test |
|---|---|---|---|
| BFS-Prover-V2-32B w/ Planner | 95.08% | 95.5% | — |
| BFS-Prover-V2-32B | 86.1% | 85.5% | 41.4% |
| BFS-Prover-V2-7B | 82.4% | — | — |
Use BFS-Prover-V2 in Lean via LLMLean
First, pull the model with Ollama (this example uses the 7B model with q8_0 quantization):
ollama pull zeyu-zheng/BFS-Prover-V2-7B:q8_0
-- lakefile.lean
require llmlean from git
  "https://github.com/zeyu-zheng/llmlean.git" @ "bfs-pr"
# or lakefile.toml
[[require]]
name = "llmlean"
git = "https://github.com/zeyu-zheng/llmlean.git"
rev = "bfs-pr"
Then configure LLMLean to use the model (here, the 7B model with q8_0 quantization):
# ~/.config/llmlean/config.toml
api = "ollama"
model = "zeyu-zheng/BFS-Prover-V2-7B:q8_0"
mode = "parallel"
prompt = "tacticstate"
responseFormat = "tactic"
numSamples = "5"
import Mathlib
import LLMlean
example : ... := by
  llmstep ""   -- Get suggestions for the next tactic
  -- or
  llmstep "rw" -- Get suggestions for using `rw` next
@article{xin2025scaling,
  title={Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers},
  author={Xin, Ran and Zheng, Zeyu and Nie, Yanchen and Yuan, Kun and Xiao, Xia},
  journal={arXiv preprint arXiv:2509.06493},
  year={2025}
}