BFS-Prover-V2

A SoTA step-level theorem prover built for real use in Lean

Ran Xin*1 Zeyu Zheng*1,2 Yanchen Nie*1,3 Kun Yuan3 Xia Xiao1

* Equal contribution

1ByteDance Seed    2Carnegie Mellon University    3Peking University



Interactive Theorem Proving Done Right

While whole-proof models dominate benchmarks, their inference is less efficient, their proofs are often less natural (deeply nested `have` statements), they struggle with advanced tactic usage, and they offer limited interactivity. Step-level provers generate one tactic at a time, enabling fast, tactically expressive, and interactive theorem proving in Lean.

- miniF2F: 95.08%, on par with the best whole-proof models
- ProofNet: 41.4%, setting a new SoTA
- miniF2F: +19.3% with the planner and +10.3% with pure tree search over the previous best step-level prover
- 💻 Ollama: real-time tactic suggestions in Lean on your own laptop, via Ollama and LLMLean

Live Demo in Lean

Watch BFS-Prover-V2 in action on a personal laptop with LLMLean integration in VS Code

Introduction

We introduce BFS-Prover-V2, the state-of-the-art open-source step-level theorem-proving system for Lean 4, designed to address the dual challenges of scaling both training and inference in neural theorem proving. Building on the foundation of BFS-Prover-V1, it introduces improvements that enable sustained learning over extended training periods and efficient exploration of complex proof spaces.

BFS-Prover-V2 tackles these challenges on two fronts: (1) Training-time scaling: a multi-stage expert iteration framework with adaptive tactic-level data filtering and periodic retraining, which surmounts the performance plateaus that typically curtail long-term post-training; (2) Inference-time scaling: a planner-enhanced multi-agent tree search system for hierarchical reasoning that scales performance at inference time.

BFS-Prover-V2 achieves 95.08% and 41.4% on the miniF2F and ProofNet test sets respectively, setting a new state-of-the-art for step-level provers.

Technical Overview

Scaling up Training: Multi-Stage Expert Iteration

Figure: Training-time scaling architecture (training pipeline)

A central challenge in scaling expert iteration is managing vast quantities of synthetic data while avoiding diminishing returns and mode collapse. We introduce two key innovations:

Adaptive Tactic Filtering

Instead of coarse problem-level filtering, we adopt fine-grained tactic-level filtering guided by the model's perplexity distribution over tactics. We filter out both low-perplexity tactics (overly simple, offering no learning signal) and high-perplexity tactics (likely noisy or suboptimal). This acts as automated curriculum learning, using the model's own uncertainty to identify the most valuable training data.
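The core of this filter fits in a few lines of Python. The sketch below is illustrative only: the TacticSample fields, threshold values, and sample data are hypothetical placeholders, not the actual values or code used in the BFS-Prover-V2 pipeline.

# Illustrative sketch of tactic-level perplexity filtering.
# Field names and thresholds are hypothetical, not BFS-Prover-V2's actual values.
from dataclasses import dataclass

@dataclass
class TacticSample:
    state: str         # Lean tactic state shown to the model
    tactic: str        # tactic generated for that state
    perplexity: float  # current prover's perplexity on (state -> tactic)

def filter_tactics(samples: list[TacticSample],
                   low: float = 1.1, high: float = 8.0) -> list[TacticSample]:
    """Keep only tactics in the informative perplexity band:
    <= low  : already mastered, little learning signal, dropped;
    >= high : likely noisy or suboptimal, also dropped."""
    return [s for s in samples if low < s.perplexity < high]

samples = [
    TacticSample("⊢ a + b = b + a", "ring", 1.02),                                # trivial: dropped
    TacticSample("⊢ 0 ≤ (a - b) ^ 2", "nlinarith [sq_nonneg (a - b)]", 3.4),      # kept
    TacticSample("⊢ a + b = b + a", "rw [add_comm, add_comm, add_comm]", 40.0),   # noisy: dropped
]
print(len(filter_tactics(samples)))  # -> 1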

Periodic Retraining: Escaping Local Optima

When the model plateaus after several expert iterations, we perform a "soft reset": (1) Use the current best prover to re-synthesize and de-noise all past proofs; (2) Retrain a fresh model from scratch on the refined dataset. This temporarily drops performance but increases exploration capacity, allowing the model to surpass previous peaks.
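Schematically, the multi-stage loop can be written as the following Python skeleton. Every callable it takes (run_search, finetune, resynthesize, train_from_scratch, plateaued) is a hypothetical placeholder for the corresponding stage, not part of any released API.

# Schematic of expert iteration with periodic soft resets.
# All injected callables are hypothetical placeholders.
def expert_iteration_with_resets(base_ckpt, theorems, num_stages,
                                 run_search, finetune, resynthesize,
                                 train_from_scratch, plateaued):
    model, corpus = base_ckpt, []
    for _ in range(num_stages):
        # Standard expert iteration until progress stalls.
        while not plateaued(model):
            new_proofs = run_search(model, theorems)  # tree search + Lean verification
            corpus.extend(new_proofs)
            model = finetune(model, new_proofs)       # after tactic-level filtering
        # Soft reset: de-noise the accumulated proofs with the current best
        # prover, then retrain a fresh model from scratch on the refined corpus.
        corpus = resynthesize(model, corpus)
        model = train_from_scratch(base_ckpt, corpus)
    return model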

Scaling up Inference: Planner-Enhanced Multi-Agent Search

Figure: Inference-time scaling architecture

Our hierarchical inference architecture mirrors how human mathematicians work: sketch a high-level plan, then fill in details. It consists of two key components:

Planner-Guided Search

The planner (a general-purpose reasoning LLM) decomposes the main theorem into a sequence of intermediate subgoals. The prover (our specialized model) solves subgoals sequentially via tree search. If a subgoal fails, the planner is re-queried to generate a revised plan that corrects the strategy.
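The interaction can be pictured as a simple control loop. In the sketch below, plan, prove_subgoal, and revise_plan are hypothetical stand-ins for the planner LLM and the prover's tree search; none of the names come from the released system.

# Illustrative planner-prover loop; all injected callables are hypothetical.
def prove_with_planner(theorem, plan, prove_subgoal, revise_plan, max_revisions=3):
    proved = []                              # (subgoal, proof) pairs, reusable as known facts
    subgoals = list(plan(theorem))           # planner decomposes the theorem into subgoals
    revisions = 0
    while subgoals:
        goal = subgoals[0]
        proof = prove_subgoal(goal, known_facts=proved)   # step-level tree search
        if proof is not None:
            proved.append((goal, proof))
            subgoals = subgoals[1:]
        elif revisions < max_revisions:
            # Re-query the planner with the failure so it can correct the strategy.
            subgoals = list(revise_plan(theorem, proved, failed=goal))
            revisions += 1
        else:
            return None                      # give up on this theorem
    return proved                            # subgoal proofs compose into the full proof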

Parallel Provers & Shared Subgoal Cache

Multiple prover agents work in parallel on a single subgoal at a time. Once the first agent finds a proof, all others immediately terminate to prevent redundant computation. Proven subgoals are cached and used as known facts in subsequent proofs.
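The early-termination pattern and the shared cache can be illustrated with ordinary Python threads. In the minimal sketch below, try_prove is a dummy stand-in for a real prover agent, and the whole script is an illustration of the pattern rather than the production implementation.

# Minimal sketch (Python 3.10+) of prover agents racing on one subgoal,
# with a shared cache of proven subgoals. `try_prove` is a dummy stand-in.
import concurrent.futures
import random
import threading
import time

subgoal_cache: dict[str, str] = {}   # subgoal -> proof, reused for later subgoals
cache_lock = threading.Lock()

def try_prove(subgoal: str, seed: int, stop: threading.Event) -> str | None:
    rng = random.Random(seed)
    for _ in range(200):
        if stop.is_set():                 # another agent already found a proof
            return None
        time.sleep(0.01)                  # placeholder for one tree-search expansion
        if rng.random() < 0.02:
            return f"<proof of {subgoal!r} found by agent {seed}>"
    return None

def prove_in_parallel(subgoal: str, num_agents: int = 4) -> str | None:
    with cache_lock:
        if subgoal in subgoal_cache:      # reuse a previously proven subgoal
            return subgoal_cache[subgoal]
    stop = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(num_agents) as pool:
        futures = [pool.submit(try_prove, subgoal, s, stop) for s in range(num_agents)]
        for fut in concurrent.futures.as_completed(futures):
            proof = fut.result()
            if proof is not None:
                stop.set()                # terminate the remaining agents
                with cache_lock:
                    subgoal_cache[subgoal] = proof
                return proof
    return None

if __name__ == "__main__":
    print(prove_in_parallel("a + b = b + a"))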

Benchmark Performance

Robust and generalizable performance across domains

Model                        | miniF2F-test | miniF2F-valid | ProofNet-test
BFS-Prover-V2-32B w/ Planner | 95.08%       | 95.5%         | —
BFS-Prover-V2-32B            | 86.1%        | 85.5%         | 41.4%
BFS-Prover-V2-7B             | 82.4%        | —             | —
Benchmark performance of BFS-Prover-V2 series

Getting Started

Use BFS-Prover-V2 in Lean via LLMLean

1. Install Ollama & Pull Model

Example using 7B model (q8_0 quantization):

ollama pull zeyu-zheng/BFS-Prover-V2-7B:q8_0

2. Add LLMLean to Lean Project

-- lakefile.lean
require llmlean from git
  "https://github.com/zeyu-zheng/llmlean.git"
  @ "bfs-pr"

# or lakefile.toml
[[require]]
name = "llmlean"
git = "https://github.com/zeyu-zheng/llmlean.git"
rev = "bfs-pr"

3. Configure LLMLean

Example using 7B model (q8_0 quantization):

# ~/.config/llmlean/config.toml
api = "ollama"
model = "zeyu-zheng/BFS-Prover-V2-7B:q8_0"
mode = "parallel"
prompt = "tacticstate"
responseFormat = "tactic"
numSamples = "5"

4. Use in Lean

import Mathlib
import LLMlean

example : ... := by
  llmstep ""  -- Get suggestions for next tactic
  -- or
  llmstep "rw"  -- Get suggestions for using `rw` next

Citation

@article{xin2025scaling,
  title={Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers},
  author={Xin, Ran and Zheng, Zeyu and Nie, Yanchen and Yuan, Kun and Xiao, Xia},
  journal={arXiv preprint arXiv:2509.06493},
  year={2025}
}