A Hands-On Workshop for Discovering What Works and What Doesn't
Discover where LLMs excel and where they struggle through hands-on experimentation
Experiment with LLMs for brainstorming, understanding papers, and exploring proofs
Test how LLM performance changes based on the context you provide
Practice actual research workflows: from paper understanding to proof verification
Throughout this workshop, document what you observe. When possible, verify claims independently. Keep track of what you can check versus what you must take on faith.
Reality check: You'll waste time on dead ends. LLMs can send you down rabbit holes with plausible-sounding but incorrect approaches.
💡 Strategy: Set 10-15 minute time limits per direction. If you're not making progress, pivot.
Work with one of four recent ML papers (all 2025). Some are empirical—use LLMs to help build theoretical understanding.
Recommended - learn both verification and theory-building
Bring a recent paper from your research area (published after March 2025). Apply exercises to your domain.
For those with specific research interests
Work through IMO 2025 Problem 6 (combinatorics). Verify small cases, check construction.
For those preferring pure mathematics
Data selection open problem (COLT 2025). Learn to stay grounded when there's no ground truth.
For exploring proof generation on unsolved problems
Pick one paper based on your background. All published in 2025.
Authors: Medvedev, Lyu, Yu, Arora, Li, Srebro
Paper: arxiv:2503.02877
Main Result: Proves conditions under which a larger student model trained on noisy teacher labels can outperform the teacher in a random features setting.
Your Task: In the random-features toy model, prove when a larger student beats the teacher; show the limits of this phenomenon.
✓ Recommended if less familiar with theory - most accessible math
Authors: Aayush Karan, Yilun Du (Harvard)
Paper: arxiv:2510.14901
Main Result: Shows that MCMC sampling from "power distributions" p^α of base LLMs can match RL-posttraining performance without any training, using only inference-time compute.
Your Task: This is primarily empirical work. Build theoretical understanding: prove bounds on when/why sharpening distributions helps, analyze the MCMC algorithm's convergence, or characterize what base model properties enable this.
✓ Recommended for those interested in test-time compute & sampling theory
Paper: arxiv:2505.23683
Main Focus: A theoretical result on curriculum-based learnability for compositional functions.
Your Task: Verify the paper's learnability claims, test the assumptions of their model, and explore the implications of the easy-to-hard curriculum.
✓ Recommended for those interested in theoretical analysis of reasoning
Authors: Liang, Li, Fei-Fei, Candès, Hashimoto, Zettlemoyer, Hajishirzi, et al.
Paper: arxiv:2501.19393
Main Result: Empirical demonstration that scaling test-time compute improves reasoning performance.
Your Task: Reproduce a tiny best-of-N effect; prove a monotonicity/concentration bound in a simple error model.
✓ Recommended if interested in the empirical↔theory bridge
From the 2025 International Mathematical Olympiad (July, Australia):
Consider a 2025 × 2025 grid of unit squares. Matilda wishes to place on the grid some rectangular tiles, possibly of different sizes, such that each side of every tile lies on a grid line and every unit square is covered by at most one tile.
Determine the minimum number of tiles Matilda needs to place so that each row and each column of the grid has exactly one unit square that is not covered by any tile.
Answer: 2112 tiles (= 2025 + 2·45 - 3, since √2025 = 45). General formula for an n×n grid: ⌈n + 2√n - 3⌉
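As a quick numeric sanity check (a sketch, not a proof of optimality), the formula can be evaluated with exact integer arithmetic; these are the values you will try to verify by hand for small n in Stage 3:

    # Python sketch: evaluate ceil(n + 2*sqrt(n) - 3) exactly, no floating point.
    from math import isqrt

    def min_tiles(n):
        # ceil(2*sqrt(n)) = smallest integer k with k*k >= 4*n
        k = isqrt(4 * n)
        if k * k < 4 * n:
            k += 1
        return n + k - 3

    for n in [1, 2, 3, 4, 9, 2025]:
        print(n, min_tiles(n))   # n = 2025 should print 2112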
📊 Context: LLM Progress on IMO Problems
IMO 2025 Problem 6 was not solved by any LLM during the competition. Still, progress from 2024 to 2025 has been remarkable: the field moved from specialized systems trained specifically for olympiad problems to general-purpose reasoning models (recent GPT, Claude, and Gemini releases) solving IMO-level problems. This workshop explores where we are on that frontier.
Your Bounded Tasks:
Note: Use Evan Chen's IMO 2025 Solutions as your reference for verification
From COLT 2025 (Conference on Learning Theory):
Authors: Steve Hanneke, Shay Moran, Alexander Shlimovich, Amir Yehudayoff
Central Question: Given a learning rule 𝒜 and a selection budget n, how well can 𝒜 perform when trained on n examples selected from a larger dataset?
The paper poses concrete open problems in basic regression settings, including mean estimation and linear regression. Unlike the closed problems in the other tracks, you cannot check your answers against a solution key.
Status: Open problem - no known solution
Your Bounded Tasks (what you CAN verify):
⚠️ Different Calibration Challenge:
With open problems, you'll have more ? and ⚠ in your verification log. The learning objective shifts: instead of "can I verify this proof," it's "how do I stay grounded when there's no ground truth?" This mirrors real research.
Source: COLT 2025 Proceedings
Document your experiments, verification attempts, and insights
The observation log helps you:
Tip: Fill it out as you go, not at the end. The verification log in Stage 3 is especially important.
⚠️ Flexible Timing: Times are approximate. Stage 1 may take longer if you need to read your paper first. Don't worry about finishing everything—focus on depth over breadth. Many tasks have options (pick what's most interesting).
💬 Pair Work: Discuss with your partner throughout. Share screens, debate verification steps, and challenge each other's assumptions.
Testing what the LLM knows without context
Test what the LLM can do without seeing the paper. Can it prove the main result from just the theorem statement? This establishes a baseline for comparing later stages.
Without providing any context from the paper, ask the LLM to prove the main result. Use your paper's theorem statement but don't paste the paper's proof or notation.
Example for Weak-to-Strong paper:
"Prove that in a random features model, a larger student network trained
on labels from a smaller teacher network can achieve better generalization
than the teacher itself. Provide a rigorous proof with clear conditions."
Example for IMO Problem 6:
"For an n×n grid where each row and column must have exactly one uncovered square,
what is the minimum number of rectangular tiles needed? Derive the formula and prove
it's optimal. Start with small cases like n=3 or n=4."
Example for Open Problem (Track D):
"Consider the data selection problem: given a dataset and budget n, how well can
we do by selecting n examples for training? Propose an approach for mean estimation
and derive bounds on the expected error. What selection strategy is optimal?"
Document: What proof strategy does it attempt? Does it cite related work? What assumptions does it make? How confident does it sound? Which steps can you verify?
Share with your partner: What did the LLM try? How confident did it sound? What could you verify vs what seemed questionable? Keep these observations — you'll compare with later stages.
Using LLMs to understand complex proofs
Now provide the actual paper (or key excerpts). Experiment with using the LLM to understand notation, proof structure, and key insights.
Provide the paper's notation section and main theorem statement. Ask:
"Here is the notation and main theorem from a paper [paste notation + theorem].
Can you:
1. Explain this notation in simpler terms
2. Rewrite the theorem statement using more standard notation
3. Identify what makes this result non-trivial
4. What are the key assumptions that might be unrealistic?"
Goal: Can the LLM help you understand unfamiliar notation? Does it correctly identify the key difficulties?
Provide the proof's introduction and overview. Ask:
"Based on this proof overview [paste text], create:
1. A high-level proof roadmap with major steps
2. Identify which steps are 'standard' vs 'novel'
3. For each novel step, explain the key insight
4. What lemmas would need to be proven to make this rigorous?"
Test if the LLM can relate this to familiar concepts:
"[Describe the main result in one sentence]
Can you:
1. Explain this phenomenon using an analogy from another field
2. Relate this to other known results in ML theory or optimization
3. What would this mean practically?"
Compare notes: Did the LLM help you understand the paper better? Which explanations were helpful vs confusing? Did it correctly identify the key difficulties?
Testing proof completion and bug-finding abilities
Papers often skip "routine" steps. Test whether the LLM can fill these gaps. Work through specific examples and document what you can verify independently.
💡 Choose Your Focus: Pick Task 3.1 OR 3.2 based on interest (or do both if time permits). Depth > breadth.
Select a lemma from the paper that has a condensed proof. Provide it and ask:
"Here's a lemma from the paper: [paste lemma and its brief proof]
The proof jumps from step X to the conclusion. Can you:
1. Fill in the missing steps rigorously
2. State any implicit assumptions being used
3. Is this jump actually valid or is there a gap?"
Your task: Try to verify each step independently. Document which steps you can check vs which you must take on faith. This log is valuable data.
Time management: If you're stuck debugging an LLM error for >10 minutes, move on and document it as ⚠ (had to trust) or ✗ (found error, couldn't fix).
Work with your partner and the LLM on your paper's specific bounded task:
For Weak-to-Strong paper:
"In the random features toy model, help me:
1. Construct the simplest case where a larger student beats the teacher
2. Verify the generalization bounds for both
3. What happens if we remove the 'larger student' assumption?
4. Can you prove or find a counterexample?"
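If you want a concrete playground for this conversation, a minimal random-features simulation looks roughly like the sketch below. This is an illustrative setup, not the paper's exact model; the dimensions, noise level, ridge penalty, and feature counts are all assumptions you should vary.

    # Python sketch: student with more random features trained on a teacher's labels.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_train, n_test = 20, 200, 2000
    X = rng.normal(size=(n_train, d))
    X_test = rng.normal(size=(n_test, d))
    w_star = rng.normal(size=d)                      # ground-truth linear target
    y = X @ w_star + 0.5 * rng.normal(size=n_train)  # noisy training labels

    def random_features_fit(X_tr, y_tr, X_te, n_feat, ridge=1e-3, seed=1):
        W = np.random.default_rng(seed).normal(size=(X_tr.shape[1], n_feat))
        Phi_tr, Phi_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)  # ReLU random features
        coef = np.linalg.solve(Phi_tr.T @ Phi_tr + ridge * np.eye(n_feat), Phi_tr.T @ y_tr)
        return Phi_tr @ coef, Phi_te @ coef

    teacher_train_pred, teacher_test_pred = random_features_fit(X, y, X_test, n_feat=30)
    # The student never sees y: it is trained on the teacher's (imperfect) predictions.
    _, student_test_pred = random_features_fit(X, teacher_train_pred, X_test, n_feat=500)

    y_test = X_test @ w_star
    print("teacher test MSE:", np.mean((teacher_test_pred - y_test) ** 2))
    print("student test MSE:", np.mean((student_test_pred - y_test) ** 2))

Whether the student actually beats the teacher depends on these knobs; the value is having a setup you and the LLM can interrogate together while working through the prompt above.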
For Reasoning with Sampling paper:
"This paper shows MCMC sampling from p^α matches RL performance:
1. Derive a bound: when does sharpening (α>1) provably help vs hurt?
2. Analyze their Metropolis-Hastings algorithm: prove/disprove convergence
3. Simple case: 2-token sequences, compute exact stationary distribution
4. What base model properties make this work? Formalize the intuition."
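A tiny numeric companion for item 3 of the prompt above (a sketch under the assumption of a toy 2-token model with a fully known joint distribution; real LLMs only expose per-token conditionals):

    # Python sketch: sharpening a toy sequence distribution p into p^alpha.
    import itertools
    import numpy as np

    vocab = ["a", "b", "c"]
    rng = np.random.default_rng(0)
    seqs = list(itertools.product(vocab, repeat=2))       # all 2-token sequences
    p = rng.dirichlet(np.ones(len(seqs)))                  # toy base distribution

    def sharpen(p, alpha):
        q = p ** alpha
        return q / q.sum()                                 # normalized p^alpha

    for alpha in [1.0, 2.0, 4.0]:
        q = sharpen(p, alpha)
        top = seqs[int(np.argmax(q))]
        print(f"alpha={alpha}: mass on argmax {top} = {q.max():.3f}")
    # As alpha grows, mass concentrates on the base model's mode; the MCMC question
    # is how to sample from p^alpha when you can only evaluate p token by token.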
For Reasoning Model Thinking paper:
"This paper proves a learnability result for compositional functions:
1. Verify the problem setup and key definitions.
2. Test the 'easy-to-hard' curriculum idea on a toy example.
3. What are the key assumptions in their learnability proof?
4. Can you find a setting where their curriculum strategy might fail?"
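One way to make item 2 of the prompt concrete is the minimal sketch below; the compositional target, the model, and the "easy slice" curriculum are all my own assumptions, not the paper's construction:

    # Python sketch: easy-to-hard curriculum on a depth-2 compositional target
    # f(x) = XOR(XOR(x1, x2), XOR(x3, x4)).
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(4000, 4))
    y = (X[:, 0] ^ X[:, 1]) ^ (X[:, 2] ^ X[:, 3])
    X_test = rng.integers(0, 2, size=(2000, 4))
    y_test = (X_test[:, 0] ^ X_test[:, 1]) ^ (X_test[:, 2] ^ X_test[:, 3])

    easy = (X[:, 2] == 0) & (X[:, 3] == 0)       # easy slice: target reduces to XOR(x1, x2)
    all_idx = np.arange(len(X))

    def run(stages):
        clf = MLPClassifier(hidden_layer_sizes=(16,), random_state=0)
        for idx in stages:                        # each partial_fit call is one pass over that slice
            for _ in range(200):
                clf.partial_fit(X[idx], y[idx], classes=[0, 1])
        return clf.score(X_test, y_test)

    print("curriculum   :", run([np.where(easy)[0], all_idx]))
    print("no curriculum:", run([all_idx, all_idx]))   # same number of stages and updates
    # Outcomes vary with seed and width; the point is a harness you and the LLM can probe,
    # not evidence for or against the paper's claim.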
For Test-Time Scaling paper:
"In a simple error model for best-of-N sampling:
1. Prove a monotonicity bound (accuracy improves with N)
2. Derive a concentration inequality
3. What happens in the limit as N → ∞?
4. Verify the constants in a 2-class, 3-sample regime"
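For item 1 of this prompt, the simplest error model is worth simulating first: assume each of N samples is independently correct with probability p and an oracle verifier picks a correct answer whenever one exists, so accuracy is 1 − (1 − p)^N. A short sketch under that assumption:

    # Python sketch: best-of-N accuracy under an oracle-verifier error model.
    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.3                                        # per-sample success probability
    for N in [1, 2, 4, 8, 16]:
        trials = rng.random((100_000, N)) < p      # correctness of each sample
        empirical = trials.any(axis=1).mean()      # best-of-N succeeds if any sample is correct
        exact = 1 - (1 - p) ** N
        print(f"N={N:2d}  empirical={empirical:.3f}  exact={exact:.3f}")
    # Monotonicity in N is immediate in this model; the interesting work is relaxing the
    # oracle-verifier assumption (e.g., a noisy reward model) and proving what survives.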
For IMO Problem 6:
"Work through the IMO tiling problem step by step:
1. Solve by hand for n=3 and n=4 (verify the formula ⌈n + 2√n - 3⌉ gives 4 and 5)
2. Have the LLM explain the construction for n=9 (perfect square case)
3. Ask it to prove why the construction is valid
4. Test the Erdős-Szekeres lower bound argument on a simple example"
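To support items 1–3 of this prompt, a small checker helps you verify any construction the LLM (or you) proposes. This is a sketch; the tile format (top, left, bottom, right) with inclusive, 0-indexed coordinates is my own convention, not anything from the problem statement:

    # Python sketch: verify a proposed tiling of an n x n grid.
    def check_tiling(n, tiles):
        cover = [[0] * n for _ in range(n)]
        for (r1, c1, r2, c2) in tiles:             # inclusive corners, 0-indexed
            for r in range(r1, r2 + 1):
                for c in range(c1, c2 + 1):
                    cover[r][c] += 1
        if any(cover[r][c] > 1 for r in range(n) for c in range(n)):
            return False                            # tiles overlap
        rows_ok = all(sum(1 for c in range(n) if cover[r][c] == 0) == 1 for r in range(n))
        cols_ok = all(sum(1 for r in range(n) if cover[r][c] == 0) == 1 for c in range(n))
        return rows_ok and cols_ok

    # n = 2: two 1x1 tiles leaving (0,0) and (1,1) uncovered.
    print(check_tiling(2, [(0, 1, 0, 1), (1, 0, 1, 0)]))   # True, matching the formula's value of 2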
For Open Problem (Track D):
"Apply verification where possible to the data selection problem:
1. Verify: does the LLM correctly state the problem setup?
2. Check trivial bounds: what happens with n=2 vs n=∞?
3. Test a toy case: mean estimation with [1,2,10], budget n=2
4. If it cites existing work, verify those citations are real
5. For any proposed bound, check it on your toy example
Document what you CAN vs CANNOT verify without a solution key."
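For item 3 of this prompt, brute force is feasible. A sketch: squared error against the full-sample mean is one natural yardstick here (an assumption on my part; take the exact objective from the paper's problem statement):

    # Python sketch: exhaustive data selection for mean estimation on [1, 2, 10], budget n = 2.
    from itertools import combinations

    data = [1, 2, 10]
    full_mean = sum(data) / len(data)              # 13/3 ≈ 4.33
    for subset in combinations(data, 2):
        est = sum(subset) / 2
        print(subset, "estimate:", est, "squared error:", (est - full_mean) ** 2)
    # Best budget-2 subset here is (1, 10) -> estimate 5.5; use this tiny case to
    # sanity-check any selection rule or bound the LLM proposes before trusting it.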
Verification Log: As you work, document:
Have the LLM act as a skeptical reviewer:
"You are a skeptical reviewer for a top ML conference. Here's a proof that
claims [paste main claim + proof sketch]. Your job:
1. Find potential gaps or errors in the reasoning
2. Identify unstated assumptions that might not hold
3. Suggest counterexamples if the conditions are weakened
4. What ablations would make this result stronger?"
Meta-Question: Did it identify real issues or invent fake ones? How do you tell the difference?
Your verification log is the most valuable output from this stage. Bring it to the synthesis discussion — we'll compare what different teams discovered.
Using LLMs for research ideation
Test the LLM's ability to help with research extensions and new directions.
"This paper proves [state main result]. Can you suggest:
1. Three ways to generalize this result (different settings, relaxed assumptions, etc.)
2. For each, explain what would need to change in the proof
3. Which generalization seems most tractable and why?
4. What would be the most impactful extension?"
"This result about [main phenomenon]...
1. What other areas of ML/math have similar phenomena?
2. Could techniques from those areas apply here?
3. What practical implications does this have?
4. What follow-up questions does this raise?"
Note: LLMs are good at brainstorming connections but verify any specific claims about other papers.
Share ideas: Which extensions seem most promising? Did the LLM suggest anything genuinely novel or just obvious next steps? Which connections were insightful vs superficial?
Share with your partner, then with the larger group: