A Hands-On Workshop for Discovering What Works and What Doesn't
Discover where LLMs excel and where they struggle through hands-on experimentation
Experiment with LLMs for brainstorming, understanding papers, and exploring proofs
Test how LLM performance changes based on the context you provide
Practice actual research workflows: from paper understanding to proof verification
Throughout this workshop, document what you observe. When possible, verify claims independently. Keep track of what you can check versus what you must take on faith.
Reality check: You'll waste time on dead ends. LLMs can send you down rabbit holes with plausible-sounding but incorrect approaches.
💡 Strategy: Set 10-15 minute time limits per direction. If you're not making progress, pivot.
Work with one of four recent ML papers (all 2025). Some are empirical—use LLMs to help build theoretical understanding.
Recommended - learn both verification and theory-building
Bring a recent paper from your research area (published after March 2025). Apply exercises to your domain.
For those with specific research interests
Work through IMO 2025 Problem 6 (combinatorics). Verify small cases, check construction.
For those preferring pure mathematics
Data selection open problem (COLT 2025). Learn to stay grounded when there's no ground truth.
For exploring proof generation on unsolved problems
Pick one paper based on your background. All published in 2025.
Authors: Medvedev, Lyu, Yu, Arora, Li, Srebro
Paper: arxiv:2503.02877
Main Result: Proves conditions under which a larger student model trained on noisy teacher labels can outperform the teacher in a random features setting.
Your Task: In the random-features toy model, prove when a larger student beats the teacher; show the limits of this phenomenon.
✓ Recommended if less familiar with theory - most accessible math
Authors: Aayush Karan, Yilun Du (Harvard)
Paper: arxiv:2510.14901
Main Result: Shows that MCMC sampling from "power distributions" p^α of base LLMs can match RL-posttraining performance without any training, using only inference-time compute.
Your Task: This is primarily empirical work. Build theoretical understanding: prove bounds on when/why sharpening distributions helps, analyze the MCMC algorithm's convergence, or characterize what base model properties enable this.
✓ Recommended for those interested in test-time compute & sampling theory
Paper: arxiv:2505.23683
Main Focus: A theoretical result on curriculum-based learnability for compositional functions.
Your Task: Verify the paper's learnability claims, test the assumptions of their model, and explore the implications of the easy-to-hard curriculum.
✓ Recommended for those interested in theoretical analysis of reasoning
Authors: Liang, Li, Fei-Fei, Candès, Hashimoto, Zettlemoyer, Hajishirzi, et al.
Paper: arxiv:2501.19393
Main Result: Empirical demonstration that scaling test-time compute improves reasoning performance.
Your Task: Reproduce a tiny best-of-N effect; prove a monotonicity/concentration bound in a simple error model.
✓ Recommended if interested in the empirical↔theory bridge
From the 2025 International Mathematical Olympiad (July, Australia):
Consider a 2025 × 2025 grid of unit squares. Matilda wishes to place on the grid some rectangular tiles, possibly of different sizes, such that each side of every tile lies on a grid line and every unit square is covered by at most one tile.
Determine the minimum number of tiles Matilda needs to place so that each row and each column of the grid has exactly one unit square that is not covered by any tile.
Answer: 2112 tiles (= 2025 + 2·45 - 3, since √2025 = 45). General formula for an n×n grid: ⌈n + 2√n - 3⌉
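As a quick numeric sanity check (a sketch, not a proof of optimality), the formula can be evaluated with exact integer arithmetic; these are the values you will try to verify by hand for small n in Stage 3:

    # Python sketch: evaluate ceil(n + 2*sqrt(n) - 3) exactly, no floating point.
    from math import isqrt

    def min_tiles(n):
        # ceil(2*sqrt(n)) = smallest integer k with k*k >= 4*n
        k = isqrt(4 * n)
        if k * k < 4 * n:
            k += 1
        return n + k - 3

    for n in [1, 2, 3, 4, 9, 2025]:
        print(n, min_tiles(n))   # n = 2025 should print 2112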
📊 Context: LLM Progress on IMO Problems
IMO 2025 Problem 6 was not solved by any LLM during the competition. Still, progress from 2024 to 2025 has been remarkable: the field moved from specialized systems trained specifically for olympiad problems to general-purpose reasoning models (recent GPT, Claude, and Gemini releases) solving IMO-level problems. This workshop explores where we are on that frontier.
Your Bounded Tasks:
Note: Use Evan Chen's IMO 2025 Solutions as your reference for verification
From COLT 2025 (Conference on Learning Theory):
Authors: Steve Hanneke, Shay Moran, Alexander Shlimovich, Amir Yehudayoff
Central Question: Given a learning rule 𝒜 and a selection budget n, how well can 𝒜 perform when trained on n examples selected from a larger dataset?
The paper poses concrete open problems in basic regression settings, including mean estimation and linear regression. Unlike the closed problems in the other tracks, you cannot check your answers against a solution key.
Status: Open problem - no known solution
Your Bounded Tasks (what you CAN verify):
⚠️ Different Calibration Challenge:
With open problems, you'll have more ? and ⚠ in your verification log. The learning objective shifts: instead of "can I verify this proof," it's "how do I stay grounded when there's no ground truth?" This mirrors real research.
Source: COLT 2025 Proceedings
Document your experiments, verification attempts, and insights
The observation log helps you:
Tip: Fill it out as you go, not at the end. The verification log in Stage 3 is especially important.
⚠️ Flexible Timing: Times are approximate. Stage 1 may take longer if you need to read your paper first. Don't worry about finishing everything—focus on depth over breadth. Many tasks have options (pick what's most interesting).
💬 Pair Work: Discuss with your partner throughout. Share screens, debate verification steps, and challenge each other's assumptions.
Testing what the LLM knows without context
Test what the LLM can do without seeing the paper. Can it prove the main result from just the theorem statement? This establishes a baseline for comparing later stages.
Without providing any context from the paper, ask the LLM to prove the main result. Use your paper's theorem statement but don't paste the paper's proof or notation.
Example for Weak-to-Strong paper:
"Prove that in a random features model, a larger student network trained
on labels from a smaller teacher network can achieve better generalization
than the teacher itself. Provide a rigorous proof with clear conditions."
Example for IMO Problem 6:
"For an n×n grid where each row and column must have exactly one uncovered square,
what is the minimum number of rectangular tiles needed? Derive the formula and prove
it's optimal. Start with small cases like n=3 or n=4."
Example for Open Problem (Track D):
"Consider the data selection problem: given a dataset and budget n, how well can
we do by selecting n examples for training? Propose an approach for mean estimation
and derive bounds on the expected error. What selection strategy is optimal?"
Document: What proof strategy does it attempt? Does it cite related work? What assumptions does it make? How confident does it sound? Which steps can you verify?
Share with your partner: What did the LLM try? How confident did it sound? What could you verify vs what seemed questionable? Keep these observations — you'll compare with later stages.
Using LLMs to understand complex proofs
Now provide the actual paper (or key excerpts). Experiment with using the LLM to understand notation, proof structure, and key insights.
Provide the paper's notation section and main theorem statement. Ask:
"Here is the notation and main theorem from a paper [paste notation + theorem].
Can you:
1. Explain this notation in simpler terms
2. Rewrite the theorem statement using more standard notation
3. Identify what makes this result non-trivial
4. What are the key assumptions that might be unrealistic?"
Goal: Can the LLM help you understand unfamiliar notation? Does it correctly identify the key difficulties?
Provide the proof's introduction and overview. Ask:
"Based on this proof overview [paste text], create:
1. A high-level proof roadmap with major steps
2. Identify which steps are 'standard' vs 'novel'
3. For each novel step, explain the key insight
4. What lemmas would need to be proven to make this rigorous?"
Test if the LLM can relate this to familiar concepts:
"[Describe the main result in one sentence]
Can you:
1. Explain this phenomenon using an analogy from another field
2. Relate this to other known results in ML theory or optimization
3. What would this mean practically?"
Compare notes: Did the LLM help you understand the paper better? Which explanations were helpful vs confusing? Did it correctly identify the key difficulties?
Testing proof completion and bug-finding abilities
Papers often skip "routine" steps. Test whether the LLM can fill these gaps. Work through specific examples and document what you can verify independently.
💡 Choose Your Focus: Pick Task 3.1 OR 3.2 based on interest (or do both if time permits). Depth > breadth.
Select a lemma from the paper that has a condensed proof. Provide it and ask:
"Here's a lemma from the paper: [paste lemma and its brief proof]
The proof jumps from step X to the conclusion. Can you:
1. Fill in the missing steps rigorously
2. State any implicit assumptions being used
3. Is this jump actually valid or is there a gap?"
Your task: Try to verify each step independently. Document which steps you can check vs which you must take on faith. This log is valuable data.
Time management: If you're stuck debugging an LLM error for >10 minutes, move on and document it as ⚠ (had to trust) or ✗ (found error, couldn't fix).
Work with your partner and the LLM on your paper's specific bounded task:
For Weak-to-Strong paper:
"In the random features toy model, help me:
1. Construct the simplest case where a larger student beats the teacher
2. Verify the generalization bounds for both
3. What happens if we remove the 'larger student' assumption?
4. Can you prove or find a counterexample?"
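If you want a concrete playground for this conversation, a minimal random-features simulation looks roughly like the sketch below. This is an illustrative setup, not the paper's exact model; the dimensions, noise level, ridge penalty, and feature counts are all assumptions you should vary.

    # Python sketch: student with more random features trained on a teacher's labels.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_train, n_test = 20, 200, 2000
    X = rng.normal(size=(n_train, d))
    X_test = rng.normal(size=(n_test, d))
    w_star = rng.normal(size=d)                      # ground-truth linear target
    y = X @ w_star + 0.5 * rng.normal(size=n_train)  # noisy training labels

    def random_features_fit(X_tr, y_tr, X_te, n_feat, ridge=1e-3, seed=1):
        W = np.random.default_rng(seed).normal(size=(X_tr.shape[1], n_feat))
        Phi_tr, Phi_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)  # ReLU random features
        coef = np.linalg.solve(Phi_tr.T @ Phi_tr + ridge * np.eye(n_feat), Phi_tr.T @ y_tr)
        return Phi_tr @ coef, Phi_te @ coef

    teacher_train_pred, teacher_test_pred = random_features_fit(X, y, X_test, n_feat=30)
    # The student never sees y: it is trained on the teacher's (imperfect) predictions.
    _, student_test_pred = random_features_fit(X, teacher_train_pred, X_test, n_feat=500)

    y_test = X_test @ w_star
    print("teacher test MSE:", np.mean((teacher_test_pred - y_test) ** 2))
    print("student test MSE:", np.mean((student_test_pred - y_test) ** 2))

Whether the student actually beats the teacher depends on these knobs; the value is having a setup you and the LLM can interrogate together while working through the prompt above.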
For Reasoning with Sampling paper:
"This paper shows MCMC sampling from p^α matches RL performance:
1. Derive a bound: when does sharpening (α>1) provably help vs hurt?
2. Analyze their Metropolis-Hastings algorithm: prove/disprove convergence
3. Simple case: 2-token sequences, compute exact stationary distribution
4. What base model properties make this work? Formalize the intuition."
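A tiny numeric companion for item 3 of the prompt above (a sketch under the assumption of a toy 2-token model with a fully known joint distribution; real LLMs only expose per-token conditionals):

    # Python sketch: sharpening a toy sequence distribution p into p^alpha.
    import itertools
    import numpy as np

    vocab = ["a", "b", "c"]
    rng = np.random.default_rng(0)
    seqs = list(itertools.product(vocab, repeat=2))       # all 2-token sequences
    p = rng.dirichlet(np.ones(len(seqs)))                  # toy base distribution

    def sharpen(p, alpha):
        q = p ** alpha
        return q / q.sum()                                 # normalized p^alpha

    for alpha in [1.0, 2.0, 4.0]:
        q = sharpen(p, alpha)
        top = seqs[int(np.argmax(q))]
        print(f"alpha={alpha}: mass on argmax {top} = {q.max():.3f}")
    # As alpha grows, mass concentrates on the base model's mode; the MCMC question
    # is how to sample from p^alpha when you can only evaluate p token by token.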
For Reasoning Model Thinking paper:
"This paper proves a learnability result for compositional functions:
1. Verify the problem setup and key definitions.
2. Test the 'easy-to-hard' curriculum idea on a toy example.
3. What are the key assumptions in their learnability proof?
4. Can you find a setting where their curriculum strategy might fail?"
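One way to make item 2 of the prompt concrete is the minimal sketch below; the compositional target, the model, and the "easy slice" curriculum are all my own assumptions, not the paper's construction:

    # Python sketch: easy-to-hard curriculum on a depth-2 compositional target
    # f(x) = XOR(XOR(x1, x2), XOR(x3, x4)).
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(4000, 4))
    y = (X[:, 0] ^ X[:, 1]) ^ (X[:, 2] ^ X[:, 3])
    X_test = rng.integers(0, 2, size=(2000, 4))
    y_test = (X_test[:, 0] ^ X_test[:, 1]) ^ (X_test[:, 2] ^ X_test[:, 3])

    easy = (X[:, 2] == 0) & (X[:, 3] == 0)       # easy slice: target reduces to XOR(x1, x2)
    all_idx = np.arange(len(X))

    def run(stages):
        clf = MLPClassifier(hidden_layer_sizes=(16,), random_state=0)
        for idx in stages:                        # each partial_fit call is one pass over that slice
            for _ in range(200):
                clf.partial_fit(X[idx], y[idx], classes=[0, 1])
        return clf.score(X_test, y_test)

    print("curriculum   :", run([np.where(easy)[0], all_idx]))
    print("no curriculum:", run([all_idx, all_idx]))   # same number of stages and updates
    # Outcomes vary with seed and width; the point is a harness you and the LLM can probe,
    # not evidence for or against the paper's claim.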
For Test-Time Scaling paper:
"In a simple error model for best-of-N sampling:
1. Prove a monotonicity bound (accuracy improves with N)
2. Derive a concentration inequality
3. What happens in the limit as N → ∞?
4. Verify the constants in a 2-class, 3-sample regime"
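For item 1 of this prompt, the simplest error model is worth simulating first: assume each of N samples is independently correct with probability p and an oracle verifier picks a correct answer whenever one exists, so accuracy is 1 − (1 − p)^N. A short sketch under that assumption:

    # Python sketch: best-of-N accuracy under an oracle-verifier error model.
    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.3                                        # per-sample success probability
    for N in [1, 2, 4, 8, 16]:
        trials = rng.random((100_000, N)) < p      # correctness of each sample
        empirical = trials.any(axis=1).mean()      # best-of-N succeeds if any sample is correct
        exact = 1 - (1 - p) ** N
        print(f"N={N:2d}  empirical={empirical:.3f}  exact={exact:.3f}")
    # Monotonicity in N is immediate in this model; the interesting work is relaxing the
    # oracle-verifier assumption (e.g., a noisy reward model) and proving what survives.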
For IMO Problem 6:
"Work through the IMO tiling problem step by step:
1. Solve by hand for n=3 and n=4 (verify the formula ⌈n + 2√n - 3⌉ gives 4 and 5)
2. Have the LLM explain the construction for n=9 (perfect square case)
3. Ask it to prove why the construction is valid
4. Test the Erdős-Szekeres lower bound argument on a simple example"
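To support items 1–3 of this prompt, a small checker helps you verify any construction the LLM (or you) proposes. This is a sketch; the tile format (top, left, bottom, right) with inclusive, 0-indexed coordinates is my own convention, not anything from the problem statement:

    # Python sketch: verify a proposed tiling of an n x n grid.
    def check_tiling(n, tiles):
        cover = [[0] * n for _ in range(n)]
        for (r1, c1, r2, c2) in tiles:             # inclusive corners, 0-indexed
            for r in range(r1, r2 + 1):
                for c in range(c1, c2 + 1):
                    cover[r][c] += 1
        if any(cover[r][c] > 1 for r in range(n) for c in range(n)):
            return False                            # tiles overlap
        rows_ok = all(sum(1 for c in range(n) if cover[r][c] == 0) == 1 for r in range(n))
        cols_ok = all(sum(1 for r in range(n) if cover[r][c] == 0) == 1 for c in range(n))
        return rows_ok and cols_ok

    # n = 2: two 1x1 tiles leaving (0,0) and (1,1) uncovered.
    print(check_tiling(2, [(0, 1, 0, 1), (1, 0, 1, 0)]))   # True, matching the formula's value of 2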
For Open Problem (Track D):
"Apply verification where possible to the data selection problem:
1. Verify: does the LLM correctly state the problem setup?
2. Check trivial bounds: what happens with n=2 vs n=∞?
3. Test a toy case: mean estimation with [1,2,10], budget n=2
4. If it cites existing work, verify those citations are real
5. For any proposed bound, check it on your toy example
Document what you CAN vs CANNOT verify without a solution key."
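For item 3 of this prompt, brute force is feasible. A sketch: squared error against the full-sample mean is one natural yardstick here (an assumption on my part; take the exact objective from the paper's problem statement):

    # Python sketch: exhaustive data selection for mean estimation on [1, 2, 10], budget n = 2.
    from itertools import combinations

    data = [1, 2, 10]
    full_mean = sum(data) / len(data)              # 13/3 ≈ 4.33
    for subset in combinations(data, 2):
        est = sum(subset) / 2
        print(subset, "estimate:", est, "squared error:", (est - full_mean) ** 2)
    # Best budget-2 subset here is (1, 10) -> estimate 5.5; use this tiny case to
    # sanity-check any selection rule or bound the LLM proposes before trusting it.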
Verification Log: As you work, document:
Have the LLM act as a skeptical reviewer:
"You are a skeptical reviewer for a top ML conference. Here's a proof that
claims [paste main claim + proof sketch]. Your job:
1. Find potential gaps or errors in the reasoning
2. Identify unstated assumptions that might not hold
3. Suggest counterexamples if the conditions are weakened
4. What ablations would make this result stronger?"
Meta-Question: Did it identify real issues or invent fake ones? How do you tell the difference?
Your verification log is the most valuable output from this stage. Bring it to the synthesis discussion — we'll compare what different teams discovered.
Using LLMs for research ideation
Test the LLM's ability to help with research extensions and new directions.
"This paper proves [state main result]. Can you suggest:
1. Three ways to generalize this result (different settings, relaxed assumptions, etc.)
2. For each, explain what would need to change in the proof
3. Which generalization seems most tractable and why?
4. What would be the most impactful extension?"
"This result about [main phenomenon]...
1. What other areas of ML/math have similar phenomena?
2. Could techniques from those areas apply here?
3. What practical implications does this have?
4. What follow-up questions does this raise?"
Note: LLMs are good at brainstorming connections but verify any specific claims about other papers.
Share ideas: Which extensions seem most promising? Did the LLM suggest anything genuinely novel or just obvious next steps? Which connections were insightful vs superficial?
Share with your partner, then with the larger group: