feat(MachineLearning/PAC): PAC Learning Definitions and Sample Complexity Lower Bounds #483
Conversation
Introduce the core PAC learning model: ConceptClass, LabeledSample, Learner, hypothesisError, sampleOf, and seenElements. Adds module index entry and bibliography references for [EHKV1989] and [Valiant1984].
Define SetShatters and vcDim for concept classes. Bridge to Mathlib's Finset.Shatters via finsetShatters_iff_setShatters.
Reusable lemmas for the EHKV proof: Bernoulli's inequality, product measure support, seenElements measurability, and sample agreement.
Combinatorial core of the EHKV lower bound: an involution on concepts pairs each with its complement on unseen points, showing at least half the concepts force large error on any bad sample.
Construct the discrete probability measure on the shattered set used in the EHKV proof: heavy mass on one point, uniform on the rest.
Assemble the full EHKV proof: Markov bound on bad samples, involution pairing, and adversarial measure yield the lower bound m ≥ (VCdim(C) - 1) / (32ε) for any (ε, δ)-learner.
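For orientation, the assembled result has roughly the following shape (a sketch, not the PR's exact statement: `IsPACLearner` and the side conditions are illustrative names, while `ConceptClass`, `Learner`, and `vcDim` are the definitions this PR introduces):

```lean
open MeasureTheory

-- Sketch of the EHKV lower bound's shape. `IsPACLearner` and the exact
-- hypotheses are hypothetical; only `ConceptClass`, `Learner`, and
-- `vcDim` come from the PR itself.
theorem ehkv_sample_lower_bound {α : Type*} [MeasurableSpace α]
    (C : ConceptClass α) {ε δ : ℝ} (hε : 0 < ε) (hδ : 0 < δ)
    {m : ℕ} (L : Learner α m) (hL : IsPACLearner C L ε δ) :
    ((vcDim C : ℝ) - 1) / (32 * ε) ≤ (m : ℝ) := by
  sorry
```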
If this is too much to review in one pass, I am happy to make a sequence of PRs to build it up. I personally think reviewing commit by commit works, but I am happy to oblige whatever folks would prefer. I am making a pass of proof golfing right now; I hope some of these lemmas can be simplified at least somewhat.
- Replace verbose `show ... from by rw [...]` with direct `← lemma` rewrites
- Use `tauto` to close symmetric case analysis in `hypothesisError_eq_of_inter_eq`
- Simplify `sampleOf_eq_of_agree` to a one-line `simp`
- Shorten the `SetShatters.subset` proof using `▸` and `id`
- Replace `Pi.one_apply` for indicator simplification in `adversarialMeasure_singleton`
- Use dot notation `.le`/`.trans` over `le_of_lt`/`le_trans`
Shreyas4991
left a comment
Initial batch of review comments.
> ## References
>
> * [L. G. Valiant, *A Theory of the Learnable*][Valiant1984]
> * [A. Ehrenfeucht, D. Haussler, M. Kearns, L. Valiant,
Textbook citations are better. Perhaps the Kearns and Vazirani book?
I think, given the entire formalization is around these papers, that citing these papers is highly appropriate.
How about we add both?
that's a better idea. Do recall that as this is a library, we want to use relatively standard definitions and give people a textbook they can consult for them. For example mathlib probability theory cites Kallenberg's textbook
    abbrev ConceptClass (α : Type*) := Set (Set α)

    /-- A *labeled sample* of size `m` over domain `α` is a sequence of `(point, label)` pairs. -/
    abbrev LabeledSample (α : Type*) (m : ℕ) := Fin m → (α × Bool)
You don't need a vector here. Labelled samples are traditionally defined as sets. Their order is only important in online learning. Even in that case, lists are better.
The indexed-function representation `Fin m → α` is precisely what makes the product measure construction work in Mathlib. Switching to `Set` or `List` would:
- Lose the fixed-size guarantee (`m` is the sample size parameter)
- Require a much more complex measure-theoretic setup (measures on `List` or `Finset`)
- Not play well with `Measure.pi` at all
I might not have as much experience as you with these APIs, so if you could share an alternative definition which works more cleanly I am happy to integrate it.
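Concretely, the `Fin m → α` representation makes the i.i.d. sample distribution a one-liner via Mathlib's finite product measure (a sketch; `sampleDistribution` is an illustrative name, not part of the PR):

```lean
open MeasureTheory

/-- The distribution of `m` i.i.d. draws from `P`, built with Mathlib's
finite product measure `Measure.pi`. Illustrative name; this is the
construction the `Fin m → α` representation buys for free. -/
noncomputable def sampleDistribution {α : Type*} [MeasurableSpace α]
    (P : Measure α) (m : ℕ) : Measure (Fin m → α) :=
  Measure.pi fun _ => P
```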
Lists would be better than Sets. You can use https://leanprover-community.github.io/mathlib4_docs/find/?pattern=MeasureTheory.Measure.tprod#doc
`tprod` constructs a product measure over a `List` of measurable spaces, but I think it ends up being uglier than the product construction here.
    Each point is labeled `true` if it belongs to the concept and `false` otherwise. -/
    noncomputable def sampleOf {α : Type*} {m : ℕ} (c : Set α) (xs : Fin m → α) :
        LabeledSample α m :=
      fun i => (xs i, decide (xs i ∈ c))
If you still want to use `FinVec`s (which I think are not the best idea here), you should use `FinVec.map`. If you switch to lists, then `List.map`.
It would just add a `FinVec.map` invocation at the beginning. I think defining this as a function is cleaner. Again, please share a code snippet if you have a cleaner approach.
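For comparison, a behaviorally identical phrasing that factors out the labeling map (a sketch; `sampleOf'` is an illustrative name, and `FinVec.map` would unfold to the same function):

```lean
-- Sketch: same function as `sampleOf`, written as a composition so the
-- per-point labeling map is explicit. `sampleOf'` is a hypothetical name.
noncomputable def sampleOf' {α : Type*} {m : ℕ} (c : Set α) (xs : Fin m → α) :
    LabeledSample α m :=
  (fun x => (x, decide (x ∈ c))) ∘ xs
```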
> It would just add a `FinVec.map` invocation

Yes, here are API lemmas for `FinVec.map`.
Another point: I think you are using classical PAC learning as the basis. In the interest of generality, one should use agnostic PAC and specialize it to classical PAC (where the target concept comes from the concept class) and to noisy PAC.
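For reference, an agnostic-style error could be phrased along these lines (a hedged sketch; `agnosticError` and the exact formulation are illustrative, not part of the PR):

```lean
open MeasureTheory
open scoped ENNReal

/-- Illustrative sketch: in the agnostic setting the distribution `D` lives
on labeled points `α × Bool`, and the error of a hypothesis `h` is the mass
of disagreement. Name and formulation are hypothetical. -/
noncomputable def agnosticError {α : Type*} [MeasurableSpace α]
    (D : Measure (α × Bool)) (h : Set α) : ℝ≥0∞ :=
  D {p | ¬(p.1 ∈ h ↔ p.2 = true)}
```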
> Yes, here are API lemmas for `FinVec.map`

What do you mean?

> In the interests of generality one must use agnostic PAC and specialize it to classical PAC

Is this really necessary for a first PR on this subject? Do we need to do the most general thing now, or can we prove things about the simpler definitions to start and generalize later? Given that the general definitions should provably reduce to these in the specialized case, it should be simple to do later on.
CSLib inherits the mathlib tradition of "let's define it in the greatest possible generality". This allows the greatest possible extent of reuse of the same definition in many places. So generalisation where possible is very important.
I believe we shouldn't preclude the more general definitions, but we don't need to include them right away. For me, a comparable instance is Lambda Calculus. We could have defined a Pure Type System right away and done things in full generality, but that isn't how it went AFAICT.
That's a nice idea in general, but in Lean the library-building philosophy is to build a unified library. This has a long history in the Lean community at this point, and is a deeply entrenched principle for good reasons. So it is not for me to debate; you could raise it on Zulip if you like. This particular definition is not maximally general and will result in unnecessary API duplication.
    /-- The *error* of a hypothesis `h` with respect to a target concept `c` under distribution `P`,
    defined as the measure of their symmetric difference `h ∆ c`. -/
    noncomputable def hypothesisError {α : Type*} [MeasurableSpace α] (P : Measure α)
You want to differentiate between positive and negative errors (false positives and false negatives resp), and then define error as the combination of them. This will matter in some problems. Even textbook exercises.
This is a great idea, I will do this.
Done in the last commit.
…decomposition

Add `falsePositiveError` (`P (h \ c)`) and `falseNegativeError` (`P (c \ h)`) definitions, and prove `hypothesisError_eq_add`, showing the total error decomposes as their sum via the disjoint union `h ∆ c = (h \ c) ∪ (c \ h)`.
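The commit corresponds to roughly the following (a sketch reconstructed from the commit message; only the names `falsePositiveError`, `falseNegativeError`, and `hypothesisError_eq_add` come from the PR):

```lean
open MeasureTheory
open scoped ENNReal

variable {α : Type*} [MeasurableSpace α]

/-- False-positive mass: points `h` labels positive but `c` does not. -/
noncomputable def falsePositiveError (P : Measure α) (h c : Set α) : ℝ≥0∞ :=
  P (h \ c)

/-- False-negative mass: points `c` labels positive but `h` does not. -/
noncomputable def falseNegativeError (P : Measure α) (h c : Set α) : ℝ≥0∞ :=
  P (c \ h)
```

Since `h \ c` and `c \ h` are disjoint and union to the symmetric difference, `measure_union` yields the decomposition of the total error into these two terms.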
This PR defines the Probably Approximately Correct (PAC) learning model. It then formalizes the information-theoretic lower bound on sample complexity for PAC learning established in "A general lower bound on the number of examples needed for learning" (Andrzej Ehrenfeucht, David Haussler, Michael Kearns, Leslie Valiant, 1989).

The PR is organized so that it can be reviewed commit by commit.