
feat(MachineLearning/PAC): PAC Learning Definitions and Sample Complexity Lower Bounds #483

Open
SamuelSchlesinger wants to merge 8 commits into leanprover:main from SamuelSchlesinger:feat/pac-learning-sample-complexity

Conversation

@SamuelSchlesinger

This PR defines the Probably Approximately Correct (PAC) learning model. It then formalizes the information-theoretic lower bound on sample complexity for PAC learning established in "A general lower bound on the number of examples needed for learning" (Andrzej Ehrenfeucht, David Haussler, Michael Kearns, Leslie Valiant, 1989).

Organized the PR such that it can be reviewed commit by commit.

- Introduce the core PAC learning model: `ConceptClass`, `LabeledSample`, `Learner`, `hypothesisError`, `sampleOf`, and `seenElements`. Adds a module index entry and bibliography references for [EHKV1989] and [Valiant1984].
- Define `SetShatters` and `vcDim` for concept classes. Bridge to Mathlib's `Finset.Shatters` via `finsetShatters_iff_setShatters`.
- Reusable lemmas for the EHKV proof: Bernoulli's inequality, product measure support, `seenElements` measurability, and sample agreement.
- Combinatorial core of the EHKV lower bound: an involution on concepts pairs each with its complement on unseen points, showing at least half the concepts force large error on any bad sample.
- Construct the discrete probability measure on the shattered set used in the EHKV proof: heavy mass on one point, uniform on the rest.
- Assemble the full EHKV proof: Markov bound on bad samples, involution pairing, and adversarial measure yield the lower bound `m ≥ (VCdim(C) - 1) / (32ε)` for any `(ε, δ)`-learner.
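For orientation, the assembled result plausibly takes a shape like the following. This is a sketch only: `Learner`, `vcDim`, and `ConceptClass` come from the commit descriptions above, but `IsPACLearner`, the exact hypotheses, and the constraints on `ε` and `δ` are illustrative placeholders following the EHKV paper, not the PR's literal statement.

```lean
-- Sketch, not the PR's literal statement. `IsPACLearner` and the exact
-- side conditions on ε and δ are illustrative placeholders.
theorem sampleComplexity_lowerBound
    {α : Type*} [MeasurableSpace α] (C : ConceptClass α)
    {m : ℕ} (L : Learner α m) {ε δ : ℝ}
    (hε : 0 < ε ∧ ε ≤ 1 / 8) (hδ : 0 < δ ∧ δ ≤ 1 / 100)
    (hL : IsPACLearner C L ε δ) :
    (vcDim C - 1 : ℝ) / (32 * ε) ≤ m := by
  sorry
```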
@SamuelSchlesinger (Author)

If this is too much to review in one pass, I am happy to make a sequence of PRs to build it up. Personally, I think reviewing commit by commit works, but I am happy to oblige whatever folks prefer.

Making a pass of proof golfing right now; I hope some of these lemmas can be simplified at least somewhat.

- Replace verbose `show ... from by rw [...]` with direct `← lemma` rewrites
- Use `tauto` to close symmetric case analysis in `hypothesisError_eq_of_inter_eq`
- Simplify `sampleOf_eq_of_agree` to a one-line `simp`
- Shorten the `SetShatters.subset` proof using `▸` and `id`
- Replace `Pi.one_apply` for indicator simplification in `adversarialMeasure_singleton`
- Use dot notation `.le`/`.trans` over `le_of_lt`/`le_trans`
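As a toy illustration of the first item (the statements and hypothesis here are made up for illustration, not taken from the PR):

```lean
-- Before: wrapping the rewrite in `show ... from by ...`.
example (a b : ℕ) (h : a = b) : b + 0 = a := by
  exact show b + 0 = a from by rw [Nat.add_zero, h]

-- After: rewrite directly, orienting the hypothesis with `←`.
example (a b : ℕ) (h : a = b) : b + 0 = a := by
  rw [Nat.add_zero, ← h]
```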
@Shreyas4991 (Contributor) left a comment:

Initial batch of review comments.

## References

* [L. G. Valiant, *A Theory of the Learnable*][Valiant1984]
* [A. Ehrenfeucht, D. Haussler, M. Kearns, L. Valiant,
@Shreyas4991 (Contributor):

Textbook citations are better. Perhaps the Kearns and Vazirani book?

@SamuelSchlesinger (Author):

I think, given the entire formalization is around these papers, that citing these papers is highly appropriate.

@SamuelSchlesinger (Author):

How about we add both?

@Shreyas4991 (Contributor):

That's a better idea. Do recall that, as this is a library, we want to use relatively standard definitions and give people a textbook they can consult for them. For example, Mathlib's probability theory cites Kallenberg's textbook.

```lean
abbrev ConceptClass (α : Type*) := Set (Set α)

/-- A *labeled sample* of size `m` over domain `α` is a sequence of `(point, label)` pairs. -/
abbrev LabeledSample (α : Type*) (m : ℕ) := Fin m → (α × Bool)
```
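As a concrete instance of these definitions (illustrative only, not part of the PR), half-lines on ℝ form a concept class, and a sample is just a finite vector of labeled points:

```lean
-- Illustrative only: half-lines `[a, ∞)` as a concept class over ℝ.
def thresholds : ConceptClass ℝ := {s | ∃ a : ℝ, s = Set.Ici a}

-- A labeled sample of size 2 (using Mathlib's `![...]` vector notation).
example : LabeledSample ℝ 2 := ![(0, false), (1, true)]
```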
@Shreyas4991 (Contributor):

You don't need a vector here. Labelled samples are traditionally defined as sets. Their order is only important in online learning. Even in that case, lists are better.

@SamuelSchlesinger (Author):

The indexed-function representation `Fin m → α` is precisely what makes the product measure construction work in Mathlib. Switching to `Set` or `List` would:

- Lose the fixed-size guarantee (`m` is the sample size parameter)
- Require a much more complex measure-theoretic setup (measures on `List` or `Finset`)
- Not play well with `Measure.pi` at all

I might not have as much experience as you with these APIs, so if you could share an alternative definition which works more cleanly I am happy to integrate it.
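For what it's worth, Mathlib's `Measure.pi` does apply directly to `Fin m → α`, which is the point being made here. A sketch (`sampleMeasure` is a hypothetical name, not from the PR):

```lean
open MeasureTheory in
/-- Hypothetical sketch: the `m`-fold product of a distribution `P`,
giving the i.i.d. sample measure on `Fin m → α` via `Measure.pi`. -/
noncomputable def sampleMeasure {α : Type*} [MeasurableSpace α]
    (P : Measure α) (m : ℕ) : Measure (Fin m → α) :=
  Measure.pi fun _ : Fin m => P
```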

@Shreyas4991 (Contributor), Apr 13, 2026:

@SamuelSchlesinger (Author):

`tprod` constructs a product measure over a `List` of measurable spaces; I think it ends up being uglier than the product construction here.

```lean
Each point is labeled `true` if it belongs to the concept and `false` otherwise. -/
noncomputable def sampleOf {α : Type*} {m : ℕ} (c : Set α) (xs : Fin m → α) :
    LabeledSample α m :=
  fun i => (xs i, decide (xs i ∈ c))
```
@Shreyas4991 (Contributor):

If you still want to use `FinVec`s (which I think are not the best idea here), you should use `FinVec.map`. If you switch to lists, then `List.map`.

@SamuelSchlesinger (Author):

It would just add a `FinVec.map` invocation at the beginning. I think defining this as a function is cleaner. Again, please share a code snippet if you have a cleaner approach.

@Shreyas4991 (Contributor):

> It would just add a FinVec.map invocation

Yes, here are API lemmas for `FinVec.map`.

Another point: I think you are using classical PAC learning as the basis. In the interests of generality one must use agnostic PAC and specialize it to classical PAC (where the target concept comes from the concept class) and noisy PAC.

@SamuelSchlesinger (Author):

> Yes, here are API lemmas for FinVec.map

What do you mean?

> In the interests of generality one must use agnostic PAC and specialize it to classical PAC

Is this really necessary for a first PR on this subject, to do the most general thing, or can we prove things about the simpler definitions to start and then generalize later? Given that they should provably reduce to these definitions in their specialized case, it should be simple to do later on.

@Shreyas4991 (Contributor), Apr 13, 2026:

CSLib inherits the Mathlib tradition of "let's define it in the greatest possible generality". This allows the greatest possible reuse of the same definition in many places, so generalisation where possible is very important.

@SamuelSchlesinger (Author):

I believe we shouldn't preclude the more general definitions, but we don't need to include them right away. For me, a comparable instance is Lambda Calculus. We could have defined a Pure Type System right away and done things in full generality, but that isn't how it went AFAICT.

@Shreyas4991 (Contributor), Apr 14, 2026:

That's a nice idea in general, but in Lean the library-building philosophy is to build a unified library. This has a long history in the Lean community at this point, and is a deeply entrenched principle for good reasons, so it is not for me to debate; you could raise it on Zulip if you like. This particular definition is not maximally general and will result in unnecessary API duplication.


```lean
/-- The *error* of a hypothesis `h` with respect to a target concept `c` under distribution `P`,
defined as the measure of their symmetric difference `h ∆ c`. -/
noncomputable def hypothesisError {α : Type*} [MeasurableSpace α] (P : Measure α)
```
@Shreyas4991 (Contributor):

You want to differentiate between positive and negative errors (false positives and false negatives, respectively), and then define error as their combination. This will matter in some problems, even textbook exercises.

@SamuelSchlesinger (Author):

This is a great idea, I will do this.

@SamuelSchlesinger (Author):

Done in the last commit.

…decomposition

Add `falsePositiveError` (P(h \ c)) and `falseNegativeError` (P(c \ h))
definitions, and prove `hypothesisError_eq_add` showing total error
decomposes as their sum via the disjoint union h ∆ c = (h \ c) ∪ (c \ h).
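The definitions this commit describes plausibly look like the following. This is a sketch: the names mirror the commit message, but the PR's exact signatures may differ.

```lean
open MeasureTheory
open scoped ENNReal

-- Sketch only; signatures are guesses based on the commit message.
noncomputable def falsePositiveError {α : Type*} [MeasurableSpace α]
    (P : Measure α) (h c : Set α) : ℝ≥0∞ :=
  P (h \ c)  -- mass of points h claims are in the concept but are not

noncomputable def falseNegativeError {α : Type*} [MeasurableSpace α]
    (P : Measure α) (h c : Set α) : ℝ≥0∞ :=
  P (c \ h)  -- mass of concept points that h misses

-- Since `h \ c` and `c \ h` are disjoint with union `h ∆ c`,
-- `measure_union` gives the stated `hypothesisError_eq_add`.
```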
@SamuelSchlesinger changed the title from "Feat(MachineLearning/PAC): PAC Learning Definitions and Sample Complexity Lower Bounds" to "feat(MachineLearning/PAC): PAC Learning Definitions and Sample Complexity Lower Bounds" on Apr 13, 2026