feat(MachineLearning/PAC): PAC Learning Definitions and Sample Complexity Lower Bounds #483
Conversation
Introduce the core PAC learning model: ConceptClass, LabeledSample, Learner, hypothesisError, sampleOf, and seenElements. Adds module index entry and bibliography references for [EHKV1989] and [Valiant1984].
Define SetShatters and vcDim for concept classes. Bridge to Mathlib's Finset.Shatters via finsetShatters_iff_setShatters.
Reusable lemmas for the EHKV proof: Bernoulli's inequality, product measure support, seenElements measurability, and sample agreement.
Combinatorial core of the EHKV lower bound: an involution on concepts pairs each with its complement on unseen points, showing at least half the concepts force large error on any bad sample.
Construct the discrete probability measure on the shattered set used in the EHKV proof: heavy mass on one point, uniform on the rest.
Assemble the full EHKV proof: Markov bound on bad samples, involution pairing, and adversarial measure yield the lower bound m ≥ (VCdim(C) - 1) / (32ε) for any (ε, δ)-learner.
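For orientation, the assembled result has roughly the following shape (a sketch, not the PR's exact statement: `IsPACLearner` and the side conditions are illustrative names, while `ConceptClass`, `Learner`, and `vcDim` are the definitions this PR introduces):

```lean
open MeasureTheory

-- Sketch of the EHKV lower bound's shape. `IsPACLearner` and the exact
-- hypotheses are hypothetical; only `ConceptClass`, `Learner`, and
-- `vcDim` come from the PR itself.
theorem ehkv_sample_lower_bound {α : Type*} [MeasurableSpace α]
    (C : ConceptClass α) {ε δ : ℝ} (hε : 0 < ε) (hδ : 0 < δ)
    {m : ℕ} (L : Learner α m) (hL : IsPACLearner C L ε δ) :
    ((vcDim C : ℝ) - 1) / (32 * ε) ≤ (m : ℝ) := by
  sorry
```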
If this is too much to review in one pass, I am happy to make a sequence of PRs to build it up. I personally think reviewing commit by commit works, but I am happy to oblige whatever folks would prefer. I am making a pass of proof golfing right now; I hope some of these lemmas can be simplified at least somewhat.
- Replace verbose `show ... from by rw [...]` with direct `← lemma` rewrites
- Use `tauto` to close symmetric case analysis in `hypothesisError_eq_of_inter_eq`
- Simplify `sampleOf_eq_of_agree` to a one-line `simp`
- Shorten the `SetShatters.subset` proof using `▸` and `id`
- Replace `Pi.one_apply` for indicator simplification in `adversarialMeasure_singleton`
- Use dot notation `.le`/`.trans` over `le_of_lt`/`le_trans`
Shreyas4991
left a comment
Initial batch of review comments.
> ## References
>
> * [L. G. Valiant, *A Theory of the Learnable*][Valiant1984]
> * [A. Ehrenfeucht, D. Haussler, M. Kearns, L. Valiant,
Textbook citations are better. Perhaps the Kearns and Vazirani book?
I think, given the entire formalization is around these papers, that citing these papers is highly appropriate.
How about we add both?
that's a better idea. Do recall that as this is a library, we want to use relatively standard definitions and give people a textbook they can consult for them. For example mathlib probability theory cites Kallenberg's textbook
    abbrev ConceptClass (α : Type*) := Set (Set α)

    /-- A *labeled sample* of size `m` over domain `α` is a sequence of `(point, label)` pairs. -/
    abbrev LabeledSample (α : Type*) (m : ℕ) := Fin m → (α × Bool)
You don't need a vector here. Labelled samples are traditionally defined as sets. Their order is only important in online learning. Even in that case, lists are better.
The indexed-function representation `Fin m → α` is precisely what makes the product measure construction work in Mathlib. Switching to `Set` or `List` would:
- Lose the fixed-size guarantee (`m` is the sample size parameter)
- Require a much more complex measure-theoretic setup (measures on `List` or `Finset`)
- Not play well with `Measure.pi` at all
I might not have as much experience as you with these APIs, so if you could share an alternative definition which works more cleanly I am happy to integrate it.
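Concretely, the `Fin m → α` representation makes the i.i.d. sample distribution a one-liner via Mathlib's finite product measure (a sketch; `sampleDistribution` is an illustrative name, not part of the PR):

```lean
open MeasureTheory

/-- The distribution of `m` i.i.d. draws from `P`, built with Mathlib's
finite product measure `Measure.pi`. Illustrative name; this is the
construction the `Fin m → α` representation buys for free. -/
noncomputable def sampleDistribution {α : Type*} [MeasurableSpace α]
    (P : Measure α) (m : ℕ) : Measure (Fin m → α) :=
  Measure.pi fun _ => P
```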
Lists would be better than Sets. You can use https://leanprover-community.github.io/mathlib4_docs/find/?pattern=MeasureTheory.Measure.tprod#doc
`tprod` constructs a product measure over a `List` of measurable spaces, but I think it ends up being uglier than the product construction here.
    Each point is labeled `true` if it belongs to the concept and `false` otherwise. -/
    noncomputable def sampleOf {α : Type*} {m : ℕ} (c : Set α) (xs : Fin m → α) :
        LabeledSample α m :=
      fun i => (xs i, decide (xs i ∈ c))
If you still want to use `FinVec`s (which I think are not the best idea here), you should use `FinVec.map`. If you switch to lists, then `List.map`.
It would just add a `FinVec.map` invocation at the beginning. I think defining this as a function is cleaner. Again, please share a code snippet if you have a cleaner approach.
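For comparison, a behaviorally identical phrasing that factors out the labeling map (a sketch; `sampleOf'` is an illustrative name, and `FinVec.map` would unfold to the same function):

```lean
-- Sketch: same function as `sampleOf`, written as a composition so the
-- per-point labeling map is explicit. `sampleOf'` is a hypothetical name.
noncomputable def sampleOf' {α : Type*} {m : ℕ} (c : Set α) (xs : Fin m → α) :
    LabeledSample α m :=
  (fun x => (x, decide (x ∈ c))) ∘ xs
```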
> It would just add a `FinVec.map` invocation

Yes, here are API lemmas for `FinVec.map`.
Another point: I think you are using classical PAC learning as the basis. In the interest of generality, one should use agnostic PAC and specialize it to classical PAC (where the target concept comes from the concept class) and to noisy PAC.
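For reference, an agnostic-style error could be phrased along these lines (a hedged sketch; `agnosticError` and the exact formulation are illustrative, not part of the PR):

```lean
open MeasureTheory
open scoped ENNReal

/-- Illustrative sketch: in the agnostic setting the distribution `D` lives
on labeled points `α × Bool`, and the error of a hypothesis `h` is the mass
of disagreement. Name and formulation are hypothetical. -/
noncomputable def agnosticError {α : Type*} [MeasurableSpace α]
    (D : Measure (α × Bool)) (h : Set α) : ℝ≥0∞ :=
  D {p | ¬(p.1 ∈ h ↔ p.2 = true)}
```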
> Yes, here are API lemmas for `FinVec.map`

What do you mean?

> In the interests of generality one must use agnostic PAC and specialize it to classical PAC

Is this really necessary for a first PR on this subject? Do we need to do the most general thing now, or can we prove things about the simpler definitions to start and generalize later? Given that the general definitions should provably reduce to these in the specialized case, it should be simple to do later on.
CSLib inherits the mathlib tradition of "let's define it in the greatest possible generality". This allows the greatest possible extent of reuse of the same definition in many places. So generalisation where possible is very important.
I believe we shouldn't preclude the more general definitions, but we don't need to include them right away. For me, a comparable instance is Lambda Calculus. We could have defined a Pure Type System right away and done things in full generality, but that isn't how it went AFAICT.
That's a nice idea in general, but in Lean the library-building philosophy is to build a unified library. This has a long history in the Lean community at this point, and is a deeply entrenched principle for good reasons. So it is not for me to debate; you could raise it on Zulip if you like. This particular definition is not maximally general and will result in unnecessary API duplication.
    /-- The *error* of a hypothesis `h` with respect to a target concept `c` under distribution `P`,
    defined as the measure of their symmetric difference `h ∆ c`. -/
    noncomputable def hypothesisError {α : Type*} [MeasurableSpace α] (P : Measure α)
You want to differentiate between positive and negative errors (false positives and false negatives resp), and then define error as the combination of them. This will matter in some problems. Even textbook exercises.
This is a great idea, I will do this.
Done in the last commit.
…decomposition

Add `falsePositiveError` (`P (h \ c)`) and `falseNegativeError` (`P (c \ h)`) definitions, and prove `hypothesisError_eq_add`, showing the total error decomposes as their sum via the disjoint union `h ∆ c = (h \ c) ∪ (c \ h)`.
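The commit corresponds to roughly the following (a sketch reconstructed from the commit message; only the names `falsePositiveError`, `falseNegativeError`, and `hypothesisError_eq_add` come from the PR):

```lean
open MeasureTheory
open scoped ENNReal

variable {α : Type*} [MeasurableSpace α]

/-- False-positive mass: points `h` labels positive but `c` does not. -/
noncomputable def falsePositiveError (P : Measure α) (h c : Set α) : ℝ≥0∞ :=
  P (h \ c)

/-- False-negative mass: points `c` labels positive but `h` does not. -/
noncomputable def falseNegativeError (P : Measure α) (h c : Set α) : ℝ≥0∞ :=
  P (c \ h)
```

Since `h \ c` and `c \ h` are disjoint and union to the symmetric difference, `measure_union` yields the decomposition of the total error into these two terms.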
This PR defines the Probably Approximately Correct (PAC) learning model. It then formalizes the information-theoretic lower bound on sample complexity for PAC learning established in "A general lower bound on the number of examples needed for learning" (Andrzej Ehrenfeucht, David Haussler, Michael Kearns, Leslie Valiant, 1989).

The PR is organized so that it can be reviewed commit by commit.