Selected topics in combinatorics revolving around Vapnik-Chervonenkis dimension
Spring 2023
Lecturer: Szymon Toruńczyk
What is this lecture about?
A big portion of the lectures will be concerned with the Vapnik-Chervonenkis dimension. This fundamental notion has numerous applications in computational learning theory, combinatorics, logic, computational geometry, and theoretical computer science. We will develop the foundations of the theory and study its applications.
Outline
- Set systems, Vapnik-Chervonenkis dimension, Sauer-Shelah-Perles lemma
- Vapnik-Chervonenkis theorem, fundamental theorem of PAC learning, ε-nets
- Helly-type properties, (p,q)-theorem
- Sample compression schemes
- Haussler's packing theorem
- Zarankiewicz's problem, Kovari-Sos-Turan theorem, incidence geometry, Szemerédi-Trotter theorem, Kovari-Sos-Turan for bounded VC-dimension
- Matchings with low crossing number
- Geometric range searching
- Szemeredi regularity
- Regularity theorem for graphs of bounded VC-dimension
- Regularity theorem for stable graphs
- Permutation classes.
Literature
- Lectures on Discrete Geometry, Jiri Matousek
- Geometric Discrepancy, Jiri Matousek
- Topics in Combinatorics, lecture notes, Artem Chernikov
- Geometric set systems, Jiri Matousek
- Geometric range searching, Jiri Matousek
- Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz and Shai Ben-David
Additional sources
- Proofs from the Book
- Combinatorial Geometry, Janos Pach, Pankaj Agarwal
- Lecture notes on Extremal graph theory, David Conlon
- Computational Learning Theory, James Worrell
- Traces of finite sets: extremal problems and geometric applications, Furedi, Pach
- Turan-type theorems: The history of degenerate (bipartite) extremal graph problems, Zoltan Furedi and Miklos Simonovits
Specific topics
- Min-max theorem, Farkas' lemma, finite-dimensional Hahn-Banach: Matousek, Lectures on Discrete Geometry; also a post by Terence Tao.
- Szemeredi-Trotter: Matousek, Lectures on Discrete Geometry, Chap. 4.3; using cuttings and Kovari-Sos-Turan: Chap. 4.5.
- Cutting Lemma: Matousek, Lectures on Discrete Geometry, Chap. 4.6 (weaker version), 4.7 (stronger version).
VC-dimension
Set systems
Definition: A set system consists of a domain X (a set) and a family F⊆P(X) of subsets of X. For Y⊂X, write F∩Y for the set system with domain Y and family {F∩Y:F∈F}.
Remark A set system is the same as a hypergraph.
Examples:
- open intervals: domain R, family {(a,b):a,b∈R,a<b}
- rectangles: domain R2, family {(a,b)×(c,d):a,b,c,d∈R,a<b,c<d}
- discs: domain R2, family {Dr(p):p∈R2,r>0}
- convex sets in R2
- higher-dimensional analogues in Rd,
- neighborhoods in graphs
Usually X=⋃F, then we just mention F.
- Concepts in machine learning. Goal: given a labelled sample, find a reasonable hypothesis (concept). A labelled sample is the same as a set Y together with a set of the form F∩Y, for some concept F∈F. The goal is to "guess" the concept F, or something close to F.
- In the area of geometric range searching, we fix a set system of geometric shapes in Rd, like the ones in the examples above. The sets in F are called ranges. Given a finite set X⊂Rd, the task is to compute a data structure that allows us to quickly answer queries of the form: for a given range F∈F, what is #(F∩X)?
VC-dimension and shatter function
The goal is to define a measure of complexity of a set system. Machine learning intuition: understand which concept classes are learnable. When does Ockham's razor work well?
Definition (VC-dimension) [Vapnik, Chervonenkis, 71]
Let F be a set system with domain X. A set Y⊆X is shattered by F if F∩Y=2^Y.
The VC-dimension of F is the maximum size d of a set Y⊆X shattered by F, or ∞
if arbitrarily large sets are shattered.
Definition (Shatter function).
Let F be a set system on X.
Let πF(m) := max { ∣F∩Y∣ : Y⊆X, ∣Y∣≤m }.
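These definitions are easy to check by brute force on small systems. The following Python sketch (the function names are ours, for illustration only) computes the VC-dimension and the shatter function of the system of intervals restricted to a small domain; it is exponential in the domain size, so it is purely a sanity-check tool.

```python
from itertools import combinations

def traces(family, Y):
    """All distinct traces F ∩ Y of sets F in the family on Y."""
    return {frozenset(F & Y) for F in family}

def is_shattered(family, Y):
    return len(traces(family, Y)) == 2 ** len(Y)

def vc_dimension(family, domain):
    """Largest size of a shattered subset of the domain (brute force)."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(family, set(Y)) for Y in combinations(domain, k)):
            d = k
        else:
            break
    return d

def shatter_function(family, domain, m):
    """pi_F(m): max over Y with |Y| <= m of the number of traces on Y."""
    return max(len(traces(family, set(Y)))
               for k in range(m + 1)
               for Y in combinations(domain, k))

# Intervals restricted to the domain {0,...,5}: VC-dimension 2.
domain = list(range(6))
intervals = [set(range(a, b)) for a in range(6) for b in range(a, 7)]
print(vc_dimension(intervals, domain), shatter_function(intervals, domain, 3))  # 2 7
```

On three points x<y<z an interval realizes every trace except {x,z}, which explains the value 7 = 2³ − 1 above.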
Examples
- half-spaces in Rd have VC-dim at most d+1 (see exercises)
- revisit examples above.
Observation If F has
infinite VC-dimension
then πF(m)=2m for all m.
Definition (Dual set system).
Let F be a set system with domain X.
The dual system F∗ is the system with domain F
and family {x∗:x∈X}, where x∗={F∈F:x∈F}⊂F.
Another view on set systems: bipartite graphs (L,R,E), E⊆L×R. (Multiplicities are ignored).
Dual: (R,L,E−1). In particular, (F∗)∗≅F,
up to removing twins (elements in the domain that belong to
exactly the same sets).
πF∗ is denoted π∗,
and is called the dual shatter function of F.
VC(F∗) is denoted VC∗(F),
and is called the dual VC-dimension of F.
VC(F∗) is the maximal number of sets
in F such that every cell in the Venn diagram
is occupied by some element of the domain.
πF∗(m) is the maximal number of
occupied cells in the Venn diagram of ≤m sets from F.
Example If F is the family of half-planes in R2, then πF∗(m) is the maximal number of regions
into which m half-planes can partition the plane.
For example, πF∗(2)=4 and πF∗(3)=7.
Exercise Prove that πF∗(m) = (m+1 choose 2) + 1.
Exercise If VC(F)≤d then VC∗(F) < 2^{d+1}.
In all examples, we have either a polynomial,
or an exponential shatter function – nothing in between.
Sauer-Shelah-Perles Lemma. [Sauer (72), Shelah and Perles (72), Vapnik-Chervonenkis (71)]
Let F be a set system on n elements of VC-dimension at most d.
Then ∣F∣ ≤ (n choose 0) + … + (n choose d) =: Φd(n).
It follows that for all m,
πF(m) ≤ Φd(m) ≤ (em/d)^d = O(m^d).
Note. This bound is tight. Consider the set system F of all ≤d-element subsets of an infinite set X. Clearly πF(m)=Φd(m).
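The tightness claim can be checked numerically; this small sketch (ours) counts the family of all subsets of size at most d and compares it with Φd(n).

```python
from itertools import combinations
from math import comb

def phi(d, n):
    """Phi_d(n) = C(n,0) + C(n,1) + ... + C(n,d)."""
    return sum(comb(n, i) for i in range(d + 1))

# The family of all subsets of size <= d of an n-element domain has
# VC-dimension d and exactly Phi_d(n) sets, matching the SSP bound.
n, d = 8, 3
family = [frozenset(A) for k in range(d + 1)
          for A in combinations(range(n), k)]
print(len(family), phi(d, n))  # both equal 1 + 8 + 28 + 56 = 93
```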
Proof.
Let X be the domain of the set system,
and pick any x∈X.
Denote F′:=F∩(X−{x}).
How much does F′ differ from F?
There is a natural mapping from F to F′, namely F↦F−{x}.
The preimage of a set G∈F′ under this mapping consists of exactly two sets
if both G∈F and G∪{x}∈F,
and otherwise of exactly one set.
Denote Mx := {G∈F ∣ x∉G, G∪{x}∈F}. It follows that
∣F∣ = ∣F′∣ + ∣Mx∣.
Note that both F′ and Mx are set systems with domain X−{x} of size n−1.
The key observation is that Mx has VC-dimension at most d−1.
Indeed, if A⊆X−{x} is shattered in Mx, then A∪{x} is shattered in F.
By induction on n and d we have that:
∣F∣ = ∣F′∣ + ∣Mx∣ ≤ Φd(n−1) + Φd−1(n−1).
Note that Φd(n−1) + Φd−1(n−1) ≤ Φd(n),
since we can surjectively map each subset A of [n] with ∣A∣≤d
to A itself (a subset of [n−1] of size ≤d) if n∉A,
and to A−{n} (a subset of [n−1] of size ≤d−1) if n∈A. ■
Exercise
For every element v of the domain, define a (partial) matching Mv on F by Mv := {{A,B} : A,B∈F, A△B={v}}. Show that if for every subset X of the domain there is some v∈X
such that the corresponding matching for the restricted family F∩X has size at most k, then ∣F∣ ≤ O(k⋅n).
Set systems in logical structures
Let M be a logical structure and ϕ(xˉ;yˉ) be a formula.
For a tuple bˉ∈Myˉ
denote
ϕ(M;bˉ):={aˉ∈Mxˉ:ϕ(aˉ,bˉ)}.
Consider the set system Fϕ
with domain Mxˉ and sets ϕ(M;bˉ), for bˉ∈Myˉ.
Examples
- (R,+,⋅,≤,0,1), formula
ϕ(x;y1,y2)≡y1≤x≤y2.
Then Fϕ is the set system of closed intervals on R.
- (R,+,⋅,≤,0,1), formula
ϕ(x1,x2;y1,y2,y3) ≡ (x1−y1)² + (x2−y2)² ≤ y3².
Then Fϕ is the set system of discs in R2.
- (R,+,⋅,≤,0,1)
ϕ(x1,x2;y1,y2,y3,y4)≡y1≤x1≤y2∧y3≤x2≤y4.
Fϕ is the set system of rectangles in R2.
- A graph G=(V,E) is seen as a logical structure
with a binary relation E, denoting adjacency.
A first-order formula is a formula using ∃,∀,∨,∧,¬, where the quantifiers range over elements of the domain.
Example ϕ(x,y) ≡ ∃z. y = x + z²
is equivalent in (R,+,⋅,≤,0,1) to x≤y.
So ≤ is definable in (R,+,⋅).
The relation y=ex is not definable in (R,+,⋅).
Example
The formula δ1(x,y)≡(x=y)∨E(x,y)
expresses that dist(x,y)≤1,
and the formula δ2(x,y)≡x=y∨E(x,y)∨∃z.E(x,z)∧E(z,y) expresses that dist(x,y)≤2,
and similarly we can define δr(x,y) expressing dist(x,y)≤r, for any fixed number r.
There is no first-order formula that expresses the property `x and y are at a finite distance' in all graphs.
This can be expressed by a formula of monadic second-order logic (MSO).
Theorem (Shelah, 71) Let M be a logical structure.
The following conditions are equivalent:
- every first-order formula ϕ(xˉ;yˉ) has bounded VC-dimension in M,
- every first-order formula ϕ(x;yˉ) has bounded VC-dimension in M (here x is a singleton).
If the above conditions hold, then M is called NIP.
Example (R,+,⋅,≤) is NIP.
We use the result of Tarski:
Theorem (Tarski, 40) The structure (R,+,⋅,≤,0,1) has quantifier elimination: every first-order formula ϕ(x1,…,xk) is equivalent to a formula without quantifiers.
Such formulas are just boolean combinations of sets of the form:
{(a1,…,ak)∈Rk : p(a1,…,ak) ≥ 0}
or
{(a1,…,ak)∈Rk : p(a1,…,ak) = 0},
where p(x1,…,xk) is a polynomial with integer coefficients, such as p(x1,x2) = x1² + x2² − 5.
Corollary (Tarski, 30)
For every first-order formula ϕ(x;y1,…,yk)
and parameters b1,…,bk∈R,
the set ϕ(R;b1,…,bk) ⊆ R is a union of at most kϕ intervals, where kϕ depends only on ϕ (and not on b1,…,bk).
Proof. We may assume that ϕ is a boolean combination of ℓ sets as above. Let d be the maximum degree of the polynomials. Each polynomial of degree d in one free variable defines a set that is a union of at most d+1 intervals.
A boolean combination of ℓ such sets is then a union of at most kϕ := O(ℓ⋅d) intervals. ■
Corollary The structure (R,⋅,+,≤,0,1) is NIP.
Proof.
The family of all sets that are unions of k intervals cannot shatter a set of 2k+1 points. The conclusion follows from the result of Shelah.
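The counting behind this proof can be illustrated with a tiny sketch (ours): on an increasing sequence of points, the trace of a union of intervals is determined by its maximal blocks of consecutive points, and the alternating pattern on 2k+1 points has k+1 such blocks.

```python
def intervals_needed(pattern):
    """Minimum number of intervals whose union realizes the given 0/1
    pattern on an increasing sequence of points: the number of maximal
    blocks of consecutive 1s."""
    blocks, prev = 0, 0
    for bit in pattern:
        if bit == 1 and prev == 0:
            blocks += 1
        prev = bit
    return blocks

# The alternating pattern on 2k+1 points needs k+1 intervals, so unions
# of k intervals cannot shatter any set of 2k+1 points.
k = 3
alternating = [1 - (i % 2) for i in range(2 * k + 1)]
print(intervals_needed(alternating))  # k + 1 = 4
```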
Definition A structure M with a total order < and other relations/functions is o-minimal if
for every first-order formula ϕ(x;y1,…,yk)
and parameters b1,…,bk, the set ϕ(M;b1,…,bk) is a union of finitely many intervals.
It is known that this implies that then it is a union of at most nϕ intervals, for some nϕ that depends only on ϕ.
Corollary Every o-minimal structure is NIP.
Examples of o-minimal structures
- (R,+,⋅,≤,0,1)
- (R,+,⋅,≤,ex,0,1) (Wilkie's theorem)
Other examples of NIP structures
- abelian groups
- Presburger arithmetic (Z,+,<).
Classes of structures
Let C be a class of logical structures.
For example, the class of all planar graphs.
A formula ϕ(xˉ;yˉ) defines
a set system FϕM in each structure M∈C.
We say that ϕ has VC-dimension ≤d in C
if each of the set systems FϕM has VC-dimension at most d.
The shatter function πϕ,C(m)
is defined as the maximum of πFϕM(m), over all M∈C.
A class C is NIP if every formula ϕ(xˉ;yˉ)
has bounded VC-dimension on C.
Example
Let C be a class of graphs and ϕ(x,y)≡E(x,y).
Then πE,C(m) is the following quantity:
the maximum, over all graphs G∈C and sets A⊆V(G) with ∣A∣≤m, of the number of distinct
neighborhoods in A of vertices v∈V(G).
If C is the class of bipartite graphs, then πE,C(m)=2^m.
If C is the class of planar graphs, then πE,C(m)=O(m).
Fact The class C of planar graphs is NIP.
[Podewski, Ziegler, 1978] Moreover, for every fixed first-order formula ϕ(x;y),
we have πϕ,C(m)=O(m).
This holds more generally for all classes C with bounded expansion [Pilipczuk, Siebertz, Toruńczyk, 2018]
and for classes of bounded twin-width [Przybyszewski, 2022].
Classes with bounded expansion include classes of bounded tree-width,
classes of graphs that can be embedded in a fixed surface,
classes of graphs with bounded maximum degree,
and classes that exclude a fixed minor.
They are contained in nowhere dense classes.
πϕ,C(m)=O(m^{1+ε}) holds for every nowhere dense class C and fixed ε>0 [PST, 2018].
VC-density
Definition (VC-density).
The VC-density of a set system F is the infimum
of all r∈R such that πF(m) = O(m^r).
Similarly, we define the VC-density of a formula ϕ(xˉ;yˉ) in a class C of structures.
Example Every formula ϕ(xˉ;y) (with y singleton) has VC-density 1 on the class of planar graphs, and more generally, on all nowhere dense classes.
Theorem (Assouad, 1983)
The VC-density of any set system F is either 0, or a real number ≥1.
For every real α≥1 there is a set system F with VC-density α.
We will construct a family with VC-density 3/2. See Chernikov's notes, Example 2.7. For this we use the following.
Theorem (Kovari, Sos, Turan, 1954)
Every K2,2-free bipartite graph with n+n vertices
has O(n3/2) edges.
Example
Fix a prime number p. For every power q of p there is a field Fq with q elements.
Consider the set system Fq where:
- the domain consists of points in Fq² and lines in Fq² (solution sets of equations y=ax+b).
- the family consists of two-element sets {p,ℓ} where ℓ is a line and p is a point in ℓ.
- πFq(m) = O(m^{3/2}). To see this, pick an m-element subset A.
The restriction Fq∩A
consists of:
- the empty set (1 element),
- singletons (m elements),
- and of pairs {p,ℓ} such that p,ℓ∈A and p∈ℓ.
The number of pairs is the number of edges in the bipartite graph GA
whose parts are the points p∈A, and the lines ℓ∈A, and edges denote incidence. The graph GA is K2,2-free,
so it has at most O(m^{3/2}) edges.
Altogether, ∣Fq∩A∣ ≤ 1 + m + O(m^{3/2}) = O(m^{3/2}).
On the other hand, ∣Fq∣ ≥ q³, and
there are at most 2q² points and lines in total.
Therefore, ∣Fq∣ ≥ Ω((∣points∣+∣lines∣)^{3/2}).
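The point-line system can be built explicitly for a small prime; the sketch below (ours, with the illustrative choice q=5) verifies the K2,2-freeness used above, together with the incidence count.

```python
from itertools import combinations

q = 5  # a prime, so arithmetic mod q is a field
points = [(x, y) for x in range(q) for y in range(q)]
lines = [(a, b) for a in range(q) for b in range(q)]  # the line y = a*x + b

def incident(p, l):
    (x, y), (a, b) = p, l
    return y == (a * x + b) % q

# Two distinct points determine at most one line y = a*x + b, so the
# point-line incidence graph contains no K_{2,2}.
for p1, p2 in combinations(points, 2):
    assert sum(incident(p1, l) and incident(p2, l) for l in lines) <= 1

incidences = sum(incident(p, l) for p in points for l in lines)
print(incidences)  # q^3 = 125: each of the q^2 lines contains q points
```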
VC-theorem and PAC learning
Let D be a fixed domain.
For a multiset S⊂D and a set F⊂D,
write AvS(F) for the fraction of points of S that belong to F:
AvS(F) = #{s∈S : s∈F} / ∣S∣.
Definition (ε-approximation).
Fix a probability measure μ on D.
A multiset S is an ε-approximation for μ on a (measurable) set F⊂D if
∣μ(F)−AvS(F)∣≤ε.
A multiset S is an ε-approximation for μ on a set family F with domain F, if S is an ε-approximation for μ on F, for each F∈F.
Recall the weak law of large numbers:
Theorem (Weak law of large numbers)
Fix a probability measure μ on a domain D and a measurable set F⊂D.
For every ε>0 and number n, a randomly and independently selected sequence S of n elements of D
is an ε-approximation for μ on F with probability at least
1 − 1/(4nε²).
The VC-theorem is a uniform version, for an entire family F of small VC-dimension. We assume here that the domain is finite. This assumption can be removed at the cost of some additional assumptions.
Theorem (VC-theorem)
Fix a probability measure P on a finite domain D and a family F of subsets of D.
For every ε>0 and number n, a randomly and independently selected sequence of n elements of D
is an ε-approximation for P on F with probability at least
1 − 8πF(n)⋅exp(−nε²/32).
For the proof, see e.g. Chernikov's lecture notes.
Corollary
For every fixed d∈N and ε>0 there is a number
N(d,ε) such that
if F has VC-dimension d and P is a probability distribution on D,
then there exists an ε-approximation for P on F
of size at most N(d,ε).
Moreover, we can take
N(d,ε) = O((d/ε²)⋅log(d/ε)).
Proof of Corollary.
By Sauer-Shelah-Perles we have that πF(n) ≤ O(n^d).
The key point is that the failure probability in the VC-theorem,
O(n^d)⋅exp(−nε²/32),
converges rapidly to 0 as n→∞.
In particular, we can find N(d,ε) large enough so that this number
is smaller than any fixed constant c>0. A more careful analysis shows that
O((d/ε²)⋅log(d/ε))
suffices. See Chernikov's lecture notes. □
Let μ be a probability distribution on (a σ-algebra on) D and let F be a system of (measurable) subsets of D.
For ε>0, a set S⊂D is called an ε-net
for μ on F,
if every set F∈F with μ(F)≥ε
has a nonempty intersection with S.
Lemma Every ε-approximation is an ε-net.
Proof. For every F∈F, if S has an empty intersection with F
then AvS(F)=0. If S is an ε-approximation, this implies that μ(F)≤ε. □
Corollary
For every fixed d∈N and ε>0 there is a number
N(d,ε) such that
if F has VC-dimension d and P is a probability distribution on D,
then there exists an ε-net for P on F
of size at most N(d,ε). In fact, a random (with probability P) sample of N(d,ε) elements of D is an ε-net, with probability at least 1−ε.
Moreover, we can take
N(d,ε) = O((d/ε²)⋅log(d/ε)).
In fact, for ε-nets (as opposed to ε-approximations) one can improve the bound above and get
N(d,ε) = O((d/ε)⋅log(d/ε)).
Although we will use
this bound in the sequel, the improvement
over the previous bound will not be relevant to us.
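The ε-net corollary is easy to probe empirically. The simulation below is our own illustration (the sample size 200 and the sampling without replacement are arbitrary simplifying choices, not the optimal bound): it checks how often a uniform random sample hits every interval of measure ε in a discrete domain.

```python
import random

random.seed(0)

# Domain {0,...,999} with the uniform distribution; the "heavy" ranges are
# all intervals of measure exactly eps = 0.1.
n, eps = 1000, 0.1
width = int(eps * n)
heavy_ranges = [range(a, a + width) for a in range(n - width)]

def is_eps_net(sample):
    """True if the sample meets every range of measure >= eps."""
    return all(any(s in r for s in sample) for r in heavy_ranges)

N = 200       # sample size; for intervals d = 2, so N ~ (d/eps) log(1/eps)
trials = 20
hits = sum(is_eps_net(random.sample(range(n), N)) for _ in range(trials))
print(hits, "of", trials, "random samples were eps-nets")
```

Missing a fixed interval of measure 0.1 with 200 samples has probability roughly 0.9^200, so essentially every trial succeeds, in line with the corollary.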
PAC learning
A concept is a function f:D→[0,1].
A concept class is a family of concepts on D.
We restrict to {0,1}-valued functions (aka sets),
which can be seen as a classification problem.
So now concept classes are just set families.
A labeled sample is a multiset S⊂D with a partition S=S+∪S−.
Denote by SN(D) the set of all labeled samples of size N.
A learning map is a function f that maps every N-sample S∈SN(D) to a generalization f(S)⊂D.
Ideally, f(S) should be consistent with S, that is, S+=S∩f(S) and S−=S∖f(S), but this does not need to happen (it is even desirable to allow errors in the labelling of S).
Intuition See the papaya intuition from Understanding Machine Learning.
Overfitting The trivial learning map picks f(S) to be S+, which memorizes the sample but generalizes poorly.
To avoid this, we restrict to some family F
of concepts that we are willing to consider,
and require that the range of f should be contained in F.
Empirical Risk Minimization is the learning approach in which
we pick f(S) to be any concept C∈F
that minimizes the number of errors on the training set.
That is, the number of points in S that belong to
S+△C is the smallest possible.
Here F is a predefined set family (representing our knowledge of the world).
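A minimal instance of Empirical Risk Minimization, sketched in Python (the threshold class and the brute-force minimization are our illustrative choices): the class of threshold concepts has small VC-dimension, and ERM tolerates a mislabeled sample point.

```python
# Hypothesis class: threshold concepts {x : x >= t} on the domain {0,...,9}.
domain = list(range(10))
concepts = [frozenset(x for x in domain if x >= t) for t in range(11)]

def erm(sample):
    """Return a concept minimizing the empirical error on the labeled
    sample, given as a list of (point, boolean label) pairs."""
    def errors(C):
        return sum((x in C) != label for x, label in sample)
    return min(concepts, key=errors)

# Labels come from the concept {x >= 5}, with one label flipped.
sample = [(x, x >= 5) for x in domain]
sample[2] = (2, True)  # a labelling error
best = erm(sample)
print(sorted(best))  # ERM still recovers {5,...,9}: one error beats many
```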
PAC learnability
Say that F is PAC-learnable if
for every ε,δ>0 there exists N=N(ε,δ)
and a learning map f:SN(D)→2D such that
for every concept C∈F and probability distribution μ on D,
if we select N elements of D independently, each with distribution μ,
and label them with respect to C, obtaining a labeled sample S=S+⊎S−,
then we will have that μ(f(S)△C)≤ε with probability
at least 1−δ.
If moreover f always returns concepts in F, then we say that F is
properly PAC learnable.
We say that F is consistently PAC learnable if every learning map f:SN(D)→F
that always outputs a concept consistent with the input sample works.
Theorem (Fundamental theorem of PAC learning).
Let F be a concept class.
The following conditions are equivalent:
- F is PAC learnable,
- VC(F) is finite.
Remark There are some unstated assumptions in this statement.
Either we should assume some measurability conditions, similar to the
ones in the VC theorem for infinite domains, or we should assume that the domain is finite.
Then the two implications in the theorem should be read as follows. Bottom-up implication: the sample complexity
N(ε,δ) can be bounded in terms of ε, δ, and the VC-dimension of the family F.
Top-down implication: the VC-dimension of F can be bounded in terms of the function N(⋅,⋅). More precisely, we
can bound it in terms of N(1/3,1/3).
Proof
For the bottom-up implication, we may assume for simplicity
(by taking the min) that δ=ε.
Let N=N(d,ε) be the number obtained from the ε-net Corollary.
The learning map f:SN(D)→F is defined so that f(S+,S−) is any concept C∈F that contains S+ and is disjoint from S−.
To prove that this learning map f has the required properties,
we consider the set system
F△C:={F△C:F∈F},
where △ denotes the symmetric difference of sets.
It was shown in the exercises that F△C has the same VC-dimension as F.
By the ε-net corollary, a random sample of N elements of D is an ε-net for the family F△C,
with probability at least 1−ε.
This easily implies that the learning map f has the required property.
The top-down implication was shown in the exercises.
See also Chernikov's notes.
Transversals and Helly-type properties
Transversals
Definition Fix a set system F on a domain D.
- A transversal T is a set T⊆D that intersects every F∈F
- The transversal number of F is the smallest size of a transversal, and is denoted τ(F)
- A packing G is a set G⊆F of pairwise disjoint sets.
- The packing number is the largest size of a packing, and is denoted ν(F).
Clearly τ(F)≥ν(F).
It is NP-hard to compute τ(F) and ν(F).
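On small instances both numbers can still be computed exhaustively. Here is a brute-force Python sketch (ours; exponential time, for illustration only), run on the triangle system, where the gap τ > ν already appears.

```python
from itertools import combinations

# Set system: the three edges of a triangle, over the vertex domain.
domain = [0, 1, 2]
family = [{0, 1}, {1, 2}, {0, 2}]

def transversal_number(domain, family):
    """Smallest size of a set of elements meeting every set (tau)."""
    for k in range(len(domain) + 1):
        if any(all(set(T) & F for F in family)
               for T in combinations(domain, k)):
            return k

def packing_number(family):
    """Largest number of pairwise disjoint sets in the family (nu)."""
    for k in range(len(family), 0, -1):
        if any(all(not (A & B) for A, B in combinations(G, 2))
               for G in combinations(family, k)):
            return k
    return 0

print(transversal_number(domain, family), packing_number(family))  # 2 1
```

Here τ=2 > ν=1, while the fractional relaxations defined below both take the value 3/2 (weight 1/2 on each vertex, respectively on each edge).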
For this reason, we sometimes use the following, approximate notions.
Definition Fix a set system F on a finite domain D.
- A fractional transversal is a function f:D→[0,1]
such that for every F∈F we have
∑x∈F f(x) ≥ 1.
Its total weight is ∑x∈D f(x).
- The fractional transversal number of F is the infimum
of the total weights of all fractional transversals, and is denoted τ∗(F).
- A fractional packing is a function g:F→[0,1]
such that for every x∈D we have
∑F∈F: x∈F g(F) ≤ 1.
Its total weight is ∑F∈F g(F).
- The fractional packing number is the supremum of the total weights of all fractional packings, and is denoted ν∗(F).
The following lemma is straightforward.
Lemma For every set system F on a finite domain D we have:
τ∗(F)≥ν∗(F)
The following fact is a consequence (or restatement) of the duality theorem for linear programs. It is an easy consequence of Farkas' lemma.
Fact For every set system F on a finite domain D we have
τ∗(F)=ν∗(F).
Moreover, this number is rational, computable in polynomial time,
and is attained both by a
fractional transversal f:D→[0,1] with rational values, and by a
fractional packing g:F→[0,1] with rational values.
Corollary
For every set system F on a finite domain D with VCdim(F)=d, we have:
τ(F) ≤ O(d⋅τ∗(F)⋅log τ∗(F)).
Proof Follows from the existence of ε-nets
of size O((d/ε)⋅log(1/ε)).
Namely, pick a fractional transversal f:D→[0,1]
of total weight r,
and normalize it to get a probability measure μ
with μ(x)=f(x)/r.
For a set F∈F
we have that ∑x∈F f(x) ≥ 1 since f is a fractional transversal, so μ(F)≥1/r.
Let ε=1/r.
By the ε-net theorem
there is an ε-net T for μ on F
of size O(d⋅r⋅log r). Since μ(F)≥1/r=ε for all F∈F and T is an ε-net, it follows that
T intersects every F∈F.
Helly-type theorems
Fact (Helly's Theorem) Fix d≥1.
Let F be a finite family of convex sets in Rd.
If every d+1 sets in F have a point in common,
then all the sets in F have a point in common.
Definition A family G has the k-Helly property
if for every finite subfamily F⊂G the following holds: if every k sets in F have a point in common, then all the sets in F have a point in common.
For instance, the family of convex subsets of Rd has the (d+1)-Helly property.
Families of bounded VC-dimension don't necessarily have the k-Helly property, for any finite k.
For example, for an infinite set V, the family of all sets of the form V−{a}, for a∈V, has VC-dimension 1, but does not have the k-Helly property, for any k.
However, families of bounded VC-dimension satisfy a related property.
Say that a family F has the (p,q)-property
if among every p subsets of F,
some q among them have a point in common.
For example, the k-Helly property is the same as the (k,k)-property.
Note that the conclusion in Helly's theorem is that F has a transversal of size 1, that is τ(F)≤1. The following is a generalization.
Fact (Alon,Kleitman). Let p,q,d be integers with p≥q≥d+1.
Then there is a number N=N(p,q,d) such that for every finite family F of convex subsets of Rd,
if F has the (p,q)-property, then τ(F)≤N.
Now this statement generalizes to families of bounded VC-dimension,
and can be derived from the proof of Alon and Kleitman, as observed by Matousek.
Fact (Alon,Kleitman and Matousek). Let p,q,d be integers with p≥q≥d+1.
Then there is a number N=N(p,q,d) such that for every finite family F with VC-dim∗(F)<d+1,
if F has the (p,q)-property, then τ(F)≤N.
Recall that VC-dim∗(F) is the dual VC-dimension of F,
defined as the VC-dimension of F∗.
Equivalently, it is the maximal number k of subsets in F such that
every cell in the Venn diagram of those sets is nonempty (occupied by some element in the domain).
We reprove a duality result which is a corollary
of the (p,q)-theorem.
We present a self-contained proof.
Let E⊂A×B be a binary relation.
For a∈A denote E(a):={b∈B∣(a,b)∈E}
and for b∈B denote E∗(b):={a∈A∣(a,b)∈E}.
Define the VC-dimension of a binary relation E as the maximum of the VC-dimension of two set systems:
(B,{E(a)∣a∈A})and(A,{E∗(b)∣b∈B}).
Here's the duality result.
Theorem (Corollary of the (p,q)-theorem)
For every d there is a number k≤O(d) with the following property.
Let E⊂A×B be a binary relation with VCdim(E)≤d,
with A,B finite.
Then at least one of two cases holds:
- there is a set B′⊂B with ∣B′∣≤k
such that for all a∈A we have E(a)∩B′≠∅
(that is, B′ dominates A wrt E), or
- there is a set A′⊂A with ∣A′∣≤k
such that for all b∈B we have E∗(b)∩A′≠A′
(that is, A′ dominates B wrt ¬E).
We first state a corollary of von Neumann's minimax theorem (also of Farkas' lemma, the Hahn-Banach theorem, or of the strong duality theorem for linear programming).
Theorem (Minimax Theorem)
Let E⊂A×B be a binary relation with A,B finite, and let α∈R.
Then exactly one of the following two cases holds:
- there is some probability distribution ν on B such that
ν(E(a))≥α holds for all a∈A
- there is some probability distribution μ on A such that
μ(E∗(b))<α
holds for all b∈B.
Proof of Corollary of the (p,q)-theorem
Set α:=1/2 and ε:=1/3;
the point is that 0<α±ε<1.
Fix d∈N, and let k be the number from
the VC-theorem, with k ≤ O((d/ε²)⋅log(1/ε)) = O(d).
By the Minimax Theorem applied to α=1/2,
one of the two cases below holds.
Case 1: There
is some probability distribution ν on B
such that ν(E(a))≥1/2 holds for all a∈A.
By the VC-theorem applied to ε=1/3, there is a multiset
B′⊂B with ∣B′∣≤k
such that
∣AvB′(E(a)) − ν(E(a))∣ ≤ 1/3 for all a∈A.
Pick a∈A.
Since ν(E(a))≥1/2,
we have that AvB′(E(a))>0,
so B′∩E(a) is nonempty.
Therefore, the first condition in the statement is satisfied.
Case 2: There is some probability distribution
μ on A such that μ(E∗(b))<1/2 holds for all b∈B.
By the VC-theorem again, there is a multiset A′⊂A with ∣A′∣≤k such that
∣AvA′(E∗(b)) − μ(E∗(b))∣ ≤ 1/3 for all b∈B.
Pick b∈B.
Since μ(E∗(b))<1/2,
we have that AvA′(E∗(b))<1,
so A′ is not contained in E∗(b). Therefore, the second condition in the statement is satisfied.
■
Sample compression schemes
This is based on the original paper by Moran and Yehudayoff.
Fix a set family F on a finite domain D.
Let Sd(F) denote the set of labeled samples of size
d, that is, pairs S:=(S+,S−) of multisets of elements of D with ∣S+∣+∣S−∣=d,
such that there exists some F∈F containing S+ and disjoint from S−.
For a labeled sample S and set Z⊂D, let S[Z] denote
the sample (S+∩Z, S−∩Z).
If S and T are two samples, we say that T is a sub-sample of S
if T+⊂S+ and T−⊂S−.
Let Sω(F) denote the union of all Sd(F), for d≥0.
Fix a family F of VC-dimension d, for some fixed d.
A sample compression scheme for F, of size k, consists of two
mappings: a compression mapping, and a reconstruction mapping,
defined as follows.
Compression
The compression map maps each
Y:=(Y+,Y−)∈Sω(F)
to a pair (Z,w), where
Z is a sub-sample of Y of size k and w∈{0,1}^k is a bit-sequence describing some auxiliary information.
Reconstruction
A reconstruction function maps a given pair (Z,w), where Z∈Sk(F) and w∈{0,1}^k, to a concept C⊆D.
We require that the composition of the two mappings – start with Y∈Sω(F), compress it to some (Z,w), reconstruct some concept C – returns a concept C⊆D that contains Y+ and is disjoint from Y−.
Recall that VCdim∗(F)
is the dual VC-dimension of F.
Theorem Let F be a set system on a finite domain, and d=VCdim(F) and d∗=VCdim∗(F). Then F has a sample compression scheme of size O(d⋅d∗).
Proof
By the VC-theorem, there is s=O(d) and a learning map
f:Ss(F)→F such that for every F∈F
and every probability distribution μ on D there is Z⊂supp(μ):={v∈D∣μ(v)>0} with ∣Z∣≤s
such that
μ(F△f(F∩Z, Z−F)) ≤ 1/3.
Let (Y+,Y−)∈Sω(F), and let Y=Y+∪Y−.
The key lemma is the following.
Lemma
There are T≤O(d∗) sets Z1,…,ZT⊆Y+∪Y−, each of size at most s,
such that for every x∈Y+∪Y− we have:
x∈Y+ ⟺ #{t∈{1,…,T} ∣ x∈f(Y[Zt])} > T/2.
With this lemma,
obtaining the compression scheme is straightforward, as we now describe.
Compression
We assume an arbitrary total order on D is fixed.
To compress (Y+,Y−),
we map it to (Z,w) where
Z is the union of the sets as in the statement of the lemma,
and w is a bit-sequence of length T,
whose i-th bit indicates whether the i-th element of Z (in the fixed order on D) belongs to Zi.
This needs to be fixed slightly
Decompression
Given (Z,w), we first compute the sets Z1,…,ZT,
using Z and the bit-sequence w.
We then define a concept C as the set of all x∈D
such that
#{t∈{1,…,T} ∣ x∈f(Y[Zt])} > T/2.
It follows immediately from the lemma that this is a sample compression scheme of size O(dd∗).
We now prove the lemma.
Proof of lemma
Denote
G:=GY={f(Y+∩Z,Y−∩Z)∣Z⊂Y,∣Z∣≤s}⊆F.
For a concept C∈G and element x∈Y+∪Y−,
say that C is correct on x
if x∈C⟺x∈Y+ holds.
By definition of f,
for every probability distribution μ on Y there is
G∈G such that
μ({x∈Y ∣ G is correct on x}) ≥ 2/3. By the Minimax theorem,
there is a probability distribution ν on G
such that for every x∈Y we have
ν({G∈G ∣ G is correct on x}) ≥ 2/3. By the VC-theorem,
there is a 1/8-approximation
for ν on G∗:
a multiset H⊂G of size T≤O(d∗) such that
for all x∈Y we have:
#{H∈H ∣ H is correct on x} / #H ≥ ν({G∈G ∣ G is correct on x}) − 1/8 ≥ 2/3 − 1/8 > 1/2.
Choosing Z1,…,ZT so that the concepts f(Y[Zt]) enumerate the multiset H,
this proves the lemma, and concludes the theorem. ■
Haussler's packing theorem
Fix a domain D and a probability distribution P on D.
For two sets X,Y⊆D, write distP(X,Y)
to denote the measure P[X△Y] of the symmetric difference of X and Y.
This defines a metric on 2D.
Fix ε>0.
We are interested in the problem of packing a maximal number of pairwise disjoint (closed) balls of radius ε in this metric.
Equivalently, of finding the maximal size of
a 2ε-separated subset S⊆F,
that is, a set S of elements with mutual distances larger than 2ε.
Example. Suppose we are packing discs of radius 0<ε<1 with center in the n×n square in R2.
For a disc D with center in [0,n]×[0,n],
the area of D∩([0,n]×[0,n]) is at least πε²/4 = Ω(ε²).
Thus, we can pack at most O((n/ε)²) disjoint discs.
Haussler's packing theorem says that for set systems F of VC-dimension d, the behaviour is similar as in the d-dimensional Euclidean space.
Theorem
Let d>1 and C be constants, and let F be a set system on a finite domain D whose primal shatter function satisfies πF(m) ≤ Cm^d for all m∈N. Let 0<ε<1 and fix a probability distribution on D. If F is ε-separated then ∣F∣ ≤ O(1/ε^d).
Corollary
Let F be a set system on a
finite domain D and d=VCdim(F).
Let k be a positive integer.
If ∣A△B∣ ≥ k for all distinct A,B∈F
then ∣F∣ ≤ O((n/k)^d), where n=∣D∣.
This follows from the theorem
applied to ε=k/n and the uniform distribution on D,
by the Sauer-Shelah-Perles lemma. Conversely, for k=1, the Corollary recovers the SSP lemma, although with worse hidden constants.
Our proof of Haussler's theorem follows a recent proof by
Kupavskii and Zhivotovskiy.
We will prove the following lemma, which quickly yields the theorem.
Lemma (∗). Fix 0<δ<1 and ε>0.
Let F be an ε-separated set system on D, and let μ
be a probability distribution on D. Fix m = 2VCdim(F)/(εδ)
and let S be a sample of m independently chosen elements of D according to the distribution μ.
Then
#F ≤ ES[#(F∩S)] / (1−δ).
In particular, if πF(m) ≤ Cm^d for all m,
we have that
#F ≤ πF(m)/(1−δ) ≤ C⋅m^d/(1−δ) = (C/(1−δ))⋅(2VCdim(F)/(εδ))^d = (C/((1−δ)δ^d))⋅(2VCdim(F)/ε)^d
holds for all 0<δ<1.
The minimum, over 0<δ<1, of
1/((1−δ)δ^d) is attained for δ = d/(d+1),
and is at most e(d+1).
Hence we get
#F ≤ C⋅e⋅(d+1)⋅(2VCdim(F)/ε)^d ≤ O(1/ε^d),
which proves the theorem. It therefore remains to prove the lemma.
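The optimization over δ can be double-checked numerically; a small sketch of ours (grid search, with the illustrative choice d=4):

```python
import math

# Minimize 1/((1-delta)*delta^d) over 0 < delta < 1 on a grid, and compare
# with the claimed minimizer delta = d/(d+1) and the bound e*(d+1).
d = 4
f = lambda t: 1.0 / ((1 - t) * t ** d)
best = min((k / 1000 for k in range(1, 1000)), key=f)
print(best)                          # 0.8 = d/(d+1)
print(f(best) <= math.e * (d + 1))   # True
```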
For a set S⊆D and F∈F denote by
F∣S the labeled sample F∣S:=(S+,S−)=(F∩S,S−F).
Recall that, given ε>0, the VC-theorem implies the existence of a learning function
f:Sm(F)→2^D,
where m = O((d/ε)⋅log(1/ε)),
such that for every F∈F and random sample S of m elements of D
chosen independently with a given probability distribution μ,
we have that
μ[f(F∣S)△F] < ε.
Using this, it is easy (and instructive) to prove Lemma (∗) with a weaker bound,
where instead of m = 2VCdim(F)/(εδ)
we use m = O((d/ε)⋅log(1/ε)).
This would yield the bound
#F ≤ O((1/ε)⋅log(1/ε))^d,
which was obtained originally by Dudley.
[To obtain Dudley's bound, proceed as in the proof Lemma (∗) presented below, but instead of the learning function
obtained in the proposition below, use the learning function
obtained from the VC-theorem.]
In order to obtain the improved bound as stated in Lemma (∗), we need a learning function
with the number of sample elements of the form m = O_δ(1/ε),
rather than m = O_δ((1/ε)⋅log(1/ε)). This might come at the cost of a worse dependency on δ, which we don't care about, because we consider δ to be fixed, say δ = 1/2. This is achieved by the following proposition.
Proposition Let F be a set system on a domain D.
Fix m = 2VCdim(F)/(εδ).
There is a "learning function"
f:Sm(F)→2D such that the following holds.
For every F∈F and random
sample S of m elements of D chosen independently with a given probability distribution μ,
we have that
μ[f(F∣S)△F] < ε/2
holds with probability at least 1−δ over the sample.
The following lemma is key.
Given a set system F,
consider its Hamming graph HF,
with vertices F and edges connecting two sets F,G∈F
if and only if ∣F△G∣=1.
The edge density of a graph (V,E) is ∣E∣/∣V∣.
Lemma The Hamming graph HF of F has edge density at most VCdim(F).
The proof of the lemma is by repeating the proof of the Sauer-Shelah-Perles lemma and observing that it also yields this conclusion. See also Geometric Discrepancy by Matousek, Lemma 5.15.
This implies that the same holds for every induced subgraph of HF.
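The lemma can be tested directly on the extremal example from the Sauer-Shelah-Perles discussion; this sketch (ours) builds the Hamming graph of the family of all ≤d-element subsets and checks that its edge density is at most d.

```python
from itertools import combinations

# Family of all subsets of size <= d of {0,...,n-1}; its VC-dimension is d.
n, d = 6, 2
family = [frozenset(A) for k in range(d + 1)
          for A in combinations(range(n), k)]

# Edges of the Hamming graph: pairs of sets with symmetric difference 1.
edges = sum(len(A ^ B) == 1 for A, B in combinations(family, 2))
print(len(family), edges, edges / len(family))  # density 36/22 <= d = 2
```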
Fact
Let G be a graph such that every induced subgraph H of G
has edge density at most d. Then there is an orientation of the edges of G
such that every node has outdegree at most d.
Remark This is very easy to prove with 2d instead of d:
repeatedly remove a vertex v of degree at most 2d,
and orient its edges towards the remaining vertices.
This can be improved to d using Hall's marriage theorem
(see Alon, Tarsi '92).
Corollary
There is an orientation of the edges of HF
such that every node has outdegree at most VCdim(F).
Proof of Proposition
Let d0=VCdim(F).
Fix a probability distribution μ on D.
We will show that there is a function f:Sm(F)→2D
such that
ES[μ(f(F∣S)△F)] < d0/m
where the expectation is with respect to S.
Markov's inequality then gives the conclusion:
PS[μ(f(F∣S)△F) ≥ ε/2] ≤ ES[μ(f(F∣S)△F)]/(ε/2) < 2d0/(mε) = δ.
We proceed to defining the learning function f.
Fix a set S={s1,…,sm}.
Suppose that we are given the labelling