Selected topics in combinatorics revolving around Vapnik-Chervonenkis dimension

Spring 2023

Lecturer: Szymon Toruńczyk

What is this lecture about? A big portion of the lectures will be concerned with the Vapnik-Chervonenkis dimension. This fundamental notion has numerous applications in computational learning theory, combinatorics, logic, computational geometry, and theoretical computer science. We will develop the foundations of the theory and study its applications.

Outline

  1. Set systems, Vapnik-Chervonenkis dimension, Sauer-Shelah-Perles lemma
  2. Vapnik-Chervonenkis theorem, fundamental theorem of PAC learning, ε-nets
  3. Helly-type properties, (p,q)-theorem
  4. Sample compression schemes
  5. Haussler's packing theorem
  6. Zarankiewicz's problem, Kovari-Sos-Turan theorem, incidence geometry, Szemerédi-Trotter theorem, Kovari-Sos-Turan for bounded VC-dimension
  7. Matchings with low crossing number
  8. Geometric range searching
  9. Szemerédi regularity
  10. Regularity theorem for graphs of bounded VC-dimension
  11. Regularity theorem for stable graphs
  12. Permutation classes.

Literature

Additional sources

Specific topics

VC-dimension

Set systems

Definition: A set system consists of a domain $X$ (a set) and a family $\mathcal F\subseteq P(X)$ of subsets of $X$. For $Y\subseteq X$, write $\mathcal F\cap Y$ for the set system with domain $Y$ and family $\{F\cap Y: F\in\mathcal F\}$.

Remark A set system is the same as a hypergraph.

Examples:

Usually $X=\bigcup\mathcal F$; in that case we just mention $\mathcal F$.

VC-dimension and shatter function

The goal is to define a measure of complexity of a set system. Machine learning intuition: understand which concept classes are learnable. When does Ockham's razor work well?

Definition (VC-dimension) [Vapnik, Chervonenkis, 71] Let $\mathcal F$ be a set system with domain $X$. A set $Y\subseteq X$ is shattered by $\mathcal F$ if $\mathcal F\cap Y=2^Y$. The VC-dimension of $\mathcal F$ is the maximum size $d$ of a set $Y\subseteq X$ shattered by $\mathcal F$, or $\infty$ if arbitrarily large sets are shattered.

Definition (Shatter function). Let $\mathcal F$ be a set system on $X$. Let

$$\pi_{\mathcal F}(m)=\max_{Y\subseteq X,\ |Y|\le m}|\mathcal F\cap Y|.$$
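The notions above can be checked by brute force on small systems. Below is a minimal Python sketch; the helper names (`restriction`, `vc_dimension`, `shatter_function`) and the example of intervals on a 5-element domain are ours, not from the lecture:

```python
from itertools import combinations

def restriction(family, Y):
    """All distinct traces F ∩ Y of the family on the set Y."""
    return {frozenset(F & Y) for F in family}

def shatters(family, Y):
    return len(restriction(family, Y)) == 2 ** len(Y)

def vc_dimension(family, domain):
    """Largest size of a subset of the domain shattered by the family."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(family, set(Y)) for Y in combinations(domain, k)):
            d = k
    return d

def shatter_function(family, domain, m):
    """pi_F(m); the trace count is monotone in Y, so subsets of size
    exactly min(m, |domain|) suffice."""
    k = min(m, len(domain))
    return max(len(restriction(family, set(Y)))
               for Y in combinations(domain, k))

# Intervals on a 5-element linear order: VC-dimension 2, pi(3) = 7.
domain = list(range(5))
intervals = [set(range(i, j)) for i in range(6) for j in range(i, 6)]
```

For the intervals above, `vc_dimension(intervals, domain)` is 2: two points are shattered, but no interval can keep the two outer points of a triple while excluding the middle one.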

Examples

Observation If $\mathcal F$ has infinite VC-dimension then $\pi_{\mathcal F}(m)=2^m$ for all $m$.

Definition (Dual set system). Let $\mathcal F$ be a set system with domain $X$. The dual system $\mathcal F^*$ is the system with domain $\mathcal F$ and family $\{x^*:x\in X\}$, where $x^*=\{F\in\mathcal F: x\in F\}\subseteq\mathcal F$.

Another view on set systems: bipartite graphs $(L,R,E)$, $E\subseteq L\times R$ (multiplicities are ignored). The dual is then $(R,L,E^{-1})$. In particular, $(\mathcal F^*)^*\cong\mathcal F$, up to removing twins (elements of the domain that belong to exactly the same sets).

$\pi_{\mathcal F^*}$ is denoted $\pi^*_{\mathcal F}$ and is called the dual shatter function of $\mathcal F$. $VC(\mathcal F^*)$ is denoted $VC^*(\mathcal F)$ and is called the dual VC-dimension of $\mathcal F$.

$VC^*(\mathcal F)$ is the maximal number of sets in $\mathcal F$ such that every cell in their Venn diagram is occupied by some element of the domain. $\pi^*_{\mathcal F}(m)$ is the maximal number of occupied cells in the Venn diagram of ${\le}m$ sets from $\mathcal F$.
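The dual shatter function can be computed directly by counting occupied Venn cells. A sketch (the function name is ours), reusing the intervals example:

```python
from itertools import combinations

def dual_shatter_function(family, domain, m):
    """pi*_F(m): maximal number of occupied cells in the Venn diagram of
    m sets; a cell is the signature (x in F_1, ..., x in F_m) of a point."""
    m = min(m, len(family))
    best = 0
    for Fs in combinations(family, m):
        cells = {tuple(x in F for F in Fs) for x in domain}
        best = max(best, len(cells))
    return best

domain = list(range(5))
intervals = [set(range(i, j)) for i in range(6) for j in range(i, 6)]
```

For two intervals all four cells can be occupied (e.g. $[0,1]$ and $[1,2]$ on the points $0,1,2,3$), so `dual_shatter_function(intervals, domain, 2)` is 4.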

Example If $\mathcal F$ is the family of half-planes in $\mathbb R^2$, then $\pi^*_{\mathcal F}(m)$ is the maximal number of regions into which $m$ half-planes can partition the plane. For example, $\pi^*_{\mathcal F}(2)=4$ and $\pi^*_{\mathcal F}(3)=7$.

Exercise Prove that $\pi^*_{\mathcal F}(m)=\binom{m+1}2+1$.

Exercise If $VC(\mathcal F)\le d$ then $VC^*(\mathcal F)<2^{d+1}$.

In all examples, we have either a polynomial or an exponential shatter function – nothing in between.

Sauer-Shelah-Perles Lemma. [Sauer (72), Shelah and Perles (72), Vapnik-Chervonenkis (71)] Let $\mathcal F$ be a set system on $n$ elements of VC-dimension at most $d$. Then

$$|\mathcal F|\le\binom n0+\ldots+\binom nd=:\Phi_d(n).$$

It follows that for all $m$,

$$\pi_{\mathcal F}(m)\le\Phi_d(m)\le\left(\frac{em}d\right)^d=O(m^d).$$

Note. This bound is tight. Consider the set system $\mathcal F$ of all ${\le}d$-element subsets of an infinite set $X$. Clearly $\pi_{\mathcal F}(m)=\Phi_d(m)$.
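One can also test the lemma empirically: random set systems never beat the bound. A quick Python check (the brute-force `vc_dim` helper and all parameters are our choices):

```python
import random
from itertools import combinations
from math import comb

def phi(d, n):
    """Phi_d(n) = C(n,0) + ... + C(n,d)."""
    return sum(comb(n, k) for k in range(d + 1))

def vc_dim(family, domain):
    """Largest size of a shattered subset of the domain (brute force)."""
    d = 0
    for k in range(1, len(domain) + 1):
        for Y in combinations(domain, k):
            if len({frozenset(F & set(Y)) for F in family}) == 2 ** k:
                d = k
                break
    return d

# Random families of subsets of a 7-element domain satisfy the bound.
random.seed(0)
domain = list(range(7))
for _ in range(20):
    family = [{x for x in domain if random.random() < 0.5}
              for _ in range(30)]
    distinct = len({frozenset(F) for F in family})
    assert distinct <= phi(vc_dim(family, domain), len(domain))
```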

Proof. Let $X$ be the domain of the set system, and pick any $x\in X$.

Denote $\mathcal F':=\mathcal F\cap(X-\{x\})$. How much does $\mathcal F'$ differ from $\mathcal F$? There is a natural mapping from $\mathcal F$ to $\mathcal F'$, namely $F\mapsto F-\{x\}$. The preimage of a set $G\in\mathcal F'$ under this mapping consists of exactly two sets if both $G\in\mathcal F$ and $G\cup\{x\}\in\mathcal F$, and of exactly one set otherwise.

Denote

$$\mathcal M_x:=\{G\in\mathcal F\mid x\notin G,\ G\cup\{x\}\in\mathcal F\}.$$

It follows that

$$|\mathcal F|=|\mathcal F'|+|\mathcal M_x|.$$

Note that both $\mathcal F'$ and $\mathcal M_x$ are set systems with domain $X-\{x\}$, of size $n-1$. The key observation is that $\mathcal M_x$ has VC-dimension at most $d-1$. Indeed, if $A\subseteq X-\{x\}$ is shattered in $\mathcal M_x$, then $A\cup\{x\}$ is shattered in $\mathcal F$.

By induction on $n$ and $d$ we have that:

$$|\mathcal F|=|\mathcal F'|+|\mathcal M_x|\le\binom{n-1}{\le d}+\binom{n-1}{\le(d-1)}.$$

Note that

$$\binom{n-1}{\le d}+\binom{n-1}{\le(d-1)}\le\binom n{\le d},$$

since we can surjectively map a ${\le}d$-element subset $A$ of $[n]$ to $A\in\binom{[n-1]}{\le d}$ if $n\notin A$, and to $A-\{n\}\in\binom{[n-1]}{\le(d-1)}$ if $n\in A$. $\blacksquare$
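The final counting step can be verified numerically; in fact the two sides are equal, since the map described is a bijection (Pascal's rule summed over $k\le d$). A quick check, writing `phi(d, n)` for $\binom n{\le d}$:

```python
from math import comb

def phi(d, n):
    """Phi_d(n) = C(n,0) + ... + C(n,d), with phi(-1, n) = 0."""
    return sum(comb(n, k) for k in range(d + 1)) if d >= 0 else 0

# Pascal's rule summed over k <= d gives equality, not just <=.
for n in range(1, 13):
    for d in range(n + 1):
        assert phi(d, n - 1) + phi(d - 1, n - 1) == phi(d, n)
```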

Exercise For every element $v$ of the domain, define a (partial) matching $M_v^{\mathcal F}$ on $\mathcal F$ by

$$M_v^{\mathcal F}=\{\{A,B\}:A,B\in\mathcal F,\ A\triangle B=\{v\}\}.$$

Show that if for every subset $X$ of the domain there is $v\in X$ with $|M_v^{\mathcal F|_X}|\le k$, then $|\mathcal F|\le O(k\cdot n)$, where $n$ is the size of the domain.

Set systems in logical structures

Let $M$ be a logical structure and $\phi(\bar x;\bar y)$ a formula. For a tuple $\bar b\in M^{\bar y}$ denote

$$\phi(M;\bar b):=\{\bar a\in M^{\bar x}:\phi(\bar a,\bar b)\}.$$

Consider the set system $\mathcal F_\phi$ with domain $M^{\bar x}$ and sets $\phi(M;\bar b)$, for $\bar b\in M^{\bar y}$.

Examples

$$\phi(x;y_1,y_2)\equiv y_1\le x\le y_2.$$

Then $\mathcal F_\phi$ is the set system of closed intervals on $\mathbb R$.
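For this example one can confirm on a finite grid that the VC-dimension is 2: any two points are shattered, but for three points the middle one cannot be excluded while keeping the outer two. A sketch (grid size is an arbitrary choice):

```python
from itertools import combinations

# Closed intervals restricted to 8 grid points stand in for F_phi.
points = list(range(8))
intervals = [set(range(a, b + 1)) for a in points for b in points if a <= b]

def shattered(k):
    """Is some k-element set of points shattered by the intervals?"""
    return any(
        len({frozenset(I & set(Y)) for I in intervals}) == 2 ** k
        for Y in combinations(points, k))
```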

$$\phi(x_1,x_2;y_1,y_2,y_3)\equiv(x_1-y_1)^2+(x_2-y_2)^2\le y_3^2.$$

Then $\mathcal F_\phi$ is the set system of discs in $\mathbb R^2$.

$$\phi(x_1,x_2;y_1,y_2,y_3,y_4)\equiv y_1\le x_1\le y_2\land y_3\le x_2\le y_4.$$

Then $\mathcal F_\phi$ is the set system of rectangles in $\mathbb R^2$.

A first-order formula is a formula using $\exists,\forall,\lor,\land,\neg$, where the quantifiers range over elements of the domain.

Example $\phi(x,y)\equiv\exists z.\ y=x+z^2$ is equivalent in $(\mathbb R,+,\cdot,\le,0,1)$ to $x\le y$. So $\le$ is definable in $(\mathbb R,+,\cdot)$. The relation $y=e^x$ is not definable in $(\mathbb R,+,\cdot)$.

Example The formula $\delta_1(x,y)\equiv(x=y)\lor E(x,y)$ expresses that $\mathrm{dist}(x,y)\le 1$, the formula $\delta_2(x,y)\equiv x=y\lor E(x,y)\lor\exists z.\ E(x,z)\land E(z,y)$ expresses that $\mathrm{dist}(x,y)\le 2$, and similarly we can define $\delta_r(x,y)$ expressing $\mathrm{dist}(x,y)\le r$, for any fixed number $r$. There is no first-order formula that expresses the property "$x$ and $y$ are at finite distance" in all graphs. This can be expressed by a formula of monadic second-order logic (MSO).

Theorem (Shelah, 71) Let $M$ be a logical structure. The following conditions are equivalent:

  1. every formula $\phi(x;\bar y)$, with $x$ a single variable, defines a set system $\mathcal F_\phi$ of finite VC-dimension;
  2. every formula $\phi(\bar x;\bar y)$ defines a set system $\mathcal F_\phi$ of finite VC-dimension.

If the above conditions hold, then $M$ is called NIP.

Example $(\mathbb R,+,\cdot,\le)$ is NIP. We use the following result of Tarski:

Theorem (Tarski, 40) The structure $(\mathbb R,+,\cdot,\le,0,1)$ has quantifier elimination: every first-order formula $\phi(x_1,\ldots,x_k)$ is equivalent to a formula without quantifiers.

Such formulas are just boolean combinations of sets of the form

$$\{(a_1,\ldots,a_k)\in\mathbb R^k:p(a_1,\ldots,a_k)\ge 0\}$$

or

$$\{(a_1,\ldots,a_k)\in\mathbb R^k:p(a_1,\ldots,a_k)=0\}$$

where $p(x_1,\ldots,x_k)$ is a polynomial with integer coefficients, such as $p(x_1,x_2)=x_1^2+x_2^2-5$.

Corollary (Tarski, 30) For every first-order formula $\phi(x;y_1,\ldots,y_k)$ and parameters $b_1,\ldots,b_k\in\mathbb R$, the set $\phi(\mathbb R;b_1,\ldots,b_k)\subseteq\mathbb R$ is a union of at most $k_\phi$ intervals, where $k_\phi$ depends only on $\phi$ (and not on $b_1,\ldots,b_k$).

Proof. We may assume that $\phi$ is a boolean combination of $\ell$ sets as above. Let $d$ be the maximum degree of the polynomials involved. A polynomial of degree $d$ in one variable has at most $d$ roots, so each of the $\ell$ sets is a union of at most $d+1$ intervals. A boolean combination of these sets only has breakpoints at the roots of the polynomials, so it is a union of at most $k_\phi:=\ell d+1$ intervals. $\square$

Corollary The structure $(\mathbb R,\cdot,+,\le,0,1)$ is NIP.

Proof. The family of all sets that are unions of at most $k$ intervals cannot shatter a set of $2k+1$ points. The conclusion follows from the result of Shelah. $\square$

Definition A structure $M$ with a total order $<$ and other relations/functions is o-minimal if for every first-order formula $\phi(x;y_1,\ldots,y_k)$ and parameters $b_1,\ldots,b_k$, the set $\phi(M;b_1,\ldots,b_k)$ is a union of finitely many intervals.

It is known that this implies that such a set is a union of at most $n_\phi$ intervals, for some $n_\phi$ that depends only on $\phi$.

Corollary Every o-minimal structure is NIP.

Examples of o-minimal structures

Other examples of NIP structures

Classes of structures

Let $\mathcal C$ be a class of logical structures, for example the class of all planar graphs. A formula $\phi(\bar x;\bar y)$ defines a set system $\mathcal F^M_\phi$ in each structure $M\in\mathcal C$. We say that $\phi$ has VC-dimension ${\le}d$ in $\mathcal C$ if each of the set systems $\mathcal F^M_\phi$ has VC-dimension at most $d$. The shatter function $\pi_{\phi,\mathcal C}(m)$ is defined as the maximum of $\pi_{\mathcal F^M_\phi}(m)$, over all $M\in\mathcal C$. A class $\mathcal C$ is NIP if every formula $\phi(\bar x;\bar y)$ has bounded VC-dimension on $\mathcal C$.

Example Let $\mathcal C$ be a class of graphs and $\phi(x,y)\equiv E(x,y)$. Then $\pi_{E,\mathcal C}(m)$ is the following quantity: the maximum, over all graphs $G\in\mathcal C$ and sets $A\subseteq V(G)$ with $|A|\le m$, of the number of vertices $v\in V(G)$ with pairwise different neighborhoods in $A$. If $\mathcal C$ is the class of bipartite graphs, then $\pi_{E,\mathcal C}(m)=2^m$. If $\mathcal C$ is the class of planar graphs, then $\pi_{E,\mathcal C}(m)=O(m)$.

Fact [Podewski, Ziegler, 1978] The class $\mathcal C$ of planar graphs is NIP. Moreover, for every fixed first-order formula $\phi(x;y)$, we have $\pi_{\phi,\mathcal C}(m)=O(m)$.

This holds more generally for all classes $\mathcal C$ with bounded expansion [Pilipczuk, Siebertz, Toruńczyk, 2018] and for classes of bounded twin-width [Przybyszewski, 2022]. Classes with bounded expansion include classes of bounded tree-width, classes of graphs that can be embedded in a fixed surface, classes of graphs with bounded maximum degree, and classes that exclude a fixed minor.

They are contained in nowhere dense classes. The bound $\pi_{\phi,\mathcal C}(m)=O(m^{1+\varepsilon})$ holds for every nowhere dense class $\mathcal C$ and every fixed $\varepsilon>0$ [PST, 2018].

VC-density

Definition (VC-density). The VC-density of a set system $\mathcal F$ is the infimum of all $r\in\mathbb R$ such that $\pi_{\mathcal F}(m)=O(m^r)$. Similarly, we define the VC-density of a formula $\phi(\bar x;\bar y)$ in a class $\mathcal C$ of structures.

Example Every formula $\phi(\bar x;y)$ (with $y$ a single variable) has VC-density $1$ on the class of planar graphs, and more generally, on all nowhere dense classes.

Theorem (Assouad, 1983) The VC-density of a set system $\mathcal F$ is either $0$ or a real number ${\ge}1$. For every real $\alpha\ge 1$ there is a set system $\mathcal F$ with VC-density $\alpha$.

We will construct a family with VC-density 3/23/2. See Chernikov's notes, Example 2.7. For this we use the following.

Theorem (Kovari, Sos, Turan, 1954) Every $K_{2,2}$-free bipartite graph with $n+n$ vertices has $O(n^{3/2})$ edges.

Example Fix a prime number $p$. For every power $q$ of $p$ there is a field $\mathbb F_q$ with $q$ elements. Consider the set system $\mathcal F_q$ whose domain consists of the points and the lines of the affine plane $\mathbb F_q^2$, and whose sets are the pairs $\{p,\ell\}$ such that the point $p$ lies on the line $\ell$. Then:

  1. $\pi_{\mathcal F_q}(m)=O(m^{3/2})$. Pick an $m$-element subset $A$ of the domain. The restriction $\mathcal F_q\cap A$ consists of: the empty set, at most $m$ singletons, and pairs $\{p,\ell\}$ with $p,\ell\in A$ and $p\in\ell$.

The number of pairs is the number of edges in the bipartite graph $G_A$ whose parts are the points $p\in A$ and the lines $\ell\in A$, and whose edges denote incidence. The graph $G_A$ is $K_{2,2}$-free (two distinct points lie on at most one common line), so it has at most $O(m^{3/2})$ edges. Altogether, $|\mathcal F_q\cap A|\le 1+m+O(m^{3/2})=O(m^{3/2})$.

  2. On the other hand, $|\mathcal F_q|\ge q^3$, while the domain has about $2q^2$ elements (points and lines). Therefore, $|\mathcal F_q|\ge\Omega\big((|\mathrm{points}|+|\mathrm{lines}|)^{3/2}\big)$, so the VC-density of $\mathcal F_q$ is at least $3/2$.
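The construction can be carried out explicitly for a small prime; the encoding of lines by slope/intercept plus vertical lines is our choice:

```python
# Incidences in the affine plane over F_p, p = 5: p^2 points, p^2 + p
# lines, and p^3 + p^2 incident point-line pairs.
p = 5
points = [(x, y) for x in range(p) for y in range(p)]
lines = ([('s', a, b) for a in range(p) for b in range(p)]   # y = a*x + b
         + [('v', c) for c in range(p)])                     # x = c

def incident(pt, ln):
    x, y = pt
    return y == (ln[1] * x + ln[2]) % p if ln[0] == 's' else x == ln[1]

family = [frozenset([pt, ln]) for pt in points for ln in lines
          if incident(pt, ln)]

n = len(points) + len(lines)           # size of the domain
assert len(family) == p ** 3 + p ** 2  # ~ n^{3/2} sets on n elements
```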

VC-theorem and PAC learning

Let $D$ be a fixed domain. For a multiset $S\subseteq D$ and a set $F\subseteq D$, write $Av_S(F)$ to denote the fraction of points of $S$ that belong to $F$:

$$Av_S(F)=\frac{\#\{s\in S:s\in F\}}{|S|}.$$

Definition ($\varepsilon$-approximation). Fix a probability measure $\mu$ on $D$. A multiset $S$ is an $\varepsilon$-approximation for $\mu$ on a (measurable) set $F\subseteq D$ if

$$|\mu(F)-Av_S(F)|\le\varepsilon.$$

A multiset $S$ is an $\varepsilon$-approximation for $\mu$ on a set family $\mathcal F$ with domain $D$ if $S$ is an $\varepsilon$-approximation for $\mu$ on $F$, for each $F\in\mathcal F$.

Recall the weak law of large numbers:

Theorem (Weak law of large numbers) Fix a probability measure $\mu$ on a domain $D$ and a measurable set $F\subseteq D$. For every $\varepsilon>0$ and number $n$, a randomly and independently selected sequence $S$ of $n$ elements of $D$ is an $\varepsilon$-approximation for $\mu$ on $F$ with probability at least $1-\frac1{4n\varepsilon^2}$.

The VC-theorem is a uniform version, for an entire family F\cal F of small VC-dimension. We assume here that the domain is finite. This assumption can be removed at the cost of some additional assumptions.

Theorem (VC-theorem) Fix a probability measure $\mathbb P$ on a finite domain $D$ and a family $\mathcal F$ of subsets of $D$. For every $\varepsilon>0$ and number $n$, a randomly and independently selected sequence of $n$ elements of $D$ is an $\varepsilon$-approximation for $\mathbb P$ on $\mathcal F$ with probability at least

$$1-8\pi_{\mathcal F}(n)\cdot\exp\left(-\frac{n\varepsilon^2}{32}\right).$$

For the proof, see e.g. Chernikov's lecture notes.

Corollary For every fixed $d\in\mathbb N$ and $\varepsilon>0$ there is a number $N(d,\varepsilon)$ such that if $\mathcal F$ has VC-dimension $d$ and $\mathbb P$ is a probability distribution on $D$, then there exists an $\varepsilon$-approximation for $\mathbb P$ on $\mathcal F$ of size at most $N(d,\varepsilon)$. Moreover, we can take

$$N(d,\varepsilon)=O\left(\frac d{\varepsilon^2}\log\left(\frac d\varepsilon\right)\right).$$

Proof of Corollary. By Sauer-Shelah-Perles we have $\pi_{\mathcal F}(n)\le O(n^d)$. The key point is that the failure probability in the VC-theorem,

$$O(n^d)\cdot\exp\left(-\frac{n\varepsilon^2}{32}\right),$$

converges rapidly to $0$ as $n\to\infty$. In particular, we can find $N(d,\varepsilon)$ large enough so that this number is smaller than any fixed constant $c>0$. A more careful analysis shows that $O(\frac d{\varepsilon^2}\log(\frac d\varepsilon))$ suffices. See Chernikov's lecture notes. $\square$
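A small simulation illustrates the corollary for the family of intervals on $[0,1]$ with the uniform measure; all parameters below are our choices, and the supremum over intervals is approximated by a grid of endpoints:

```python
import random

random.seed(1)
n, eps = 2000, 0.1
S = [random.random() for _ in range(n)]          # i.i.d. uniform sample

def av(a, b):
    """Fraction of sample points falling in [a, b]."""
    return sum(1 for s in S if a <= s <= b) / n

# mu([a, b]) = b - a for the uniform measure; measure the discrepancy
# over a grid of candidate intervals.
grid = [i / 50 for i in range(51)]
disc = max(abs((b - a) - av(a, b)) for a in grid for b in grid if a < b)
assert disc <= eps   # S is (with overwhelming probability) an eps-approximation
```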

Let $\mu$ be a probability distribution on (a $\sigma$-algebra on) $D$ and let $\mathcal F$ be a system of (measurable) subsets of $D$.

For ε>0\varepsilon>0, a set SDS\subset D is called a ε\varepsilon-net for μ\mu on F\cal F, if every set FFF\in\cal F with μ(F)ε\mu(F)\ge \varepsilon has a nonempty intersection with SS.

Lemma Every $\varepsilon$-approximation is an $\varepsilon$-net. Proof. For every $F\in\mathcal F$, if $S$ has an empty intersection with $F$ then $Av_S(F)=0$. If $S$ is an $\varepsilon$-approximation, this implies that $\mu(F)\le\varepsilon$. $\square$

Corollary For every fixed $d\in\mathbb N$ and $\varepsilon>0$ there is a number $N(d,\varepsilon)$ such that if $\mathcal F$ has VC-dimension $d$ and $\mathbb P$ is a probability distribution on $D$, then there exists an $\varepsilon$-net for $\mathbb P$ on $\mathcal F$ of size at most $N(d,\varepsilon)$. In fact, a random (with respect to $\mathbb P$) sample of $N(d,\varepsilon)$ elements of $D$ is an $\varepsilon$-net, with probability at least $1-\varepsilon$. Moreover, we can take

$$N(d,\varepsilon)=O\left(\frac d{\varepsilon^2}\log\left(\frac d\varepsilon\right)\right).$$

In fact, for $\varepsilon$-nets (as opposed to $\varepsilon$-approximations) one can improve the bound above and get

$$N(d,\varepsilon)=O\left(\frac d\varepsilon\log\left(\frac d\varepsilon\right)\right).$$

Although we will use this bound in the sequel, the improvement over the previous bound will not be relevant to us.

PAC learning

A concept is a function $f:D\to[0,1]$. A concept class is a family of concepts on $D$.

We restrict attention to $\{0,1\}$-valued concepts (aka sets), which corresponds to a classification problem. So from now on, concept classes are just set families.

A labeled sample is a multiset $S\subseteq D$ with a partition $S=S_+\cup S_-$. Denote by $S_N(D)$ the set of all labeled samples of size $N$. A learning map is a function $f$ that maps every $N$-sample $S\in S_N(D)$ to a generalization $f(S)\subseteq D$. Ideally, $f(S)$ should be consistent with $S$, that is, $S_+=S\cap f(S)$ and $S_-=S\setminus f(S)$, but this does not need to happen (it is even desirable, to allow errors in the labeling of $S$).

Intuition See the papaya intuition from Understanding Machine Learning.

Overfitting If we simply pick $f(S)$ to be $S_+$, the learner memorizes the sample and generalizes poorly. To avoid this, we restrict to some family $\mathcal F$ of concepts that we are willing to consider, and require that the range of $f$ be contained in $\mathcal F$.

Empirical Risk Minimization is the learning approach in which we pick $f(S)$ to be any concept $C\in\mathcal F$ that minimizes the number of errors on the training set, that is, the number of points of $S$ that belong to $S_+\triangle C$. Here $\mathcal F$ is a predefined set family (representing our knowledge of the world).

PAC learnability

Say that $\mathcal F$ is PAC-learnable if for every $\varepsilon,\delta>0$ there exist $N=N(\varepsilon,\delta)$ and a learning map $f:S_N(D)\to 2^D$ such that for every concept $C\in\mathcal F$ and probability distribution $\mu$ on $D$: if we select $N$ elements of $D$ independently with distribution $\mu$ and label them with respect to $C$, obtaining a labeled sample $S=S_+\uplus S_-$, then $\mu(f(S)\triangle C)\le\varepsilon$ with probability at least $1-\delta$.

If moreover $f$ always returns concepts in $\mathcal F$, then we say that $\mathcal F$ is properly PAC-learnable. We say that $\mathcal F$ is consistently PAC-learnable if every learning map $f:S_N(D)\to\mathcal F$ such that $f(S)$ is always a concept of $\mathcal F$ agreeing with $S$ works.

Theorem (Fundamental theorem of PAC learning). Let $\mathcal F$ be a concept class. The following conditions are equivalent:

  1. $\mathcal F$ is PAC-learnable;
  2. $\mathcal F$ has finite VC-dimension.

Remark There are some hidden assumptions in this statement. Either we should assume some measurability conditions, similar to the ones in the VC-theorem for infinite domains, or we should assume that the domain is finite. The equivalence of the two conditions in the theorem should then be read as follows. Bottom-up implication: the sample complexity $N(\varepsilon,\delta)$ can be bounded in terms of $\varepsilon$, $\delta$, and the VC-dimension of the family $\mathcal F$. Top-down implication: the VC-dimension of $\mathcal F$ can be bounded in terms of the function $N(\cdot,\cdot)$; more precisely, in terms of $N(\frac13,\frac13)$.

Proof For the bottom-up implication, we may assume for simplicity (by taking the minimum) that $\delta=\varepsilon$. Let $N=N(d,\varepsilon)$ be the number obtained from the $\varepsilon$-net Corollary.

Let the learning map $f:S_N(D)\to\mathcal F$ be defined so that $f(S_+,S_-)$ is any concept $C\in\mathcal F$ that contains $S_+$ and is disjoint from $S_-$. To prove that this learning map $f$ has the required properties, we consider the set system

$$\mathcal F\triangle C:=\{F\triangle C:F\in\mathcal F\},$$

where $\triangle$ denotes the symmetric difference of sets. It was shown in the exercises that $\mathcal F\triangle C$ has the same VC-dimension as $\mathcal F$. By the $\varepsilon$-net corollary, a random sample of $N$ elements of $D$ is an $\varepsilon$-net for the family $\mathcal F\triangle C$, with probability at least $1-\varepsilon$. This easily implies that the learning map $f$ has the required property.

The top-down implication was shown in the exercises. See also Chernikov's notes. $\square$
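The bottom-up direction can be watched in action for the class of intervals: a consistent learner that outputs the tightest interval around the positive examples generalizes well. A sketch (the target interval and sample size are our arbitrary choices):

```python
import random

random.seed(0)
C = (0.3, 0.7)   # target concept: an interval, used to label the sample
N = 500

def learn(sample):
    """Consistent learner: tightest interval around the positive points."""
    pos = [x for x, label in sample if label]
    return (min(pos), max(pos)) if pos else (0.0, 0.0)

xs = [random.random() for _ in range(N)]         # uniform mu on [0, 1]
sample = [(x, C[0] <= x <= C[1]) for x in xs]
lo, hi = learn(sample)
error = (lo - C[0]) + (C[1] - hi)  # mu(f(S) symmetric-diff C), uniform mu
assert 0 <= error <= 0.05          # small generalization error, w.h.p.
```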

Transversals and Helly-type properties

Transversals

Definition Fix a set system $\mathcal F$ on a domain $D$.

  1. A transversal of $\mathcal F$ is a set $T\subseteq D$ that intersects every set $F\in\mathcal F$. The transversal number $\tau(\mathcal F)$ is the least size of a transversal of $\mathcal F$.
  2. A packing of $\mathcal F$ is a collection of pairwise disjoint sets in $\mathcal F$. The packing number $\nu(\mathcal F)$ is the largest size of a packing of $\mathcal F$.

Clearly $\tau(\mathcal F)\ge\nu(\mathcal F)$. It is NP-hard to compute $\tau(\mathcal F)$ and $\nu(\mathcal F)$. For this reason, we sometimes use the following approximate notions.

Definition Fix a set system $\mathcal F$ on a finite domain $D$.

  1. A fractional transversal is a function $f:D\to[0,1]$ such that for every $F\in\mathcal F$,

$$\sum_{x\in F}f(x)\ge 1.$$

Its total weight is $\sum_{x\in D}f(x)$. The fractional transversal number $\tau^*(\mathcal F)$ is the infimum of the total weights of fractional transversals.

  2. A fractional packing is a function $g:\mathcal F\to[0,1]$ such that for every $x\in D$,

$$\sum_{F\in\mathcal F:\ x\in F}g(F)\le 1.$$

Its total weight is $\sum_{F\in\mathcal F}g(F)$. The fractional packing number $\nu^*(\mathcal F)$ is the supremum of the total weights of fractional packings.

The following lemma is straightforward.

Lemma For every set system $\mathcal F$ on a finite domain $D$ we have:

$$\tau(\mathcal F)\ge\tau^*(\mathcal F)\ge\nu^*(\mathcal F)\ge\nu(\mathcal F).$$

The following fact is a consequence (or restatement) of the duality theorem for linear programs. It is an easy consequence of Farkas' lemma.

Fact For every set system $\mathcal F$ on a finite domain $D$ we have

$$\tau^*(\mathcal F)=\nu^*(\mathcal F).$$

Moreover, this number is rational, computable in polynomial time, and is attained both by a fractional transversal $f:D\to[0,1]$ with rational values and by a fractional packing $g:\mathcal F\to[0,1]$ with rational values.
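A worked example: the triangle, viewed as the set system $\{\{0,1\},\{1,2\},\{0,2\}\}$ on $\{0,1,2\}$, has $\tau=2$ and $\nu=1$, while the constant-$\frac12$ weightings witness $\tau^*=\nu^*=3/2$. A brute-force check:

```python
from itertools import combinations

D = [0, 1, 2]
F = [{0, 1}, {1, 2}, {0, 2}]   # the triangle, viewed as a set system

# Integral transversal number: smallest hitting set.
tau = min(k for k in range(len(D) + 1)
          for T in combinations(D, k)
          if all(set(T) & S for S in F))

# Integral packing number: most pairwise disjoint members.
nu = max(k for k in range(len(F) + 1)
         for P in combinations(range(len(F)), k)
         if all(not (F[i] & F[j]) for i, j in combinations(P, 2)))

# f = 1/2 everywhere is a fractional transversal, g = 1/2 a fractional
# packing; both have total weight 3/2, so tau* = nu* = 3/2 here.
f = {x: 0.5 for x in D}
g = {i: 0.5 for i in range(len(F))}
assert all(sum(f[x] for x in S) >= 1 for S in F)
assert all(sum(g[i] for i, S in enumerate(F) if x in S) <= 1 for x in D)
assert nu <= sum(g.values()) == sum(f.values()) <= tau   # 1 <= 3/2 <= 2
```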

Corollary For every set system $\mathcal F$ on a finite domain $D$ with $\mathrm{VCdim}(\mathcal F)=d$, we have:

$$\tau(\mathcal F)\le O(d)\cdot\tau^*(\mathcal F)\log\tau^*(\mathcal F).$$

Proof Follows from the existence of $\varepsilon$-nets of size $O(d)\frac1\varepsilon\log(\frac1\varepsilon)$. Namely, pick a fractional transversal $f:D\to[0,1]$ of total weight $r$, and normalize it to get a probability measure $\mu$ with $\mu(v)=f(v)/r$. For a set $F\in\mathcal F$ we have $\sum_{v\in F}f(v)\ge 1$ since $f$ is a fractional transversal, so $\mu(F)\ge 1/r$.

Let $\varepsilon=1/r$. By the $\varepsilon$-net theorem there is an $\varepsilon$-net $T$ for $\mu$ on $\mathcal F$ of size $O(d)\cdot r\log r$. Since $\mu(F)\ge 1/r=\varepsilon$ for all $F\in\mathcal F$ and $T$ is an $\varepsilon$-net, it follows that $T$ intersects every $F\in\mathcal F$. $\square$

Helly-type theorems

Fact (Helly's Theorem) Fix $d\ge 1$.

Let $\mathcal F$ be a finite family of convex sets in $\mathbb R^d$. If every $d+1$ sets in $\mathcal F$ have a point in common, then all sets in $\mathcal F$ have a point in common.

Definition A family $\mathcal G$ has the $k$-Helly property if for every finite subfamily $\mathcal F\subseteq\mathcal G$: if every $k$ sets of $\mathcal F$ have a point in common, then all sets of $\mathcal F$ have a point in common.

For instance, the family of convex subsets of $\mathbb R^d$ has the $(d+1)$-Helly property.
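In dimension 1 the 2-Helly property of intervals is easy to test exhaustively: pairwise intersection already forces a common point (the largest left endpoint works). A randomized check (helper names are ours):

```python
import random

def pairwise_intersect(ivs):
    return all(not (b1 < a2 or b2 < a1)
               for i, (a1, b1) in enumerate(ivs)
               for (a2, b2) in ivs[i + 1:])

def have_common_point(ivs):
    # Candidate common point: the largest left endpoint.
    return max(a for a, _ in ivs) <= min(b for _, b in ivs)

# Helly in dimension 1: pairwise intersection iff a common point exists.
random.seed(2)
for _ in range(1000):
    ivs = [tuple(sorted((random.random(), random.random())))
           for _ in range(5)]
    assert pairwise_intersect(ivs) == have_common_point(ivs)
```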

Families of bounded VC-dimension do not necessarily have the $k$-Helly property, for any finite $k$. For example, for a finite set $V$, the family of all sets of the form $V-\{a\}$, for $a\in V$, has VC-dimension $1$; yet when $|V|>k$, every $k$ of its sets have a point in common while the whole family has empty intersection.

However, families of bounded VC-dimension satisfy a related property.

Say that a family $\mathcal F$ has the $(p,q)$-property if among every $p$ sets of $\mathcal F$, some $q$ have a point in common.

Note that the hypothesis of the $k$-Helly property (every $k$ sets have a common point) is exactly the $(k,k)$-property, and the conclusion in Helly's theorem is that $\mathcal F$ has a transversal of size $1$, that is, $\tau(\mathcal F)\le 1$. The following is a generalization.

Fact (Alon, Kleitman). Let $p,q,d$ be integers with $p\ge q\ge d+1$. Then there is a number $N=N(p,q,d)$ such that for every finite family $\mathcal F$ of convex subsets of $\mathbb R^d$: if $\mathcal F$ has the $(p,q)$-property, then $\tau(\mathcal F)\le N$.

This statement generalizes to families of bounded VC-dimension; as observed by Matousek, the generalization can be derived from the proof of Alon and Kleitman.

Fact (Alon-Kleitman, Matousek). Let $p,q,d$ be integers with $p\ge q\ge d+1$. Then there is a number $N=N(p,q,d)$ such that for every finite family $\mathcal F$ with $\mathrm{VCdim}^*(\mathcal F)<d+1$: if $\mathcal F$ has the $(p,q)$-property, then $\tau(\mathcal F)\le N$.

Recall that $\mathrm{VCdim}^*(\mathcal F)$ is the dual VC-dimension of $\mathcal F$, defined as the VC-dimension of $\mathcal F^*$. Equivalently, it is the maximal number $k$ of sets in $\mathcal F$ such that every cell in the Venn diagram of those sets is nonempty (occupied by some element of the domain).

We reprove a duality result which is a corollary of the (p,q)(p,q)-theorem. We present a self-contained proof.

Let $E\subseteq A\times B$ be a binary relation. For $a\in A$ denote $E(a):=\{b\in B\mid(a,b)\in E\}$, and for $b\in B$ denote $E^*(b):=\{a\in A\mid(a,b)\in E\}$. Define the VC-dimension of a binary relation $E$ as the maximum of the VC-dimensions of the two set systems

$$(B,\{E(a)\mid a\in A\})\qquad\text{and}\qquad(A,\{E^*(b)\mid b\in B\}).$$

Here's the duality result.

Theorem (Corollary of the $(p,q)$-theorem) For every $d$ there is a number $k\le O(d)$ with the following property. Let $E\subseteq A\times B$ be a binary relation with $\mathrm{VCdim}(E)\le d$, with $A,B$ finite. Then at least one of the following two cases holds:

  1. there is a multiset $B'\subseteq B$ with $|B'|\le k$ such that $B'\cap E(a)\neq\emptyset$ holds for all $a\in A$;
  2. there is a multiset $A'\subseteq A$ with $|A'|\le k$ such that $A'\not\subseteq E^*(b)$ holds for all $b\in B$.

We first state a corollary of von Neumann's minimax theorem (also of Farkas' lemma, the Hahn-Banach theorem, or of the strong duality theorem for linear programming).

Theorem (Minimax Theorem) Let $E\subseteq A\times B$ be a binary relation with $A,B$ finite, and let $\alpha\in\mathbb R$. Then exactly one of the following two cases holds:

  1. there is some probability distribution $\nu$ on $B$ such that $\nu(E(a))\ge\alpha$ holds for all $a\in A$;
  2. there is some probability distribution $\mu$ on $A$ such that $\mu(E^*(b))<\alpha$ holds for all $b\in B$.

Proof of the Corollary of the $(p,q)$-theorem Set $\alpha:=1/2$ and $\varepsilon:=1/3$; the point is that $0<\alpha\pm\varepsilon<1$. Fix $d\in\mathbb N$, and let $k$ be the number from the VC-theorem, with $k\le O(\frac d{\varepsilon^2}\log(\frac1\varepsilon))=O(d)$. By the Minimax Theorem applied to $\alpha=1/2$, one of the two cases below holds.

Case 1: There is some probability distribution $\nu$ on $B$ such that $\nu(E(a))\ge 1/2$ holds for all $a\in A$.

By the VC-theorem applied with $\varepsilon=1/3$, there is a multiset $B'\subseteq B$ with $|B'|\le k$ such that

$$|\mathrm{Av}_{B'}(E(a))-\nu(E(a))|\le 1/3\qquad\text{for all }a\in A.$$

Pick $a\in A$. Since $\nu(E(a))\ge 1/2$, we have $\mathrm{Av}_{B'}(E(a))>0$, so $B'\cap E(a)$ is nonempty. Therefore, the first condition in the statement is satisfied.

Case 2: There is some probability distribution $\mu$ on $A$ such that $\mu(E^*(b))<1/2$ holds for all $b\in B$.

By the VC-theorem again, there is a multiset $A'\subseteq A$ with $|A'|\le k$ such that

$$|\mathrm{Av}_{A'}(E^*(b))-\mu(E^*(b))|\le 1/3\qquad\text{for all }b\in B.$$

Pick $b\in B$. Since $\mu(E^*(b))<1/2$, we have $\mathrm{Av}_{A'}(E^*(b))<1$, so $A'$ is not contained in $E^*(b)$. Therefore, the second condition in the statement is satisfied. $\blacksquare$

Sample compression schemes

This is based on the original paper by Moran and Yehudayoff.

Fix a set family $\mathcal F$ on a finite domain $D$. Let $S_d(\mathcal F)$ denote the set of labeled samples of size $d$, that is, pairs $S:=(S_+,S_-)$ of multisets of elements of $D$ with $|S_+|+|S_-|=d$, such that there exists some $F\in\mathcal F$ containing $S_+$ and disjoint from $S_-$. For a labeled sample $S$ and a set $Z\subseteq D$, let $S[Z]$ denote the sample $(S_+\cap Z,S_-\cap Z)$. If $S$ and $T$ are two samples, we say that $T$ is a sub-sample of $S$ if $T_+\subseteq S_+$ and $T_-\subseteq S_-$.

Let $S_\omega(\mathcal F)$ denote the union of all $S_d(\mathcal F)$, for $d\ge 0$.

Fix a family $\mathcal F$ of VC-dimension $d$, for some fixed $d$.

A sample compression scheme for $\mathcal F$, of size $k$, consists of two mappings: a compression mapping and a reconstruction mapping, defined as follows.

Compression

The compression map maps each $Y:=(Y_+,Y_-)\in S_\omega(\mathcal F)$ to a pair $(Z,w)$, where $Z$ is a sub-sample of $Y$ of size $k$ and $w\in\{0,1\}^k$ is a bit-sequence describing some auxiliary information.

Reconstruction

A reconstruction function maps a given pair $(Z,w)$, where $Z\in S_k(\mathcal F)$ and $w\in\{0,1\}^k$, to a concept $C\subseteq D$.

We require that the composition of the two mappings (start with $Y\in S_\omega(\mathcal F)$, compress it to some $(Z,w)$, reconstruct some concept $C$) returns a concept $C\subseteq D$ that contains $Y_+$ and is disjoint from $Y_-$.

Recall that $\mathrm{VCdim}^*(\mathcal F)$ is the dual VC-dimension of $\mathcal F$.

Theorem Let $\mathcal F$ be a set system on a finite domain, and let $d=\mathrm{VCdim}(\mathcal F)$ and $d^*=\mathrm{VCdim}^*(\mathcal F)$. Then $\mathcal F$ has a sample compression scheme of size $O(d\cdot d^*)$.

Proof By the VC-theorem, there are $s=O(d)$ and a learning map $f:S_s(\mathcal F)\to\mathcal F$ such that for every $F\in\mathcal F$ and probability distribution $\mu$ on $D$ there is $Z\subseteq\mathrm{supp}(\mu):=\{v\in D\mid\mu(v)>0\}$ with $|Z|\le s$ such that

$$\mu(F\triangle f(F\cap Z,Z-F))\le 1/3.$$

Let (Y+,Y)Sω(F)(Y_+,Y_-)\in S_\omega(\cal F), and let Y=Y+YY=Y_+\cup Y_-.

The key lemma is the following.

Lemma There are $T\le O(d^*)$ sets $Z^1,\ldots,Z^T\subseteq Y_+\cup Y_-$, each of size at most $s$, such that for every $x\in Y_+\cup Y_-$ we have:

$$x\in Y_+\qquad\iff\qquad\#\{t\in\{1,\ldots,T\}\mid x\in f(Y[Z^t])\}>T/2.$$

With this lemma, obtaining the compression scheme is straightforward, as we now describe.

Compression We assume that an arbitrary total order on $D$ is fixed. To compress $(Y_+,Y_-)$, we map it to $(Z,w)$, where $Z$ is the union of the sets from the statement of the lemma, and $w$ is a bit-sequence of length $T$, whose $i$-th bit indicates whether the $i$-th element of $Z$ (in the fixed order on $D$) belongs to $Z^i$. (This needs to be fixed slightly.)

Decompression Given $(Z,w)$, we first compute the sets $Z^1,\ldots,Z^T$, using $Z$ and the bit-sequence $w$ (each $Z^t$ inherits its labels from $Z$). We then define a concept $C$ as the set of all $x\in D$ such that

$$\#\{t\in\{1,\ldots,T\}\mid x\in f(Z[Z^t])\}>T/2.$$

It follows immediately from the lemma that this is a sample compression scheme of size $O(d\cdot d^*)$.

We now prove the lemma.

Proof of lemma Denote

$$\mathcal G:=\mathcal G_Y=\{f(Y_+\cap Z,Y_-\cap Z)\mid Z\subseteq Y,\ |Z|\le s\}\subseteq\mathcal F.$$

For a concept CGC\in \cal G and element xY+Yx\in Y_+\cup Y_-, say that CC is correct on xx if xC    xY+x\in C\iff x\in Y_+ holds.

By definition of ff, for every probability distribution μ\mu on YY there is GGG\in \cal G such that

μ({xYG is correct on x})2/3.\mu(\{x\in Y\mid \text{$G$ is correct on $x$}\})\ge 2/3.

By the Minimax theorem, there is a probability distribution ν\nu on G\cal G such that for every xYx\in Y we have

\nu(\{G\in {\cal G}\mid \text{$G$ is correct on $x$}\})\ge 2/3.

By the VC-theorem applied to the dual set system, there is a $1/8$-approximation for $\nu$ on ${\cal G}^*$: a multiset ${\cal H}\subseteq \cal G$ of size $T\le O(d^*)$ such that for all $x\in Y$ we have:

#{HHH is correct on x}#Hν({GGG is correct on x})182318>12.\frac{\#\{H\in{\cal H}\mid \text{$H$ is correct on $x$} \}}{\#\cal H}\ge \nu(\{G\in{\cal G}\mid \text{$G$ is correct on $x$}\})-\frac 1 8\ge \frac 23-\frac 18>\frac 12.

Taking $Z^1,\ldots,Z^T$ to be sets such that $f(Y[Z^1]),\ldots,f(Y[Z^T])$ enumerate the elements of $\cal H$ (every element of ${\cal H}\subseteq\cal G$ has this form by the definition of $\cal G$), this proves the lemma and concludes the proof of the theorem.$\blacksquare$

Haussler's packing theorem

Fix a domain $D$ and a probability distribution $\mathbb P$ on $D$. For two sets $X,Y\subseteq D$, write $\textup{dist}_{\mathbb P}(X,Y)$ to denote the measure $\mathbb P[X\triangle Y]$ of the symmetric difference of $X$ and $Y$. This defines a pseudometric on $2^D$ (two sets whose symmetric difference has measure zero are at distance $0$).
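For a finite domain this distance is immediate to compute; a minimal sketch (the distribution is represented as a dict from domain elements to probabilities):

```python
def dist(P, X, Y):
    # Measure of the symmetric difference of X and Y under the
    # distribution P (a dict: domain element -> probability).
    return sum(p for v, p in P.items() if (v in X) != (v in Y))

P = {v: 0.25 for v in "abcd"}           # uniform distribution on {a,b,c,d}
print(dist(P, {"a", "b"}, {"b", "c"}))  # symmetric difference {a, c} -> 0.5
```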

Fix $\varepsilon>0$. We are interested in the problem of packing a maximal number of pairwise disjoint (closed) balls of radius $\varepsilon$ in this metric. Equivalently, of finding the maximal size of a $2\varepsilon$-separated subset ${\cal S}\subseteq \cal F$, that is, a set $\cal S$ whose elements have pairwise distance larger than $2\varepsilon$.

Example. Suppose we are packing discs of radius 0<ε<10<\varepsilon<1 with center in the n×nn\times n square in R2\mathbb R^2. For a disc DD with center in [0,n]×[0,n][0,n]\times [0,n], the area of D([0,n]×[0,n])D\cap ([0,n]\times [0,n]) is at least πε2/4=Ω(ε2)\pi\varepsilon^2/4=\Omega(\varepsilon^2). Thus, we can pack at most O((n/ε)2)O((n/\varepsilon)^2) disjoint discs.
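The area argument can be checked numerically. A small sketch: centers on a grid of spacing $2\varepsilon$ give interior-disjoint discs, while the quarter-disc area bound caps how many discs can fit.

```python
import math

def grid_packing(n, eps):
    # Centers at (2*eps*i, 2*eps*j) inside [0, n]^2 give discs of
    # radius eps with pairwise-disjoint interiors.
    k = int(n // (2 * eps)) + 1
    return k * k

def area_upper_bound(n, eps):
    # Each disc with center in the square keeps area >= pi*eps^2/4 inside
    # the square, so at most n^2 / (pi*eps^2/4) = O((n/eps)^2) discs fit.
    return 4 * n * n / (math.pi * eps * eps)

n, eps = 10, 0.25
print(grid_packing(n, eps), math.ceil(area_upper_bound(n, eps)))
```

Both quantities grow like $(n/\varepsilon)^2$; the gap between them is only the constant factor.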

Haussler's packing theorem says that for set systems $\cal F$ of VC-dimension $d$, the behaviour is similar to that of ball packings in $d$-dimensional Euclidean space.

Theorem Let d>1d>1 and CC be constants, and let F\cal F be a set system on a finite domain DD whose primal shatter function satisfies πF(m)Cmd\pi_{\cal F}(m)\le Cm^d for all mNm\in\N. Let 0<ε<10<\varepsilon<1 and fix a probability distribution on DD. If F\cal F is ε\varepsilon-separated then FO(1/εd)|{\cal F}|\le O(1/\varepsilon^d).

Corollary Let $\cal F$ be a set system on a finite domain $D$ with $n=|D|$ elements, and $d=\textup{VCdim}(\cal F)$. Let $k$ be a positive integer. If $|A\triangle B|\ge k$ for all distinct $A,B\in\cal F$ then $|{\cal F}|\le O((n/k)^d)$.

This follows from the theorem applied to ε=k/n\varepsilon=k/n and the uniform distribution on DD, by the Sauer-Shelah-Perles lemma. Conversely, for k=1k=1, the Corollary recovers the SSP lemma, although with worse hidden constants.

Our proof of Haussler's theorem follows a recent proof by Kupavskii and Zhivotovskiy.

We will prove the following lemma, which quickly yields the theorem.

Lemma (\ast). Fix 0<δ<10<\delta<1 and ε>0\varepsilon>0. Let F\cal F be an ε\varepsilon-separated set system on DD, and let μ\mu be a probability distribution on DD. Fix m=2VCdim(F)εδm=\frac {2\text{VCdim}(\cal F)}{\varepsilon\delta} and let SS be a sample of mm independently chosen elements of DD according to the distribution μ\mu. Then

#FES[#FS]1δ.\#{\cal F}\le \frac{\mathbb E_S[\#\cal F|_S]}{1-\delta}.

In particular, if πF(m)Cmd\pi_{\cal F}(m)\le Cm^d for all mm, we have that

#FπF(m)1δ11δCmd=C11δ(2VCdim(F)εδ)d=C1(1δ)δd(2VCdim(F))dεd\#{\cal F}\le \frac{\pi_{\cal F}(m)}{1-\delta}\le \frac{1}{1-\delta}\cdot Cm^d=C\frac{1}{1-\delta}\left(\frac{2\text{VCdim}(\cal F)}{\varepsilon\delta}\right)^d=C\frac{1}{(1-\delta)\delta^d}\cdot \frac {(2\text{VCdim}(\cal F))^d}{\varepsilon^{d}}

holds for all 0<δ<10<\delta<1. The minimum, over 0<δ<10<\delta<1 of 1(1δ)δd\frac{1}{(1-\delta)\delta^d} is attained for δ=d/(d+1)\delta=d/(d+1), and is at most e(d+1)e(d+1). Hence we get
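The optimization over $\delta$ is elementary calculus; a quick numeric sanity check of the claimed minimizer $\delta=d/(d+1)$ and the bound $e(d+1)$:

```python
import math

def g(delta, d):
    # The factor 1 / ((1 - delta) * delta^d) appearing in the bound above.
    return 1.0 / ((1.0 - delta) * delta ** d)

for d in [1, 2, 5, 10]:
    delta_star = d / (d + 1)                 # claimed minimizer
    val = g(delta_star, d)                   # equals (d+1) * (1 + 1/d)**d
    grid_min = min(g(t / 1000, d) for t in range(1, 1000))
    assert grid_min >= val - 1e-9            # no grid point beats delta*
    assert val <= math.e * (d + 1)           # value is at most e(d+1)
    print(d, round(val, 3), round(math.e * (d + 1), 3))
```

The second assertion holds because $(1+1/d)^d\le e$ for all $d\ge 1$.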

#FCe(d+1)(2VCdim(F)ε)dO(1/εd),\#{\cal F}\le Ce(d+1) \left(\frac{2\text{VCdim}(\cal F)}{\varepsilon}\right)^d\le O(1/\varepsilon^d),

which proves the theorem. It therefore remains to prove the lemma.

For a set SDS\subseteq D and FFF\in\cal F denote by FSF|_S the labeled sample FS:=(S+,S)=(FS,SF)F|_S:=(S_+,S_-)=(F\cap S,S-F).

Recall that, given $\varepsilon>0$, the VC-theorem implies the existence of a learning function $f\colon S_m(\cal F)\to 2^D$, where $m=O(d)\frac {1}{\varepsilon}\log\frac{1}{\varepsilon}$, such that for every $F\in\cal F$ and random sample $S$ of $m$ elements of $D$ chosen independently according to a given probability distribution $\mu$, with high probability over the sample we have

\mu[f(F|_S)\triangle F]<\varepsilon.

Using this, it is easy (and instructive) to prove Lemma ($\ast$) with a weaker bound, where instead of $m=\frac {2\text{VCdim}(\cal F)}{\varepsilon\delta}$ we use $m=O(d)\frac 1{\varepsilon}\log \frac 1\varepsilon$. This would yield the bound

#FO(1εlog1ε)d,\#{\cal F}\le O\left(\frac 1{\varepsilon}\log \frac 1\varepsilon\right)^d,

which was obtained originally by Dudley. [To obtain Dudley's bound, proceed as in the proof of Lemma ($\ast$) presented below, but instead of the learning function obtained in the proposition below, use the learning function obtained from the VC-theorem.]

In order to obtain the improved bound as stated in Lemma $(\ast)$, we need a learning function with a number of sample elements of the form $m=O_\delta(\frac 1 \epsilon)$, rather than $m=O_{\delta}(\frac 1\epsilon\log \frac 1\epsilon)$. This may come at the cost of a worse dependency on $\delta$, which does not concern us, since we consider $\delta$ to be fixed, say $\delta=\frac 1 2$. This is achieved by the following proposition.

Proposition Let F\cal F be a set system on a domain DD. Fix m=2VCdim(F)εδm=\frac{2\text{VCdim}(\cal F)}{\varepsilon\delta}. There is a "learning function" f ⁣:Sm(F)2Df\colon S_m(\cal F)\to 2^D such that the following holds. For every FFF\in\cal F and random sample SS of mm elements of DD chosen independently with a given probability distribution μ\mu, we have that

μ[f(FS)F]<ε2\mathbb \mu[f(F|_S)\triangle F]<\frac\varepsilon 2

holds with probability at least 1δ1-\delta over the sample.

The following lemma is key. Given a set system F\cal F, consider its Hamming graph HFH_{\cal F}, with vertices F\cal F and edges connecting two sets F,GFF,G\in \cal F if and only if FG=1|F\triangle G|=1. The edge density of a graph (V,E)(V,E) is E/V|E|/|V|.

Lemma The Hamming graph HFH_{\cal F} of F\cal F has edge density at most VCdim(F)\textup{VCdim}(\cal F).

The proof of the lemma is by repeating the proof of the Sauer-Shelah-Perles lemma and observing that it also yields this conclusion. See also Geometric Discrepancy by Matousek, Lemma 5.15.

Since for every subfamily ${\cal F}'\subseteq \cal F$ the graph $H_{{\cal F}'}$ is exactly the subgraph of $H_{\cal F}$ induced by ${\cal F}'$, and $\textup{VCdim}({\cal F}')\le \textup{VCdim}(\cal F)$, the same bound holds for every induced subgraph of $H_{\cal F}$.
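As a sanity check (not part of the proof), one can verify the density bound on a concrete family, here all intervals over a 6-element domain, whose VC-dimension is 2:

```python
from itertools import combinations

def vc_dim(family, domain):
    # Brute-force VC dimension: largest k such that some k-subset of the
    # domain is shattered by the family.
    best = 0
    for k in range(1, len(domain) + 1):
        if any(len({F & frozenset(Y) for F in family}) == 2 ** k
               for Y in combinations(domain, k)):
            best = k
        else:
            break
    return best

domain = list(range(6))
# All intervals [i, j) over the domain (the empty set arises when i == j):
family = list({frozenset(range(i, j)) for i in range(7) for j in range(i, 7)})

# Edges of the Hamming graph: pairs of sets with symmetric difference 1.
edges = [(F, G) for F, G in combinations(family, 2) if len(F ^ G) == 1]
density = len(edges) / len(family)
print(density, vc_dim(family, domain))  # density <= VC-dimension = 2
```

Here the family has 22 sets and 36 Hamming edges, so the density is about 1.64, below the VC-dimension 2 as the lemma predicts.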

Fact Let GG be a graph such that every induced subgraph HH of GG has edge density at most dd. Then there is an orientation of the edges of GG such that every node has outdegree at most dd.

Remark This is very easy to prove with 2d2d instead of dd: repeatedly remove a vertex vv of degree at most 2d2d, and orient its edges towards the remaining vertices. This can be improved to dd using Hall's marriage theorem (see Alon, Tarsi '92).
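The easy $2d$ version of the remark can be sketched directly; `orient` below assumes the density hypothesis, which guarantees that a vertex of degree at most $2d$ always exists among the remaining vertices:

```python
def orient(vertices, edges, d):
    # Repeatedly remove a vertex of degree <= 2*d (one exists, since edge
    # density <= d forces average degree <= 2*d in every induced subgraph)
    # and orient its remaining edges away from it.
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    oriented, remaining = [], set(vertices)
    while remaining:
        v = next(u for u in remaining if len(adj[u]) <= 2 * d)
        for w in adj[v]:
            oriented.append((v, w))  # orient v -> w
            adj[w].discard(v)
        del adj[v]
        remaining.discard(v)
    return oriented

# A 4-cycle has edge density at most 1 in every induced subgraph, so with
# d = 1 every vertex ends up with outdegree at most 2:
out = orient(range(4), [(0, 1), (1, 2), (2, 3), (3, 0)], 1)
```

Each edge is oriented exactly once, at the moment its earlier-removed endpoint is deleted; the outdegree of a vertex is its degree at removal time, hence at most $2d$.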

Corollary There is an orientation of the edges of HFH_{\cal F} such that every node has outdegree at most VCdim(F)\textup{VCdim}(\cal F).

Proof of Proposition Let d0=VCdim(F)d_0=\text{VCdim}(\cal F). Fix a probability distribution μ\mu on DD. We will show that there is a function f ⁣:Sm(F)2Df\colon S_m(\cal F)\to 2^D such that

ES[μ(f(FS)F)]<d0m\mathbb E_S\left[ \mu(f(F|_S)\triangle F)\right]<\frac {d_0} m

where the expectation is with respect to SS.

Markov's inequality then gives the conclusion:

PS[μ(f(FS)F)ε2]ES[μ(f(FS)F)]ε/2<2d0mε=δ.\mathbb P_S\left[\mu(f(F|_S)\triangle F)\ge \frac \varepsilon 2\right]\le \frac {\mathbb E_S[\mu(f(F|_S)\triangle F)]}{\varepsilon/2}<\frac {2d_0}{m\varepsilon}=\delta.

We proceed to define the learning function $f$. Fix a set $S=\{s_1,\ldots,s_m\}$. Suppose that we are given the labelling $F|_S$