Selected topics in combinatorics revolving around Vapnik-Chervonenkis dimension

Spring 2023

Lecturer: Szymon Toruńczyk

What is this lecture about? A big portion of the lectures will be concerned with the Vapnik-Chervonenkis dimension. This fundamental notion has numerous applications in computational learning theory, combinatorics, logic, computational geometry, and theoretical computer science. We will develop the foundations of the theory and study its applications.

Outline

  1. Set systems, Vapnik-Chervonenkis dimension, Sauer-Shelah-Perles lemma
  2. Vapnik-Chervonenkis theorem, fundamental theorem of PAC learning, ε-nets
  3. Helly-type properties, (p,q)-theorem
  4. Sample compression schemes
  5. Haussler's packing theorem
  6. Zarankiewicz's problem, Kővári–Sós–Turán theorem, incidence geometry, Szemerédi–Trotter theorem, Kővári–Sós–Turán for bounded VC-dimension
  7. Matchings with low crossing number
  8. Geometric range searching
  9. Szemerédi regularity
  10. Regularity theorem for graphs of bounded VC-dimension
  11. Regularity theorem for stable graphs
  12. Permutation classes.

Literature

Additional sources

Specific topics

VC-dimension

Set systems

Definition: A set system consists of a domain $X$ (a set) and a family $\mathcal F\subseteq P(X)$ of subsets of $X$. For $Y\subseteq X$, write $\mathcal F\cap Y$ for the set system with domain $Y$ and family $\{F\cap Y: F\in\mathcal F\}$.

Remark A set system is the same as a hypergraph.

Examples:

Usually X=FX=\bigcup\cal F, then we just mention F\cal F.

VC-dimension and shatter function

The goal is to define a measure of complexity of a set system. Machine learning intuition: understand which concept classes are learnable. When does Ockham's razor work well?

Definition (VC-dimension) [Vapnik, Chervonenkis, 71] Let $\mathcal F$ be a set system with domain $X$. A set $Y\subseteq X$ is shattered by $\mathcal F$ if ${\mathcal F}\cap Y=2^Y$. The VC-dimension of $\mathcal F$ is the maximum size $d$ of a set $Y\subseteq X$ shattered by $\mathcal F$, or $\infty$ if arbitrarily large sets are shattered.

Definition (Shatter function). Let $\mathcal F$ be a set system on $X$. Let

$$\pi_{\mathcal F}(m)=\max_{Y\subseteq X,\ |Y|\le m}|{\mathcal F}\cap Y|.$$
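To make these definitions concrete, here is a small brute-force sketch (our own illustration, not part of the lecture) that computes the shatter function and the VC-dimension of a finite set system; the example system of discrete intervals is chosen only for illustration.

```python
from itertools import combinations

def traces(family, Y):
    """All distinct traces F ∩ Y of sets from the family on Y."""
    return {frozenset(F & Y) for F in family}

def shatter_fn(family, domain, m):
    """pi_F(m): the largest number of traces on an m-element subset of the domain."""
    return max(len(traces(family, set(Y))) for Y in combinations(domain, m))

def vc_dim(family, domain):
    """Largest d such that some d-element subset is shattered (all 2^d traces occur)."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(len(traces(family, set(Y))) == 2 ** k for Y in combinations(domain, k)):
            d = k
    return d

# Discrete intervals {a, a+1, ..., b} over a small domain: VC-dimension 2.
domain = range(8)
intervals = [frozenset(range(a, b + 1)) for a in domain for b in domain if a <= b]
print(vc_dim(intervals, domain))                                 # 2
print([shatter_fn(intervals, domain, m) for m in range(1, 5)])   # [2, 4, 7, 11]
```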

Examples

Observation If F\cal F has infinite VC-dimension then πF(m)=2m\pi_{\cal F}(m)=2^m for all mm.

Definition (Dual set system). Let F\cal F be a set system with domain XX. The dual system F\cal F^* is the system with domain F\cal F and family {x:xX}\{x^*:x\in X\}, where x={FF:xF}Fx^*=\{F\in\cal F: x\in F\}\subset\cal F.

Another view on set systems: bipartite graphs (L,R,E)(L,R,E), EL×RE\subseteq L\times R. (Multiplicities are ignored). Dual: (R,L,E1)(R,L,E^{-1}). In particular, (F)F(\cal F^*)^*\cong\cal F, up to removing twins (elements in the domain that belong to exactly the same sets).

$\pi_{\mathcal F^*}$ is denoted $\pi^*_{\mathcal F}$, and is called the dual shatter function of $\mathcal F$. $VC(\mathcal F^*)$ is denoted $VC^*(\mathcal F)$, and is called the dual VC-dimension of $\mathcal F$.

VC(F)VC(\cal F^*) is the maximal number of sets in F\cal F such that every cell in the Venn diagram is occupied by some element of the domain. πF(m)\pi^*_{\cal F}(m) is the maximal number of occupied cells in the Venn diagram of m{\le}m sets from F\cal F.

Example If F\cal F is the family of half-planes in R2\mathbb R^2, then πF(m)\pi^*_{\cal F}(m) is the maximal number of regions into which mm half-planes can partition the plane. For example, πF(2)=4\pi^*_{\cal F}(2)=4 and πF(3)=7\pi^*_{\cal F}(3)=7.

Exercise Prove that πF(m)=(m+12)+1\pi^*_{\cal F}(m)={m+1\choose 2}+1.

Exercise If VC(F)dVC(\cal F)\le d then VC(F)<2d+1VC^*(\cal F)< 2^{d+1}.

In all examples, we have either a polynomial, or an exponential shatter function – nothing in between.

Sauer-Shelah-Perles Lemma. [Sauer (72), Shelah and Perles (72), Vapnik-Chervonenkis (71)] Let F\cal F be a set system on nn elements of VC-dimension at most dd. Then

$$|\mathcal F|\le {n\choose 0}+\ldots+{n \choose d}=:\Phi_d(n).$$
It follows that for all $m$,

$$\pi_{\mathcal F}(m)\le \Phi_d(m)\le \left(\frac{em}{d}\right)^d\le O(m^d).$$

Note. This bound is tight. Consider the set system F\cal F of all d\le d-element subsets of an infinite set XX. Clearly πF(m)=Φd(m)\pi_{\cal F}(m)=\Phi_d(m). So the bound is tight.

Proof. Proof by "shifting", due to Alon (83) and Frankl (83): see Chernikov's notes. Inductive proof: see Geometric Discrepancy by Matousek, Lemma 5.9.

Exercise For every element $v$ in the domain, define a (partial) matching $M_v^{\mathcal F}$ on $\mathcal F$ by

$$M_v^{\mathcal F}=\{\{A,B\}:A,B\in \mathcal F,\ A\triangle B=\{v\}\}.$$
Show that if for every subset $X$ of the domain there is $v\in X$ with $|M_v^{\mathcal F|_X}|\le k$, then $|\mathcal F|\le O(k\cdot n)$, where $n$ is the size of the domain.

Set systems in logical structures

Let MM be a logical structure and ϕ(xˉ;yˉ)\phi(\bar x;\bar y) be a formula. For a tuple bˉMyˉ\bar b\in M^{\bar y} denote

ϕ(M;bˉ):={aˉMxˉ:ϕ(aˉ,bˉ)}.\phi(M;\bar b):=\{\bar a\in M^{\bar x}: \phi(\bar a,\bar b)\}.

Consider the set system Fϕ\cal F_\phi with domain MxˉM^{\bar x} and sets ϕ(M;bˉ)\phi(M;\bar b), for bˉMyˉ\bar b\in M^{\bar y}.

Examples

ϕ(x;y1,y2)y1xy2.\phi(x;y_1,y_2)\equiv y_1\le x\le y_2.

Then Fϕ\cal F_\phi is the set system of closed intervals on R\mathbb R.

$$\phi(x_1,x_2;y_1,y_2,y_3)\equiv (x_1-y_1)^2+(x_2-y_2)^2\le y_3^2.$$

Then $\mathcal F_\phi$ is the set system of discs in $\mathbb R^2$.

ϕ(x1,x2;y1,y2,y3,y4)y1x1y2y3x2y4.\phi(x_1,x_2;y_1,y_2,y_3,y_4)\equiv y_1\le x_1\le y_2\land y_3\le x_2\le y_4.

$\mathcal F_\phi$ is the set system of axis-parallel rectangles in $\mathbb R^2$.

A first-order formula is a formula using ,,,,¬\exists, \forall, \lor,\land,\neg, where the quantifiers range over elements of the domain.

Example ϕ(x,y)z.y=x+z2\phi(x,y)\equiv \exists z.y=x+z^2 is equivalent in (R,+,,,0,1)(\mathbb R,+,\cdot,\le,0,1) to xyx\le y. So \le is definable in (R,+,)(\mathbb R,+,\cdot). The relation y=exy=e^x is not definable in (R,+,)(\mathbb R,+,\cdot).

Example The formula δ1(x,y)(x=y)E(x,y)\delta_1(x,y)\equiv (x=y) \lor E(x,y) expresses that dist(x,y)1dist(x,y)\le 1, and the formula δ2(x,y)x=yE(x,y)z.E(x,z)E(z,y)\delta_2(x,y)\equiv x=y \lor E(x,y)\lor \exists z. E(x,z)\land E(z,y) expresses that dist(x,y)2dist(x,y)\le 2, and similarly we can define δr(x,y)\delta_r(x,y) expressing dist(x,y)rdist(x,y)\le r, for any fixed number rr. There is no first-order formula that expresses the property `xx and yy are at a finite distance' in all graphs. This can be expressed by a formula of monadic second-order logic (MSO).

Theorem (Shelah, 71) Let MM be a logical structure. The following conditions are equivalent:

If the above conditions hold, then MM is called NIP.

Example (R,+,,)(\mathbb R,+,\cdot,\le) is NIP. We use the result of Tarski:

Theorem (Tarski, 40) The structure (R,+,,,0,1)(\mathbb R,+,\cdot,\le,0,1) has quantifier elimination: every first-order formula ϕ(x1,,xk)\phi(x_1,\ldots,x_k) is equivalent to a formula without quantifiers.

Such formulas define precisely the boolean combinations of sets of the form:

$$\{(a_1,\ldots,a_k)\in \mathbb R^k: p(a_1,\ldots,a_k)\ge 0\}$$

or

$$\{(a_1,\ldots,a_k)\in \mathbb R^k: p(a_1,\ldots,a_k)= 0\}$$

where $p(x_1,\ldots,x_k)$ is a polynomial with integer coefficients, such as $x_1^2+x_2^2-5$.

Corollary (Tarski) For every first-order formula $\phi(x;y_1,\ldots,y_k)$ and parameters $b_1,\ldots,b_k\in\mathbb R$, the set $\phi(\mathbb R;b_1,\ldots,b_k)\subseteq\mathbb R$ is a union of at most $k_\phi$ intervals, where $k_\phi$ depends only on $\phi$ (and not on $b_1,\ldots,b_k$).

Proof. We may assume that $\phi$ is a boolean combination of $\ell$ sets as above. Let $d$ be the maximum degree of the polynomials involved. Each polynomial of degree $d$ with one free variable defines a set that is a union of at most $d$ intervals. A boolean combination of $\ell$ such sets has its endpoints among the at most $2\ell d$ endpoints of those intervals, so it is a union of at most $k_\phi:=2\ell d+1$ intervals.

Corollary The structure (R,,+,,0,1)(\mathbb R,\cdot,+,\le,0,1) is NIP.

Proof. The family of all sets that are unions of kk intervals cannot shatter a set of 2k+12k+1 points. The conclusion follows from the result of Shelah.

Definition A structure MM with a total order << and other relations/functions is o-minimal if for every first-order formula ϕ(x;y1,,yk)\phi(x;y_1,\ldots,y_k) and parameters b1,,bkb_1,\ldots,b_k, the set ϕ(M;b1,,bk)\phi(M;b_1,\ldots,b_k) is a union of finitely many intervals.

It is known that this implies that then it is a union of at most nϕn_\phi intervals, for some nϕn_\phi that depends only on ϕ\phi.

Corollary Every o-minimal structure is NIP.

Examples of o-minimal structures

Other examples of NIP structures

Classes of structures

Let $\mathcal C$ be a class of logical structures, for example the class of all planar graphs. A formula $\phi(\bar x;\bar y)$ defines a set system ${\mathcal F}^M_{\phi}$ in each structure $M\in\mathcal C$. We say that $\phi$ has VC-dimension at most $d$ in $\mathcal C$ if each of the set systems ${\mathcal F}^M_\phi$ has VC-dimension at most $d$. The shatter function $\pi_{\phi,\mathcal C}(m)$ is defined as the maximum of $\pi_{{\mathcal F}^M_\phi}(m)$, over all $M\in\mathcal C$. A class $\mathcal C$ is NIP if every formula $\phi(\bar x;\bar y)$ has bounded VC-dimension on $\mathcal C$.

Example Let $\mathcal C$ be a class of graphs and $\phi(x,y)\equiv E(x,y)$. Then $\pi_{E,\mathcal C}(m)$ is the following quantity: the maximum, over all graphs $G\in\mathcal C$ and sets $A\subseteq V(G)$ with $|A|\le m$, of the number of distinct traces $N(v)\cap A$ of vertex neighborhoods on $A$, for $v\in V(G)$. If $\mathcal C$ is the class of bipartite graphs, then $\pi_{E,\mathcal C}(m)=2^m$. If $\mathcal C$ is the class of planar graphs, then $\pi_{E,\mathcal C}(m)=O(m)$.

Fact The class C\cal C of planar graphs is NIP. [Podewski, Ziegler, 1978] Moreover, for every fixed first-order formula ϕ(x;y)\phi(x;y), we have πϕ,C(m)=O(m)\pi_{\phi,\cal C}(m)=O(m)

This holds more generally for all classes C\cal C with bounded expansion [Pilipczuk, Siebertz, Toruńczyk, 2018] and for classes of bounded twin-width [Przybyszewski, 2022]. Classes with bounded expansion include classes of bounded tree-width, classes of graphs that can be embedded in a fixed surface, classes of graphs with bounded maximum degree, and classes that exclude a fixed minor.

They are contained in nowhere dense classes. πϕ,C(m)=O(m1+ε)\pi_{\phi,\cal C}(m)=O(m^{1+\varepsilon}) holds for every nowhere dense class C\cal C and fixed ε>0\varepsilon>0 [PST, 2018].

VC-density

Definition (VC-density). The VC-density of a set system F\cal F is the infimum of all rRr\in\mathbb R such that πF(m)=O(mr)\pi_{\cal F}(m)=O(m^r). Similarly, we define the VC-density of a formula ϕ(xˉ;yˉ)\phi(\bar x;\bar y) in a class C\cal C of structures.

Example Every formula ϕ(xˉ;y)\phi(\bar x;y) (with yy singleton) has VC-density 11 on the class of planar graphs, and more generally, on all nowhere dense classes.

Theorem (Assouad, 1983) The VC-density of any set system F\cal F is either 00, or a real number 1{\ge }1. For every real α1\alpha\ge 1 there is a set system F\cal F with VC-density α\alpha.

We will construct a family with VC-density 3/23/2. See Chernikov's notes, Example 2.7. For this we use the following.

Theorem (Kovari, Sos, Turan, 1954) Every K2,2K_{2,2}-free bipartite graph with n+nn+n vertices has O(n3/2)O(n^{3/2}) edges.

Example Fix a prime number $p$. For every power $q$ of $p$ there is a field $\mathbb F_q$ with $q$ elements. Let $P_q=\mathbb F_q^2$ be the set of points and $L_q$ the set of lines of the form $y=ax+b$, with $a,b\in\mathbb F_q$. Consider the set system $\mathcal F_q$ with domain $P_q\cup L_q$ whose sets are the incident point-line pairs $\{p,\ell\}$, with $p\in P_q$, $\ell\in L_q$ and $p\in\ell$. Then:

  1. $\pi_{\mathcal F_q}(m)=O(m^{3/2})$. Pick an $m$-element subset $A$ of the domain. The restriction $\mathcal F_q\cap A$ consists of the empty set, at most $m$ singletons, and the pairs $\{p,\ell\}\subseteq A$ with $p\in\ell$.

The number of pairs is the number of edges in the bipartite graph $G_A$ whose parts are the points $p\in A$ and the lines $\ell \in A$, and whose edges denote incidence. The graph $G_A$ is $K_{2,2}$-free, so it has at most $O(m^{3/2})$ edges. Altogether, $|{\mathcal F}_q\cap A|\le 1+m+O(m^{3/2})=O(m^{3/2})$.

On the other hand, $|{\mathcal F}_q|\ge q^3$ and there are at most $2q^2$ points and lines. Therefore, $|{\mathcal F}_q|\ge \Omega\big((|P_q|+|L_q|)^{3/2}\big)$.

VC-theorem and PAC learning

Let $D$ be a fixed domain. For a multiset $S\subset D$ and a set $F\subset D$, write $Av_S(F)$ to denote the fraction of points of $S$ that belong to $F$:

$$Av_S(F)=\frac {\#\{s\in S: s\in F\}}{|S|}.$$

Definition (ε\varepsilon-approximation). Fix a probability measure μ\mu on DD. A multiset SS is an ε\varepsilon-approximation for μ\mu on a (measurable) set FDF\subset D if

μ(F)AvS(F)ε. |\mu(F)-Av_S(F)|\le \varepsilon.

A multiset $S$ is an $\varepsilon$-approximation for $\mu$ on a set family $\mathcal F$ with domain $D$ if $S$ is an $\varepsilon$-approximation for $\mu$ on $F$, for each $F\in\mathcal F$.

Recall the weak law of large numbers:

Theorem (Weak law of large numbers) Fix a probability measure $\mu$ on a domain $D$ and a measurable set $F\subset D$. For every $\varepsilon>0$ and number $n$, a sequence $S$ of $n$ elements of $D$ selected randomly and independently according to $\mu$ is an $\varepsilon$-approximation for $\mu$ on $F$ with probability at least $1-\frac 1 {4n\varepsilon^2}$.

The VC-theorem is a uniform version, for an entire family F\cal F of small VC-dimension. We assume here that the domain is finite. This assumption can be removed at the cost of some additional assumptions.

Theorem (VC-theorem) Fix a probability measure $\mathbb P$ on a finite domain $D$ and a family $\mathcal F$ of subsets of $D$. For every $\varepsilon>0$ and number $n$, a sequence of $n$ elements of $D$ selected randomly and independently according to $\mathbb P$ is an $\varepsilon$-approximation for $\mathbb P$ on $\mathcal F$ with probability at least

18πF(n)exp(nε232).1-8\pi_{\cal F}(n)\cdot \exp({-\frac {n\varepsilon^2}{32}}).

For the proof, see e.g. Chernikov's lecture notes.

Corollary For every fixed dNd\in\N and ε>0\varepsilon>0 there is a number N(d,ε)N(d,\varepsilon) such that if F\cal F has VC-dimension dd and P\mathbb P is a probability distribution on DD, then there exists an ε\varepsilon-approximation for P\mathbb P on F\cal F of size at most N(d,ε)N(d,\varepsilon). Moreover, we can take

N(d,ε)=O(dε2log(dε)).N(d,\varepsilon)=O(\frac{d}{\varepsilon^2}\log \left(\frac d\varepsilon\right)).

Proof of Corollary. By Sauer-Shelah-Perles we have that $\pi_{\mathcal F}(n)\le O(n^d)$. The key point is that the failure probability in the VC-theorem,

$$8\pi_{\mathcal F}(n)\cdot\exp\left(-\frac {n\varepsilon^2}{32}\right)\le O(n^d)/ \exp\left(\frac {n\varepsilon^2}{32}\right),$$

converges rapidly to $0$ as $n\rightarrow\infty$. In particular, we can take $N(d,\varepsilon)$ large enough so that this probability drops below $1$, and then an $\varepsilon$-approximation of size $N(d,\varepsilon)$ exists. A more careful analysis shows that $O(\frac{d}{\varepsilon^2}\log \left(\frac d\varepsilon\right))$ suffices. See Chernikov's lecture notes. $\square$

Let μ\mu be a probability distribution on (a σ\sigma-algebra on) DD and let F\cal F be a system of (measurable) subsets of DD.

For $\varepsilon>0$, a set $S\subset D$ is called an $\varepsilon$-net for $\mu$ on $\mathcal F$ if every set $F\in\mathcal F$ with $\mu(F)\ge \varepsilon$ has a nonempty intersection with $S$.

Lemma Every $\varepsilon$-approximation is an $\varepsilon$-net. Proof. For every $F\in \mathcal F$, if $S$ has an empty intersection with $F$ then $Av_S(F)=0$. If $S$ is an $\varepsilon$-approximation, this implies that $\mu(F)\le \varepsilon$. $\square$

Corollary For every fixed dNd\in\N and ε>0\varepsilon>0 there is a number N(d,ε)N(d,\varepsilon) such that if F\cal F has VC-dimension dd and P\mathbb P is a probability distribution on DD, then there exists an ε\varepsilon-net for P\mathbb P on F\cal F of size at most N(d,ε)N(d,\varepsilon). In fact, a random (with probability P\mathbb P) sample of N(d,ε)N(d,\varepsilon) elements of DD is an ε\varepsilon-net, with probability at least 1ε1-\varepsilon. Moreover, we can take

$$N(d,\varepsilon)=O\left(\frac{d}{\varepsilon^2}\log \left(\frac d\varepsilon\right)\right).$$

In fact, for ε\varepsilon-nets (as opposed to ε\varepsilon-approximations) one can improve the bound above and get

$$N(d,\varepsilon)=O\left(\frac{d}{\varepsilon}\log \left(\frac d\varepsilon\right)\right).$$

Although we will use this bound in the sequel, the improvement over the previous bound will not be relevant to us.
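As a quick numerical illustration of the $\varepsilon$-net corollary (our own sketch, with arbitrarily chosen parameters, not part of the lecture): for intervals on $[0,1]$ under the uniform measure, whose VC-dimension is $2$, a modest random sample almost always hits every interval of measure at least $\varepsilon$.

```python
import random

def max_gap(sample):
    """Largest gap left by the sample in [0, 1]; the sample is an eps-net for
    intervals (w.r.t. the uniform measure) iff every interval of length eps
    contains a sample point, i.e. iff max_gap(sample) < eps (up to boundary cases)."""
    pts = sorted(sample)
    gaps = [pts[0]] + [b - a for a, b in zip(pts, pts[1:])] + [1 - pts[-1]]
    return max(gaps)

random.seed(0)
eps, trials, n = 0.1, 1000, 200   # n is of the order (d/eps) log(1/eps), with d = 2
failures = sum(max_gap([random.random() for _ in range(n)]) >= eps for _ in range(trials))
print(failures / trials)          # small failure probability, as the corollary predicts
```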

PAC learning

A concept is a function f:D[0,1]f: D\to[0,1]. A concept class is a family of concepts on DD.

We restrict to {0,1}\{0,1\}-valued functions (aka sets), which can be seen as a classification problem. So now concept classes are just set families.

A labeled sample is a multiset SDS\subset D with a partition S=S+SS=S_+\cup S_-. Denote by SN(D)S_N(D) the set of all labeled samples of size NN. A learning map is a function ff that maps every NN-sample SSN(D)S\in S_N(D) to a generalization f(S)Df(S)\subset D. Ideally, f(S)f(S) should be consistent with SS, that is, S+=Sf(S)S_+=S\cap f(S) and S=Sf(S)S_-=S\setminus f(S), but this does not need to happen (it is even desirable to allow errors in the labelling of SS).

Intuition See the papaya intuition from Understanding Machine Learning.

Overfitting The trivial learning map picks $f(S)$ to be $S_+$; it is consistent with the sample but generalizes poorly. To avoid this, we restrict to some family $\mathcal F$ of concepts that we are willing to consider, and require that the range of $f$ be contained in $\mathcal F$.

Empirical Risk Minimization is the learning approach in which we pick $f(S)$ to be any concept $C\in\mathcal F$ that minimizes the number of errors on the training set: the number of points of $S$ that belong to $S_+\triangle C$ is as small as possible. Here $\mathcal F$ is a predefined set family (representing our knowledge of the world).
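As a toy illustration of Empirical Risk Minimization (a sketch of ours; the concept class of intervals and all names below are chosen only for illustration), ERM enumerates the finitely many interval behaviours on the sample and returns one with the fewest errors.

```python
def erm_intervals(sample):
    """ERM for the class of closed intervals [a, b] on the reals.
    sample: list of (x, label) pairs with label in {0, 1}.
    It suffices to consider intervals with endpoints among the sample points
    (plus the empty concept), since these realize all behaviours on the sample."""
    xs = sorted({x for x, _ in sample})
    def errors(a, b):
        return sum((a <= x <= b) != bool(y) for x, y in sample)
    candidates = [(a, b) for a in xs for b in xs if a <= b]
    best = min(candidates, key=lambda ab: errors(*ab), default=None)
    empty_errors = sum(y for _, y in sample)   # the empty concept misclassifies all positives
    if best is None or empty_errors <= errors(*best):
        return None                            # the empty concept
    return best

print(erm_intervals([(0.1, 0), (0.3, 1), (0.5, 1), (0.7, 0), (0.9, 0)]))  # (0.3, 0.5)
```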

PAC learnability

Say that $\mathcal F$ is PAC-learnable if for every $\varepsilon,\delta>0$ there exists $N=N(\varepsilon,\delta)$ and a learning map $f:S_N(D)\to 2^D$ such that for every concept $C\in \mathcal F$ and probability distribution $\mu$ on $D$, if we select $N$ elements of $D$ independently with distribution $\mu$ and label them according to $C$, obtaining a labeled sample $S=S_+\uplus S_-$, then $\mu(f(S)\triangle C)\le \varepsilon$ holds with probability at least $1-\delta$.

If moreover $f$ always returns concepts in $\mathcal F$, then we say that $\mathcal F$ is properly PAC learnable. We say that $\mathcal F$ is consistently PAC learnable if every learning map $f:S_N(D)\to \mathcal F$ that always returns a concept consistent with the input sample has the above property.

Theorem (Fundamental theorem of PAC learning). Let $\mathcal F$ be a concept class. The following conditions are equivalent:

  1. $\mathcal F$ has finite VC-dimension,
  2. $\mathcal F$ is PAC-learnable.

Remark There are some unstated assumptions in this statement. Either we should assume some measurability conditions, similar to the ones in the VC theorem for infinite domains, or we should assume that the domain is finite. The equivalence of the two conditions in the theorem should be read as follows. Bottom-up implication: the sample complexity $N(\varepsilon,\delta)$ can be bounded in terms of $\varepsilon,\delta$ and the VC-dimension of the family $\mathcal F$. Top-down implication: the VC-dimension of $\mathcal F$ can be bounded in terms of the function $N(\cdot,\cdot)$. More precisely, we can bound it in terms of $N(\frac 1 3,\frac 1 3)$.

Proof For the bottom-up implication, we may assume for simplicity (by taking the min) that δ=ε\delta=\varepsilon. Let N=N(d,ε)N=N(d,\varepsilon) be the number obtained from the ε\varepsilon-net Corollary.

Let the learning map $f:S_N(D)\to \mathcal F$ be defined so that $f(S_+,S_-)$ is any concept $C\in\mathcal F$ that contains $S_+$ and is disjoint from $S_-$. To prove that this learning map $f$ has the required properties, we consider the set system

FC:={FC:FF},{\cal F}\triangle C:=\{F\triangle C: F\in \cal F\},

where \triangle denotes the symmetric difference of sets. It was shown in the exercises that FC\cal F\triangle C has the same VC-dimension as F\cal F. By the ε\varepsilon-net corollary, a random sample of NN elements of DD is an ε\varepsilon-net for the family FC\cal F\triangle C, with probability at least 1ε1-\varepsilon. This easily implies that the learning map ff has the required property.

The top-down implication was shown in the exercises. See also Chernikov's notes.

Transversals and Helly-type properties

Transversals

Definition Fix a set system $\mathcal F$ on a domain $D$. A transversal of $\mathcal F$ is a set $T\subseteq D$ intersecting every $F\in\mathcal F$; the transversal number $\tau(\mathcal F)$ is the least size of a transversal. A packing is a collection of pairwise disjoint sets in $\mathcal F$; the packing number $\nu(\mathcal F)$ is the largest size of a packing.

Clearly τ(F)ν(F)\tau(\cal F)\ge \nu(\cal F). It is NP-hard to compute τ(F)\tau(\cal F) and ν(F)\nu(\cal F). For this reason, we sometimes use the following, approximate notions.

Definition Fix a set system $\mathcal F$ on a finite domain $D$.

A fractional transversal is a function $f\colon D\to[0,1]$ such that for every $F\in\mathcal F$,

$$\sum_{x\in F}f(x)\ge 1.$$

Its total weight is $\sum_{x\in D}f(x)$; the fractional transversal number $\tau^*(\mathcal F)$ is the least total weight of a fractional transversal.

A fractional packing is a function $g\colon \mathcal F\to[0,1]$ such that for every $x\in D$,

$$\sum_{F\in{\mathcal F}:\ x\in F}g(F)\le 1.$$

Its total weight is $\sum_{F \in\mathcal F}g(F)$; the fractional packing number $\nu^*(\mathcal F)$ is the largest total weight of a fractional packing.

The following lemma is straightforward.

Lemma For every set system F\cal F on a finite domain DD we have:

τ(F)ν(F)\tau^*(\cal F)\ge \nu^*(\cal F)

The following fact is a consequence (or restatement) of the duality theorem for linear programs. It is an easy consequence of Farkas' lemma.

Fact For every set system F\cal F on a finite domain DD we have

τ(F)=ν(F).\tau^*(\cal F)=\nu^*(\cal F).

Moreover, this number is rational, computable in polynomial time, and is attained both by a fractional transversal f:D[0,1]f:D\to [0,1] with rational values, and by a fractional packing g:F[0,1]g:\cal F\to[0,1] with rational values.
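The fact above is plain linear-programming duality; the following minimal sketch (ours, with a made-up example system) computes $\tau^*$ and $\nu^*$ for a small set system with an off-the-shelf LP solver and checks that they coincide.

```python
import numpy as np
from scipy.optimize import linprog

# A small set system on domain {0,...,4}: the edges of a 5-cycle, viewed as 2-element sets.
domain = list(range(5))
family = [{i, (i + 1) % 5} for i in range(5)]
A = np.array([[1 if x in F else 0 for x in domain] for F in family])  # incidence matrix

# Fractional transversal number tau*: minimize sum f(x) s.t. A f >= 1, 0 <= f <= 1.
tau = linprog(c=np.ones(len(domain)), A_ub=-A, b_ub=-np.ones(len(family)),
              bounds=[(0, 1)] * len(domain), method="highs")

# Fractional packing number nu*: maximize sum g(F) s.t. A^T g <= 1, 0 <= g <= 1.
nu = linprog(c=-np.ones(len(family)), A_ub=A.T, b_ub=np.ones(len(domain)),
             bounds=[(0, 1)] * len(family), method="highs")

print(tau.fun, -nu.fun)   # both equal 2.5 for the 5-cycle, as LP duality predicts
```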

Corollary For every set system F\cal F on a finite domain DD with VCdim(F)=d\text{VCdim}(\cal F)=d, we have:

τ(F)O(d)τ(F)logτ(F).\tau(\cal F)\le O(d)\tau^*(\cal F)\log \tau^*(\cal F).

Proof Follows from the existence of $\varepsilon$-nets of size $O(d)\frac 1 \varepsilon\log (\frac 1 \varepsilon)$. Namely, pick a fractional transversal $f:D\to [0,1]$ of total weight $r:=\tau^*(\mathcal F)$, and normalize it to get a probability measure $\mu$ with $\mu(x)=f(x)/r$. For a set $F\in\mathcal F$ we have that $\sum_{x\in F}f(x)\ge 1$ since $f$ is a fractional transversal, so $\mu(F)\ge 1/r$.

Let $\varepsilon=1/r$. By the $\varepsilon$-net theorem there is an $\varepsilon$-net $T$ for $\mu$ on $\mathcal F$ of size $O(d)\cdot r\log r$. Since $\mu(F)\ge 1/r=\varepsilon$ for all $F\in\mathcal F$ and $T$ is an $\varepsilon$-net, it follows that $T$ intersects every $F\in\mathcal F$. $\square$

Helly-type theorems

Fact (Helly's Theorem) Fix $d\ge 1$. Let $\mathcal F$ be a finite family of convex sets in $\mathbb R^d$. If any $d+1$ sets in $\mathcal F$ have a point in common, then all sets in $\mathcal F$ have a point in common.

Definition A family $\mathcal G$ has the $k$-Helly property if for every finite subfamily $\mathcal F\subseteq \mathcal G$: if every $k$ sets in $\mathcal F$ have a point in common, then all sets in $\mathcal F$ have a point in common.

For instance, the family of convex subsets of Rd\mathbb R^{d} has the (d+1)(d+1)-Helly property.

Families of bounded VC-dimension don't necessarily have the $k$-Helly property, for any finite $k$. For example, on the domain $\mathbb N$, the family of all sets of the form $\{1,\ldots,n\}\setminus\{a\}$, for $n\in\mathbb N$ and $a\le n$, has VC-dimension $2$, but does not have the $k$-Helly property for any $k$: for $n>k$, any $k$ of the sets $\{1,\ldots,n\}\setminus\{1\},\ldots,\{1,\ldots,n\}\setminus\{n\}$ have a point in common, while all $n$ of them do not.

However, families of bounded VC-dimension satisfy a related property.

Say that a family $\mathcal F$ has the $(p,q)$-property if among every $p$ sets of $\mathcal F$, some $q$ of them have a point in common.

For example, the hypothesis of the $k$-Helly property is precisely the $(k,k)$-property. Note that the conclusion in Helly's theorem is that $\mathcal F$ has a transversal of size $1$, that is $\tau(\mathcal F)\le 1$. The following is a generalization.

Fact (Alon,Kleitman). Let p,q,dp,q,d be integers with pqd+1p\ge q\ge d+1. Then there is a number N=N(p,q,d)N=N(p,q,d) such that for every finite family F\cal F of convex subsets of Rd\mathbb R^d, if F\cal F has the (p,q)(p,q)-property, then τ(F)N\tau({\cal F})\le N.

Now this statement generalizes to families of bounded VC-dimension, and can be derived from the proof of Alon and Kleitman, as observed by Matousek.

Fact (Alon,Kleitman and Matousek). Let p,q,dp,q,d be integers with pqd+1p\ge q\ge d+1. Then there is a number N=N(p,q,d)N=N(p,q,d) such that for every finite family F\cal F with VC-dim(F)<d+1\text{VC-dim}^*(\cal F)<d+1, if F\cal F has the (p,q)(p,q)-property, then τ(F)N\tau(\cal F)\le N.

Recall that $\text{VC-dim}^*(\mathcal F)$ is the dual VC-dimension of $\mathcal F$, defined as the VC-dimension of $\mathcal F^*$. Equivalently, it is the maximal number $k$ of sets in $\mathcal F$ such that every cell in the Venn diagram of those sets is nonempty (occupied by some element of the domain).

We reprove a duality result which is a corollary of the (p,q)(p,q)-theorem. We present a self-contained proof.

Let EA×BE\subset A\times B be a binary relation. For aAa\in A denote E(a):={bB(a,b)E}E(a):=\{b\in B|(a,b)\in E\} and for bBb\in B denote E(b):={aA(a,b)E}E^*(b):=\{a\in A|(a,b)\in E\}. Define the VC-dimension of a binary relation EE as the maximum of the VC-dimension of two set systems:

(B,{E(a)aA})and(A,{E(b)bB}).(B,\{ E(a)|a\in A\})\qquad\text{and}\qquad (A,\{ E^*(b)|b\in B\}).

Here's the duality result.

Theorem (Corollary of the $(p,q)$-theorem) For every $d$ there is a number $k\le O(d)$ with the following property. Let $E\subset A\times B$ be a binary relation with $\textup{VCdim}(E)\le d$, with $A,B$ finite. Then at least one of the two cases holds:

  1. there is a set $B'\subseteq B$ with $|B'|\le k$ such that $B'\cap E(a)\neq\emptyset$ for every $a\in A$, or
  2. there is a set $A'\subseteq A$ with $|A'|\le k$ such that $A'\not\subseteq E^*(b)$ for every $b\in B$.

We first state a corollary of von Neumann's minimax theorem (also of Farkas' lemma, the Hahn-Banach theorem, or of the strong duality theorem for linear programming).

Theorem (Minimax Theorem) Let EA×BE\subset A\times B be a binary relation with A,BA,B finite, and let αR\alpha\in \R. Then exactly one of the following two cases holds:

  1. there is some probability distribution ν\nu on BB such that ν(E(a))α\nu(E(a))\ge \alpha holds for all aAa\in A
  2. there is some probability distribution μ\mu on AA such that μ(E(b))<α\mu(E^*(b))< \alpha holds for all bBb\in B.

Proof of Corollary of the (p,q)(p,q)-theorem Set α:=1/2\alpha:= 1/2 and ε:=1/3\varepsilon:= 1/3; the point is that 0<α±ε<10<\alpha\pm\varepsilon<1. Fix dNd\in\N, and let kk be the number from the VC-theorem, with kO(dε2log(1ε))=O(d)k\le O(\frac{d}{\varepsilon^2}\log(\frac {1}\varepsilon))=O(d). By the Minimax Theorem applied to α=1/2\alpha= 1/2, one of the two cases below holds.

Case 1: There is some probability distribution ν\nu on BB such that ν(E(a))1/2\nu( E(a))\ge 1/2 holds for all aAa\in A.

By the VC-theorem applied to ε=1/3\varepsilon= 1/3, there is a multiset BBB'\subset B with Bk|B'|\le k such that

AvB(E(a))ν(E(a))1/3for all aA.|\textup{Av}_{B'}( E(a))-\nu( E(a))|\le 1/3\qquad\text{for all }a\in A.

Pick aAa\in A. Since ν(E(a))1/2\nu( E(a))\ge 1/2, we have that AvB(E(a))>0\textup{Av}_{B'}( E(a))>0, so BE(a)B'\cap E(a) is nonempty. Therefore, the first condition in the statement is satisfied.

Case 2: There is some probability distribution μ\mu on AA such that μ(E(b))<1/2\mu(E^*(b))< 1/2 holds for all bBb\in B.

By the VC-theorem again, there is a multiset AAA'\subset A with Ak|A'|\le k such that

AvA(E(b))μ(E(b))1/3for all bB.|\textup{Av}_{A'}(E^*(b))-\mu(E^*(b))|\le 1/3\qquad\text{for all }b\in B.

Pick bBb\in B. Since μ(E(b))<1/2\mu(E^*(b))< 1/2, we have that AvA(E(b))<1\textup{Av}_{A'}(E^*(b))<1, so AA' is not contained in E(b)E^*(b). Therefore, the second condition in the statement is satisfied. \blacksquare

Sample compression schemes

This is based on the original paper by Moran and Yehudayoff.

Fix a set family $\mathcal F$ on a finite domain $D$. Let $S_d(\mathcal F)$ denote the set of labeled samples of size $d$, that is, pairs $S:=(S_+,S_-)$ of multisets of elements of $D$ with $|S_+|+|S_-|=d$, such that there exists some $F\in \mathcal F$ containing $S_+$ and disjoint from $S_-$. For a labeled sample $S$ and set $Z\subseteq D$, let $S[Z]$ denote the sample $(S_+\cap Z,S_-\cap Z)$. If $S$ and $T$ are two samples, we say that $T$ is a sub-sample of $S$ if $T_+\subseteq S_+$ and $T_-\subseteq S_-$.

Let Sω(F)S_\omega(\cal F) denote the union of all Sd(F)S_d(\cal F), for d0d\ge 0.

Fix a family F\cal F of VC-dimension dd, for some fixed dd.

A sample compression scheme for F\cal F, of size kk, consists of two mappings: a compression mapping, and a reconstruction mapping, defined as follows.

Compression

The compression map maps each $Y:=(Y_+,Y_-)\in S_\omega(\mathcal F)$ to a pair $(Z,w)$, where $Z$ is a sub-sample of $Y$ of size at most $k$ and $w\in \{0,1\}^k$ is a bit-sequence describing some auxiliary information.

Reconstruction

A reconstruction function maps a given pair $(Z,w)$, where $Z\in S_k(\mathcal F)$ and $w\in \{0,1\}^k$, to a concept $C\subseteq D$.

We require that the composition of the two mappings – start with $Y\in S_\omega(\mathcal F)$, compress it to some $(Z,w)$, reconstruct some concept $C$ – returns a concept $C\subseteq D$ that contains $Y_+$ and is disjoint from $Y_-$.
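Before the general construction, here is the classical toy example of a sample compression scheme of size 2 (with no auxiliary bits) for intervals on the line; this illustration is ours and is independent of the Moran–Yehudayoff argument.

```python
def compress(sample):
    """Keep only the leftmost and rightmost positively labeled points (at most 2 points)."""
    pos = sorted(x for x, y in sample if y == 1)
    return (pos[0], pos[-1]) if pos else ()

def reconstruct(compressed):
    """Recover a concept consistent with every sample that is realizable by an interval."""
    if not compressed:
        return lambda x: 0                     # no positives: the empty concept works
    a, b = compressed
    return lambda x: int(a <= x <= b)

# The composition is consistent with any sample labeled by an interval:
sample = [(0.2, 0), (0.4, 1), (0.6, 1), (0.9, 0)]   # labeled by, e.g., [0.3, 0.7]
h = reconstruct(compress(sample))
print(all(h(x) == y for x, y in sample))             # True
```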

Recall that VCdim(F)\textup{VCdim}^*(\cal F) is the dual VC-dimension of F\cal F.

Theorem Let F\cal F be a set system on a finite domain, and d=VCdim(F)d=\textup{VCdim}(\cal F) and d=VCdim(F)d^*=\textup{VCdim}^*(\cal F). Then F\cal F has a sample compression scheme of size O(dd)O(d\cdot d^*).

Proof By the VC-theorem, there is $s=O(d)$ and a learning map $f:S_s(\mathcal F)\to \mathcal F$ such that for every $F\in\mathcal F$ and every probability distribution $\mu$ on $D$ there is $Z\subseteq\text{supp}(\mu):=\{v\in D\mid\mu(v)>0\}$ with $|Z|\le s$ such that

$$\mu\big(F\triangle f(F\cap Z,Z\setminus F)\big)\le 1/3.$$

Let (Y+,Y)Sω(F)(Y_+,Y_-)\in S_\omega(\cal F), and let Y=Y+YY=Y_+\cup Y_-.

The key lemma is the following.

Lemma There are $T\le O(d^*)$ sets $Z^1,\ldots,Z^T\subseteq Y_+\cup Y_-$, each of size at most $s$, such that for every $x\in Y_+\cup Y_-$ we have:

$$x\in Y_+\qquad\iff \qquad \#\{t\in \{1,\ldots,T\}\mid x\in f(Y[Z^t]) \}>T/2.$$

With this lemma, obtaining the compression scheme is straightforward, as we now describe.

Compression We assume an arbitrary total order on $D$ is fixed. To compress $(Y_+,Y_-)$, we map it to $(Z,w)$, where $Z$ is the union of the sets $Z^1,\ldots,Z^T$ from the lemma (as a sub-sample of $Y$), and $w$ is a bit-sequence recording, for each element of $Z$ (in the fixed order on $D$), to which of the sets $Z^t$ it belongs. (This encoding needs to be adjusted slightly to fit the format above; see the paper of Moran and Yehudayoff.)

Decompression Given $(Z,w)$, we first recover the sets $Z^1,\ldots,Z^T$, using $Z$ and the bit-sequence $w$. We then define a concept $C$ as the set of all $x\in D$ such that

$$\#\{t\in \{1,\ldots,T\}\mid x\in f(Z[Z^t]) \}>T/2$$

(note that $Z[Z^t]=Y[Z^t]$, since $Z^t$ is contained in the sub-sample $Z$ of $Y$).

It follows immediately from the lemma that this is a sample compression scheme of size O(dd)O(dd^*).

We now prove the lemma.

Proof of lemma Denote

G:=GY={f(Y+Z,YZ)ZY,Zs}F.{\cal G}:={\cal G}_Y=\{f(Y_+\cap Z,Y_-\cap Z)\mid Z\subset Y, |Z|\le s\}\subseteq \cal F.

For a concept CGC\in \cal G and element xY+Yx\in Y_+\cup Y_-, say that CC is correct on xx if xC    xY+x\in C\iff x\in Y_+ holds.

By definition of ff, for every probability distribution μ\mu on YY there is GGG\in \cal G such that

μ({xYG is correct on x})2/3.\mu(\{x\in Y\mid \text{$G$ is correct on $x$}\})\ge 2/3.

By the Minimax theorem, there is a probability distribution ν\nu on G\cal G such that for every xYx\in Y we have

$$\nu(\{G\in {\mathcal G}\mid \text{$G$ is correct on $x$}\})\ge 2/3.$$

By the VC-theorem, there is an 1/81/8-approximation for ν\nu on G\cal G^*: a multiset HG\cal H\subset \cal G of size TO(d)T\le O(d^*) such that for all xYx\in Y we have:

#{HHH is correct on x}#Hν({GGG is correct on x})182318>12.\frac{\#\{H\in{\cal H}\mid \text{$H$ is correct on $x$} \}}{\#\cal H}\ge \nu(\{G\in{\cal G}\mid \text{$G$ is correct on $x$}\})-\frac 1 8\ge \frac 23-\frac 18>\frac 12.

Taking Z1,,ZTZ^1,\ldots,Z^T to be the elements of H\cal H, this proves the lemma, and concludes the theorem.\blacksquare

Haussler's packing theorem

Fix a domain DD and a probabilistic distribution on DD. For two sets X,YDX,Y\subseteq D, write distP(X,Y)\textup{dist}_{\mathbb P}(X,Y) to denote the measure P[XY]\mathbb P[X\triangle Y] of the symmetric difference of XX and YY. This defines a metric on 2D2^D.

Fix $\varepsilon>0$. We are interested in the problem of packing a maximal number of pairwise disjoint (closed) balls of radius $\varepsilon$ in this metric. Equivalently, of finding the maximal size of a $2\varepsilon$-separated subset $\mathcal S\subseteq \mathcal F$, that is, a set $\mathcal S$ of elements with mutual distance larger than $2\varepsilon$.

Example. Suppose we are packing discs of radius 0<ε<10<\varepsilon<1 with center in the n×nn\times n square in R2\mathbb R^2. For a disc DD with center in [0,n]×[0,n][0,n]\times [0,n], the area of D([0,n]×[0,n])D\cap ([0,n]\times [0,n]) is at least πε2/4=Ω(ε2)\pi\varepsilon^2/4=\Omega(\varepsilon^2). Thus, we can pack at most O((n/ε)2)O((n/\varepsilon)^2) disjoint discs.

Haussler's packing theorem says that for set systems F\cal F of VC-dimension dd, the behaviour is similar as in the dd-dimensional Euclidean space.

Theorem Let d>1d>1 and CC be constants, and let F\cal F be a set system on a finite domain DD whose primal shatter function satisfies πF(m)Cmd\pi_{\cal F}(m)\le Cm^d for all mNm\in\N. Let 0<ε<10<\varepsilon<1 and fix a probability distribution on DD. If F\cal F is ε\varepsilon-separated then FO(1/εd)|{\cal F}|\le O(1/\varepsilon^d).

Corollary Let F\cal F be a set system on a finite domain DD and d=VCdim(F)d=\textup{VCdim}(\cal F). Let kk be a positive integer. If ABk|A\triangle B|\ge k for all A,BFA,B\in\cal F then FO((n/k)d)|{\cal F}|\le O((n/k)^d).

This follows from the theorem applied to ε=k/n\varepsilon=k/n and the uniform distribution on DD, by the Sauer-Shelah-Perles lemma. Conversely, for k=1k=1, the Corollary recovers the SSP lemma, although with worse hidden constants.

Our proof of Haussler's theorem follows a recent proof by Kupavskii and Zhivotovskiy.

We will prove the following lemma, which quickly yields the theorem.

Lemma (\ast). Fix 0<δ<10<\delta<1 and ε>0\varepsilon>0. Let F\cal F be an ε\varepsilon-separated set system on DD, and let μ\mu be a probability distribution on DD. Fix m=2VCdim(F)εδm=\frac {2\text{VCdim}(\cal F)}{\varepsilon\delta} and let SS be a sample of mm independently chosen elements of DD according to the distribution μ\mu. Then

#FES[#FS]1δ.\#{\cal F}\le \frac{\mathbb E_S[\#\cal F|_S]}{1-\delta}.

In particular, if πF(m)Cmd\pi_{\cal F}(m)\le Cm^d for all mm, we have that

#FπF(m)1δ11δCmd=C11δ(2VCdim(F)εδ)d=C1(1δ)δd(2VCdim(F))dεd\#{\cal F}\le \frac{\pi_{\cal F}(m)}{1-\delta}\le \frac{1}{1-\delta}\cdot Cm^d=C\frac{1}{1-\delta}\left(\frac{2\text{VCdim}(\cal F)}{\varepsilon\delta}\right)^d=C\frac{1}{(1-\delta)\delta^d}\cdot \frac {(2\text{VCdim}(\cal F))^d}{\varepsilon^{d}}

holds for all 0<δ<10<\delta<1. The minimum, over 0<δ<10<\delta<1 of 1(1δ)δd\frac{1}{(1-\delta)\delta^d} is attained for δ=d/(d+1)\delta=d/(d+1), and is at most e(d+1)e(d+1). Hence we get

#FCe(d+1)(2VCdim(F)ε)dO(1/εd),\#{\cal F}\le Ce(d+1) \left(\frac{2\text{VCdim}(\cal F)}{\varepsilon}\right)^d\le O(1/\varepsilon^d),

which proves the theorem. It therefore remains to prove the lemma.

For a set SDS\subseteq D and FFF\in\cal F denote by FSF|_S the labeled sample FS:=(S+,S)=(FS,SF)F|_S:=(S_+,S_-)=(F\cap S,S-F).

Recall that, given $\varepsilon>0$, the VC-theorem implies the existence of a learning function $f\colon S_m(\mathcal F)\to 2^D$, where $m=O(d)\frac {1}{\varepsilon}\log\frac{1}{\varepsilon}$, such that for every $F\in\mathcal F$ and random sample $S$ of $m$ elements of $D$ chosen independently with a given probability distribution $\mu$, we have that

μ[f(FS)F]<ε.\mu[f(F|_S)\triangle F]<\varepsilon.

Using this, it is easy (and instructive) to prove Lemma ($\ast$) with a weaker bound, where instead of $m=\frac {2\text{VCdim}(\mathcal F)}{\varepsilon\delta}$ we use $m=O(d)\frac 1{\varepsilon}\log \frac 1\varepsilon$. This would yield the bound

#FO(1εlog1ε)d,\#{\cal F}\le O\left(\frac 1{\varepsilon}\log \frac 1\varepsilon\right)^d,

which was obtained originally by Dudley. [To obtain Dudley's bound, proceed as in the proof of Lemma ($\ast$) presented below, but instead of the learning function obtained in the proposition below, use the learning function obtained from the VC-theorem.]

In order to obtain the improved bound as stated in Lemma ()(\ast), we need a learning function with the number of sample elements of the form m=Oδ(1ϵ)m=O_\delta(\frac 1 \epsilon), rather than m=Oδ(1ϵlog1ϵ)m=O_{\delta}(\frac 1\epsilon\log \frac 1\epsilon). This might come at a worse dependency on δ\delta, which we don't care about, because we consider δ\delta to be fixed, say δ=12\delta=\frac 1 2. This is achieved by the following proposition.

Proposition Let F\cal F be a set system on a domain DD. Fix m=2VCdim(F)εδm=\frac{2\text{VCdim}(\cal F)}{\varepsilon\delta}. There is a "learning function" f ⁣:Sm(F)2Df\colon S_m(\cal F)\to 2^D such that the following holds. For every FFF\in\cal F and random sample SS of mm elements of DD chosen independently with a given probability distribution μ\mu, we have that

μ[f(FS)F]<ε2\mathbb \mu[f(F|_S)\triangle F]<\frac\varepsilon 2

holds with probability at least 1δ1-\delta over the sample.

The following lemma is key. Given a set system F\cal F, consider its Hamming graph HFH_{\cal F}, with vertices F\cal F and edges connecting two sets F,GFF,G\in \cal F if and only if FG=1|F\triangle G|=1. The edge density of a graph (V,E)(V,E) is E/V|E|/|V|.

Lemma The Hamming graph HFH_{\cal F} of F\cal F has edge density at most VCdim(F)\textup{VCdim}(\cal F).

The proof of the lemma is by repeating the proof of the Sauer-Shelah-Perles lemma and observing that it also yields this conclusion. See also Geometric Discrepancy by Matousek, Lemma 5.15.

This implies that the same holds for every induced subgraph of HFH_{\cal F}.
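A quick numerical sanity check of the lemma (our own sketch, not part of the proof): for the family of all subsets of size at most $d$ of an $n$-element domain, whose VC-dimension is $d$, the Hamming graph indeed has at most $d$ times as many edges as vertices.

```python
from itertools import combinations

def hamming_edge_count(family):
    """Number of pairs {F, G} in the family whose symmetric difference has exactly one element."""
    fam = list(family)
    return sum(len(fam[i] ^ fam[j]) == 1 for i in range(len(fam)) for j in range(i))

# The family of all subsets of size <= d of an n-element domain has VC-dimension d.
n, d = 7, 2
family = [frozenset(s) for k in range(d + 1) for s in combinations(range(n), k)]
edges = hamming_edge_count(family)
print(edges, "<=", d * len(family))   # 49 <= 58: edge density at most VCdim, as the lemma asserts
```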

Fact Let GG be a graph such that every induced subgraph HH of GG has edge density at most dd. Then there is an orientation of the edges of GG such that every node has outdegree at most dd.

Remark This is very easy to prove with 2d2d instead of dd: repeatedly remove a vertex vv of degree at most 2d2d, and orient its edges towards the remaining vertices. This can be improved to dd using Hall's marriage theorem (see Alon, Tarsi '92).

Corollary There is an orientation of the edges of HFH_{\cal F} such that every node has outdegree at most VCdim(F)\textup{VCdim}(\cal F).

Proof of Proposition Let d0=VCdim(F)d_0=\text{VCdim}(\cal F). Fix a probability distribution μ\mu on DD. We will show that there is a function f ⁣:Sm(F)2Df\colon S_m(\cal F)\to 2^D such that

ES[μ(f(FS)F)]<d0m\mathbb E_S\left[ \mu(f(F|_S)\triangle F)\right]<\frac {d_0} m

where the expectation is with respect to SS.

Markov's inequality then gives the conclusion:

PS[μ(f(FS)F)ε2]ES[μ(f(FS)F)]ε/2<2d0mε=δ.\mathbb P_S\left[\mu(f(F|_S)\triangle F)\ge \frac \varepsilon 2\right]\le \frac {\mathbb E_S[\mu(f(F|_S)\triangle F)]}{\varepsilon/2}<\frac {2d_0}{m\varepsilon}=\delta.

We proceed to defining the learning function ff. Fix a set S={s1,,sm}S=\{s_1,\ldots,s_m\}. Suppose that we are given the labelling FSF|_S, for some FFF\in\cal F, and we want to guess the label of s0s_0, for some s0Ds_0\in D.

Let $S':=S\cup\{s_0\}$. Fix an orientation of $H_{{\mathcal F}|_{S'}}$ in which every node has out-degree at most $d_0=\textup{VCdim}(\mathcal F)$. If there is only one $G\in {\mathcal F}|_{S'}$ that agrees with $F|_S$, then $G=F|_{S'}$ and we label $s_0$ according to $G$. If there are two concepts $G,H\in {\mathcal F}|_{S'}$ that agree with $F|_S$ then $G$ and $H$ are adjacent in $H_{{\mathcal F}|_{S'}}$. We label $s_0$ according to the one which is the out-neighbor of the other.

This way, by ranging $s_0$ over all elements of $D$, we define a concept $f(F|_S)$, based on the labeled sample $F|_S$.

We argue that for all FFF\in \cal F, we have:

ES[μ(f(FS)F)]d0m+1<d0m.\mathbb{E}_S[ \mu(f(F_S)\triangle F)]\le \frac {d_0}{m+1}<\frac {d_0} m.

Fix $F\in\mathcal F$. Let $S'=\{s_0,\ldots,s_m\}$ be a random sample of size $m+1$, and define $m+1$ concepts $H_0,\ldots,H_{m}$, where $H_i$ is obtained by leaving the $i$th element $s_i$ out and applying $f$ as defined above to the resulting labeled sample of size $m$. Since $s_i$ is distributed according to $\mu$ and is independent of $S'\setminus\{s_i\}$, we have

$$\mathbb E_{S}[\mu(f(F|_S)\triangle F)]=\frac{1}{m+1}\,\mathbb E_{S'}\big[\#\{i : s_i\in H_i\triangle F\}\big].$$

Each index $i$ with $s_i\in H_i\triangle F$ corresponds to a distinct out-edge of $F|_{S'}$ in the chosen orientation of $H_{{\mathcal F}|_{S'}}$, hence

$$\#\{i : s_i\in H_i\triangle F\}\le \textup{out-deg}(F|_{S'})\le d_0.$$

This gives the required bound d0m+1<d0m\frac {d_0}{m+1}<\frac {d_0}m. \blacksquare

We now prove Lemma ()(\ast), recalled below.

Lemma (\ast). Fix 0<δ<10<\delta<1 and ε>0\varepsilon>0. Let F\cal F be an ε\varepsilon-separated set system on DD, and let μ\mu be a probability distribution on DD. Let m=2VCdim(F)εδm=\frac {2\text{VCdim}(\cal F)}{\varepsilon\delta} and let SS be a sample of mm independently chosen elements of DD according to the distribution μ\mu. Then

#FES[#FS]1δ.\#{\cal F}\le \frac{\mathbb E_S[\#\cal F|_S]}{1-\delta}.

Proof For FFF\in\cal F and a set SS of mm elements, say that FF is well approximated on SS if μ(f(FS)F)<ε/2\mu(f(F|_S)\triangle F)<\varepsilon/2, where ff is the learning function from the proposition.

We first show that for any fixed set SS of size mm,

#{FF:F is well approximated on S}#FS.\#\{F\in\cal F:\text{$F$ is well approximated on $S$}\}\le \#\cal F|_S.

To prove this, suppose that $F,G\in\mathcal F$ are distinct and $F|_S=G|_S$. Let $\hat F=f(F|_S)$ and $\hat G=f(G|_S)$; note that $\hat F=\hat G$. Since $\mathcal F$ is $\varepsilon$-separated we have $\varepsilon\le \mu(F\triangle G)$. By the triangle inequality we have:

εμ(FG)μ(FF^)+μ(GG^)\varepsilon\le \mu(F\triangle G)\le \mu(F\triangle \hat F)+ \mu(G\triangle \hat G)

so one of those must be at least ε/2\varepsilon/2. So at most one of F,GF,G is well approximated on SS, which proves the inequality above.

The above inequality can be written as follows, where FFF\in\cal F is chosen uniformly at random:

PF[F is well approximated on S]#FS#F.\mathbb P_F [\text{$F$ is well approximated on $S$}]\le \frac{\#\cal F|_S}{\#\cal F}.

In particular, if SS is a random sample of mm elements chosen independently with any given distribution μ\mu, we have:

ESPF[F is well approximated on S]ES[#FS]#F,\mathbb E_S\mathbb P_F [\text{$F$ is well approximated on $S$}]\le \frac{\mathbb E_S[\#\cal F|_S]}{\#\cal F},

On the other hand, by the proposition, for every fixed FFF\in\cal F we have that FF is well approximated on a random sample SS of mm elements with probability at least 1δ1-\delta. In particular,

EFPS[F is well approximated on S]1δ.\mathbb E_F\mathbb P_S [\text{$F$ is well approximated on $S$}]\ge 1-\delta.

As the left-hand sides in the two inequalities are equal, we get that 1δES[#FS]#F1-\delta\le \frac{\mathbb E_S[\#\cal F|_S]}{\#\cal F}, which yields the lemma.\blacksquare

Zarankiewicz problem and incidence geometry

The Zarankiewicz problem asks about the largest number of edges in a bipartite graph with parts of sizes $m$ and $n$ that avoids a fixed biclique $K_{s,t}$ as a subgraph. Let $z(m,n;s,t)$ denote this number.

The Kővári–Sós–Turán theorem gives an upper bound.

Theorem. (Kővári–Sós–Turán, 1954)

$$z(m,n;s,t)<(s-1)^{1/t}(n-t+1)m^{1-1/t}+(t-1)m=O_{s,t}(nm^{1-1/t}+m),$$
$$z(n,n;t,t)=O_t(n^{2-1/t}).$$

We prove the latter inequality, for an arbitrary graph: every nn-vertex graph that avoids Kt,tK_{t,t} as a subgraph, has at most O(n21/t)O(n^{2-1/t}) edges.

Proof: See Conlon's notes.\square

Example We show that there is a K2,2K_{2,2}-free graph matching the bound O(n3/2)O(n^{3/2}). Thus, for s=t=2s=t=2, the bound is optimal.

Let pp be a prime and Fp\mathbb F_p be the field with pp elements. Let PP be the set of p2p^2 points in Fp2\mathbb F_p^2, and LL be the set of p2p^2 lines in Fp2\mathbb F_p^2 of the form {(x,y)Fp2:y=ax+b}\{(x,y)\in\mathbb F_p^2:y=ax+b\}, for some a,b\Fpa,b\in\F_p. Since every line contains pp points, there are p3p^3 point-line incidences. This graph excludes K2,2K_{2,2}, since two distinct lines can have only one point in common. Thus we have a bipartite K2,2K_{2,2}-free graph with n=2p2n=2p^2 vertices and Θ(n3/2)\Theta(n^{3/2}) edges, matching the Kővári–Sós–Turán bound. This holds for all nn since for large enough nn there is a prime between nn1/3\sqrt n-n^{1/3} and n\sqrt n (this is a very deep result).\square
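A short computational sketch of this construction (ours): build the point-line incidence graph over a small prime field, verify that it is $K_{2,2}$-free, and compare the number of edges with $n^{3/2}$.

```python
from itertools import combinations

p = 7
points = [(x, y) for x in range(p) for y in range(p)]
lines = [(a, b) for a in range(p) for b in range(p)]          # the line y = a*x + b

def incident(pt, ln):
    (x, y), (a, b) = pt, ln
    return y == (a * x + b) % p

edges = sum(incident(pt, ln) for pt in points for ln in lines)
# K_{2,2}-freeness: no two distinct lines share two distinct points.
k22_free = all(sum(incident(pt, l1) and incident(pt, l2) for pt in points) <= 1
               for l1, l2 in combinations(lines, 2))
n = len(points) + len(lines)                                   # n = 2 p^2
print(edges, round(n ** 1.5), k22_free)                        # 343 edges, n^{3/2} ≈ 970, True
```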

It turns out that for bipartite graphs arising as incidence graphs of points and lines in the plane, we get better bounds. Given a collection PP of points and LL of lines in R2\mathbb R^2, let I(P;L)I(P;L) denote the number of pairs (p,)P×L(p,\ell)\in P\times L such that pp\in\ell. Let I(m;n)I(m;n) denote the maximum of I(P;L)I(P;L), over all mm-element PP and nn-element LL.

Theorem (Szemerédi-Trotter, 1983).

I(m;n)O(m2/3n2/3+m+n),I(m;n)\le O(m^{2/3}n^{2/3}+m+n),

and this bound is asymptotically tight.

The bound for the very unbalanced cases $m\ge n^2$ or $n\ge m^2$ follows from the Kővári–Sós–Turán theorem. We follow a proof for the balanced case $m=n$, based on Matoušek, Lectures on Discrete Geometry, Chapter 4.5. This uses the Cutting Lemma for lines in the plane.

We now prove a generalization of the Kővári–Sós–Turán bound, for graphs of bounded VC-dimension. We follow the paper of Fox, Pach, Sheffer, Suk, and Zahl (2017).

Theorem 1 (Fox, Pach, Sheffer, Suk, Zahl 2017). Let G=(A,B,E)G=(A,B,E) be a bipartite graph with A=m,B=n|A|=m,|B|=n and such that the set system G={N(b)bB}\cal G=\{N(b)\mid b\in B\} satisfies πG(z)czd\pi_{\cal G}(z)\le c z^d for some constants c,dc,d and all zz. Then, if GG is Kt,tK_{t,t}-free, we have

EOc,d,t(mn11/d+n).|E|\le O_{c,d,t}(mn^{1-1/d}+n).

The Kővári–Sós–Turán bound is the same, with 11/t1-1/t instead of 11/d1-1/d.

Corollary. Fix d1,d2,cNd_1,d_2,c\in\mathbb N. Let G=(A,B,E)G=(A,B,E), where ARd1A\subset \mathbb R^{d_1} and BRd2B\subset \mathbb R^{d_2} are sets of size m,nm,n respectively, and EA×BE\subseteq A\times B is defined by a boolean combination of cc polynomial equalities and inequalities with d1+d2d_1+d_2 variables of degree at most cc. If GG is Kt,tK_{t,t}-free, then

EOd1,d2,c,t(mn11/d2+n).|E|\le O_{d_1,d_2,c,t}(mn^{1-1/{d_2}}+n).

From the Corollary (for d1=d2=2d_1=d_2=2) and a cutting lemma for algebraic curves in the plane, one can derive the following generalization of the Szemeredi-Trotter result.

Theorem 2. Let G=(A,B,E)G=(A,B,E), where AR2A\subset \mathbb R^2 and BR2B\subset \mathbb R^2 are sets of size m,nm,n respectively, and EA×BE\subseteq A\times B is defined by a boolean combination of cc polynomial equalities and inequalities with 44 variables of degree at most cc. If GG is Kt,tK_{t,t}-free, then

EOc,t(m2/3n2/3+m+n).|E|\le O_{c,t}(m^{2/3}n^{2/3}+m+n).

We only prove Theorem 1. This follows easily from the following generalization of Haussler's theorem.

Say that a family $\mathcal F$ of sets on an $n$-element domain is $(k,\varepsilon)$-separated, where $\varepsilon>0$, if for every $k$ pairwise distinct sets $F_1,\ldots,F_k\in \mathcal F$ we have

$$|(F_1\cup\cdots\cup F_k)-(F_1\cap\cdots\cap F_k)|\ge \varepsilon n.$$

In other words, for every such tuple $F_1,\ldots,F_k\in\mathcal F$, for at least an $\varepsilon$-fraction of the elements $x$ of the domain, the membership of $x$ is not the same in all the $F_i$'s. So a $(2,\varepsilon)$-separated family is the same as an $\varepsilon$-separated family from the previous lecture (for the uniform measure on the domain).

Recall that Haussler's theorem says that if F\cal F is an ε\varepsilon-separated family with πF=O(md)\pi_{\cal F}=O(m^d), then FOd((1ε)d)|{\cal F}|\le O_d((\frac{1}{\varepsilon})^d).

We now prove the same result for (k,ε)(k,\varepsilon)-separated families.

Theorem 3 Fix kNk\in\N and ε,d>0\varepsilon,d>0. Let F\cal F be a (k,ε)(k,\varepsilon)-separated family with πF=O(md)\pi_{\cal F}=O(m^d). Then FOk,d((1ε)d)|{\cal F}|\le O_{k,d}((\frac{1}{\varepsilon})^d).

The proof of Haussler's theorem relied on the following.

Proposition. Let $\varepsilon,\delta>0$ and $m=\frac{\textup{VCdim}(\mathcal F)}{\varepsilon\delta}$. There is a "learning function" $f\colon S_m({\mathcal F})\to 2^D$ such that for every $F\in\mathcal F$ and random sample $S$ of $m$ elements of $D$ chosen independently with a given probability distribution $\mu$, we have that

μ[f(FS)F]<ε\mathbb \mu[f(F|_S)\triangle F]<\varepsilon

holds with probability at least 1δ1-\delta over the sample.

We now prove Theorem 3.

Proof. Let F\cal F be a (k,ε)(k,\varepsilon)-separated family with πF=O(md)\pi_{\cal F}=O(m^d).

Fix a parameter 0<δ<10<\delta<1. Apply the proposition to ε/k\varepsilon/k, obtaining m=kVCdim(F)εδm=\frac{k\textup{VCdim}({\cal F})}{\varepsilon\delta} and a learning function f ⁣:Sm(F)2Df\colon S_m({\cal F})\to 2^D. Let μ\mu denote the counting measure on the domain, with μ(F)=F/n\mu(F)=|F|/n.

Say that a set FFF\in \cal F is well-approximated on a set SS of size mm if μ[f(FS)F]ε/k\mu[f(F|S)\triangle F]\le\varepsilon/k. The Proposition says that every FF is well-approximated on a random SS with probability at least 1δ1-\delta.

Therefore, for a random mm-tuple SS, the expected number of sets FF that are well-approximated on SS is at least #F(1δ)\#{\cal F}(1-\delta):

1δ1#FFFES[F is well approximated on S]=1#FESFF[F is well approximated on S].1-\delta\le \frac 1 {\#\cal F}\sum_{F\in\cal F}\mathbb E_S [F \text{ is well approximated on }S]=\frac 1{\#\cal F}\mathbb E_S\sum_{F\in \cal F}[F \text{ is well approximated on }S].

On the other hand, for a given SS, there are at most (k1)#(FS)(k-1)\cdot \#({\cal F}|_S) sets that are well approximated on SS.

Indeed, for any $k$-tuple of pairwise distinct sets $F_1,\ldots,F_k$ with the same restriction to $S$, not all of them are well approximated on $S$. Let $K=f(F_1|_S)=\cdots=f(F_k|_S)$. Then

$$(F_1\cup\ldots\cup F_k)-(F_1\cap \cdots \cap F_k)\subseteq (F_1\triangle K)\cup\cdots\cup (F_k\triangle K):$$
indeed, if $x$ belongs to some $F_i$ but not to some $F_j$, then $x\in F_i\triangle K$ or $x\in F_j\triangle K$, depending on whether $x\notin K$ or $x\in K$. So if all the $F_i$ were well approximated on $S$, the right-hand side would have measure at most $k\cdot(\varepsilon/k)=\varepsilon$, and so would the left-hand side, contradicting the assumption that $\mathcal F$ is $(k,\varepsilon)$-separated.

Hence, for a fixed set $P\in \mathcal F|_S$, there are at most $k-1$ sets in $\mathcal F$ that restrict to $P$ and are well approximated on $S$. Therefore, the number of sets $F$ that are well-approximated on $S$ is at most $(k-1)\cdot \#({\mathcal F}|_S)$.

We get that the expected number XX of sets FF that are well-approximated on SS satisfies

#F(1δ)X(k1)ES#(FS)Ok(πF(m)).\#{\cal F}\cdot (1-\delta)\le X \le (k-1)\cdot \mathbb E_S \#({\cal F}|_S)\le O_k(\pi_{\cal F}(m)).

Since δ\delta is a fixed constant, this proves

#FOk(πF(m)),\#{\cal F}\le O_k(\pi_{\cal F}(m)),
as required. \blacksquare

We now prove Theorem 1.

Theorem 1 Let G=(A,B,E)G=(A,B,E) be a bipartite graph with A=m,B=n|A|=m,|B|=n and such that the set system G={N(b)bB}\cal G=\{N(b)\mid b\in B\} satisfies πG(z)czd\pi_{\cal G}(z)\le c z^d for some constants c,dc,d and all zz. Then, if GG is Kt,tK_{t,t}-free, we have

EOc,d,t(mn11/d+n).|E|\le O_{c,d,t}(mn^{1-1/d}+n).

Proof

Claim. There is $\Delta=O(m/n^{1/d})$ and a set $B'\subseteq B$ with $|B'|= t$ such that at most $\Delta$ elements $a\in A$ are non-homogeneous towards $B'$ (that is, adjacent to some but not all vertices of $B'$).

Proof. Let $c'$ be the multiplicative constant from Theorem 3, and fix $b$ to be chosen below. Suppose that for every $t$-tuple $B'$, at least $\Delta:=b m/n^{1/d}$ elements $a\in A$ are non-homogeneous towards $B'$. Then $\mathcal G$ is $(t,b/n^{1/d})$-separated, so by Theorem 3, we have $|\mathcal G|\le c'\cdot n/b^d$, whereas $|\mathcal G|\ge n$. Picking $b$ so that $c'<b^d$ we get a contradiction. Thus, we may take $\Delta=bm/n^{1/d}$.$\square$

Since $G$ is $K_{t,t}$-free, it follows that every vertex $b\in B'$ has degree at most $\Delta+t$ in $G$. We may remove any such vertex $b$ and repeat. Therefore, $|E|\le n\cdot (\Delta+t)=O(mn^{1-1/d}+n)$, as required. $\blacksquare$

Matchings with small crossing number

We follow Geometric Discrepancy, Jiri Matousek, Chapter 5.4. We prove Theorem 5.17 from there:

Theorem Let F\cal F be a set system on a 2n2n-element domain DD, with πF(m)cmd\pi_{\cal F}(m)\le c\cdot m^d for some constants c,d>1c,d>1. Then there exists a perfect matching on DD whose crossing number is at most Oc,d(n11/d+logn)O_{c,d}(n^{1-1/d}+\log n).

Here, the crossing number of a graph (V,E)(V,E) with respect to F\cal F is the maximum, over all FFF\in\cal F, of the number of pairs {u,v}E\{u,v\}\in E with uFu\in F and vFv\notin F.
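For concreteness, a tiny helper (our own illustration) that computes the crossing number of a given set of edges with respect to a set system, directly from this definition.

```python
def crossing_number(edges, family):
    """Max over F in the family of the number of edges with exactly one endpoint in F."""
    return max(sum((u in F) != (v in F) for u, v in edges) for F in family)

# A perfect matching on 6 points versus the system of "initial segments" {0,...,i}.
matching = [(0, 1), (2, 3), (4, 5)]
family = [set(range(i + 1)) for i in range(6)]
print(crossing_number(matching, family))   # 1: each initial segment cuts at most one pair
```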

Corollary Let F\cal F be a set system on an nn-element domain DD, with πF(m)cmd\pi_{\cal F}(m)\le c\cdot m^d for some constants c,d1c,d\ge 1. Then there exists a path on DD whose crossing number is at most Oc,d(n11/d+log(n))O_{c,d}(n^{1-1/d}+\log(n)).

The proof of the corollary (using the theorem) is left as an exercise.

Geometric range searching

Geometric range searching is an area of computational geometry, which stimulated a large research effort in the mid 80's-mid 90's. Many of the results that we have seen in the lecture so far were stimulated by problems in this area (e.g. ε\varepsilon-nets, cuttings, packing lemma, matchings with low crossing numbers). This area has deep connections with important problems in mathematics and in computer science. See Geometric range searching by Jiri Matousek for a survey.

In this area, a set system (D,F)(D,\cal F) is called a range space, and the sets in F\cal F are called ranges. Typical range spaces considered in this area are:

  1. (halfspace range searching) half-spaces in Rd\mathbb R^d
  2. (orthogonal range searching) axis-parallel boxes in Rd\mathbb R^d
  3. (simplex range searching) simplices in $\mathbb R^d$
  4. discs in Rd\mathbb R^d.

For a fixed range space $(D,\mathcal F)$, consider the following problem. We are given a finite set $P\subset D$ of points, and our task is to compute a data structure that allows us to answer the following queries, for a given range $F\in\mathcal F$:

  1. (range emptiness) decide whether $F\cap P$ is empty,
  2. (range counting) compute $|F\cap P|$,
  3. (range reporting) output all elements of $F\cap P$.

Those problems are motivated by practical applications, e.g. in databases.

A common generalization of the above problems asks about sums in a commutative semigroup (S,+)(S,+). Here we assume that PP is given together with a weight function w ⁣:PSw\colon P\to S, and the data structure allows to answer queries: given a range FFF\in\cal F, compute the weighted sum

vFPw(v).\sum_{v\in F\cap P}w(v).

This generalizes the above, by taking:

  1. the semigroup $(\mathbb N,+)$ with $w\equiv 1$, for range counting,
  2. the semigroup $(\{0,1\},\vee)$ with $w\equiv 1$, for range emptiness,
  3. the semigroup $(2^P,\cup)$ with $w(v)=\{v\}$, for range reporting.

For semigroups like in the first two cases, we assume the unit cost model, where semigroup elements can be represented in constant space and operations take constant time. For the last semigroup, this is an unreasonable assumption, and we need to pick different representations of sets (such as lists or arrays).

Intervals on the line

Let us first consider a very simple example of range spaces: the set of intervals on the line R\mathbb R.

Let PRP\subset \mathbb R be a set of nn points, and assume that they are sorted.

For the range emptiness and counting problems, we can use a very simple solution: we precompute in time $O(n)$, for every $a\in P$, the number $n_a$ of elements of $P$ that are $\le a$. Then $|[a,b]\cap P|$ is the difference $n_{b'}-n_{a'}$, where $b'$ is the largest element of $P$ with $b'\le b$ and $a'$ is the predecessor of $a$ in $P$ (the largest element of $P$ smaller than $a$). The query time is $O(\log n)$ if we allow arbitrary intervals $[a,b]$ as queries, and $O(1)$ if we only consider intervals with endpoints in $P$.

The above solution also works for computing weights in some semigroups -- specifically, those that embed into a group. But it does not work for all semigroups, for instance, for $(\mathbb N,\max)$.

For range reporting, we just store the elements of PP in a doubly-linked list. The query time is still O(logn)O(\log n) or O(1)O(1) if we agree to represent outputs as enumerators, and O(logn+output)O(\log n+|output|) or O(output)O(|output|) if we wish to output the elements.

Note that we can always have a trivial solution that precomputes all possible answers in time $O(n^2)$, and answers queries in time $O(\log n)$ or $O(1)$. This brute-force approach also works for many other range search problems (such as for axis-parallel boxes), and the game is to minimize the memory usage and/or pre-processing time, while keeping the query time small.

The following solution works for all semigroups (for the interval range space). In the preprocessing phase, construct a balanced (rooted, ordered) binary tree $T$, whose leaves are the points of $P$, so that the left-to-right order on the leaves agrees with the usual order on $P$. Hence, $T$ has height $\Theta(\log n)$. We will use $T$ as a binary search tree. With each node $v$ of $T$ we associate a canonical interval $I_v=[a,b]$, where $a$ and $b$ are the smallest and largest leaves below $v$.

Now, given any interval I=[a,b]I=[a,b] (where a,bRa,b\in\mathbb R), we can find using binary search two points a,bPa',b'\in P such that IP=[a,b]PI\cap P=[a',b']\cap P. Moreover, IPI\cap P is a disjoint union of O(logn)O(\log n) canonical intervals, each intersected with PP. Those canonical intervals can be computed while performing the binary search, in time O(logn)O(\log n).

We can precompute the total weight w(IvP)w(I_v\cap P) of every canonical interval IvI_v in time O(n)O(n), while constructing the tree TT. Since the total weight of IPI\cap P is the sum of the total weights of the canonical intervals (as the union is disjoint), we can compute it in time O(logn)O(\log n), given II.

Note that this reasoning works in every semigroup in the unit cost model. The preprocessing time is O(n)O(n) (or O(nlogn)O(n\log n) if the points are not sorted) and query time is O(logn)O(\log n).
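The following Python sketch implements the canonical-interval decomposition as an array-based binary tree over the sorted points. It assumes the unit cost model, i.e. that op is an associative and commutative operation computable in constant time; all names are illustration choices. It works for semigroups such as (N, max) that do not embed into a group.

```python
import bisect

class SemigroupRangeSearch:
    """Canonical-interval (segment-tree) structure over sorted points.

    `op` is any associative, commutative operation; `weights` assigns a
    semigroup element to each point.  Preprocessing is O(n) (given sorted
    points), queries take O(log n) semigroup operations.
    """
    def __init__(self, points, weights, op):
        self.pts = points              # assumed sorted
        self.op = op
        self.n = len(points)
        size = 1
        while size < self.n:
            size *= 2
        self.size = size
        self.tree = [None] * (2 * size)
        for i, w in enumerate(weights):
            self.tree[size + i] = w
        for i in range(size - 1, 0, -1):   # total weight of each canonical interval
            l, r = self.tree[2 * i], self.tree[2 * i + 1]
            self.tree[i] = l if r is None else (r if l is None else op(l, r))

    def query(self, a, b):
        """Total weight of points p with a <= p <= b, or None if there are none."""
        lo = bisect.bisect_left(self.pts, a) + self.size
        hi = bisect.bisect_right(self.pts, b) - 1 + self.size
        res = None
        while lo <= hi:                    # combine O(log n) canonical intervals
            if lo % 2 == 1:
                res = self.tree[lo] if res is None else self.op(res, self.tree[lo])
                lo += 1
            if hi % 2 == 0:
                res = self.tree[hi] if res is None else self.op(res, self.tree[hi])
                hi -= 1
            lo //= 2
            hi //= 2
        return res

pts = [1, 3, 4, 7, 9]
rs = SemigroupRangeSearch(pts, [10, 2, 5, 8, 1], max)  # (N, max) does not embed into a group
assert rs.query(2, 8) == 8                              # points 3, 4, 7 have weights 2, 5, 8
```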

Example As mentioned, for (N,+)(\mathbb N,+), or semigroups that embed in groups, we may improve the query time to O(1)O(1), assuming the queried intervals are of the form [a,b][a,b] for a,bPa,b\in P (otherwise we must spend Ω(logn)\Omega(\log n) time even for emptiness).

Another semigroup where we can achieve some speedup is (N{+},min)(\mathbb N\cup\{+\infty\},\min).

This can be formulated in terms of range minimum queries.

We assume the RAM model in the following, and that each input number fits into a single word.

Theorem Given a list a1,,ana_1,\ldots,a_n of numbers, we can compute in time O(n)O(n) a data structure that allows to answer the range minimum queries in time O(1)O(1): given 1ijn1\le i\le j\le n, output arg min(ai,,aj)\argmin(a_i,\ldots,a_j).

This result depends quite delicately on the computation model. If we assume the pointer machine model (where we can follow pointers in constant time, but cannot manipulate those pointers), then there is a lower bound which says that the above is not possible -- there is an Ω(loglogn)\Omega(\log \log n) lower bound for query answering (Harel, Tarjan, 1984).
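For comparison, the following sparse-table sketch answers range minimum queries in O(1) time but uses O(n log n) preprocessing, so it does not attain the O(n) preprocessing of the theorem (that requires additional machinery); it is only meant to illustrate how O(1) queries are possible at all. The names are illustration choices.

```python
def build_sparse_table(a):
    """table[k][i] = index of the minimum of a[i .. i + 2**k - 1]."""
    n = len(a)
    table = [list(range(n))]
    k = 1
    while (1 << k) <= n:
        prev = table[k - 1]
        half = 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            l, r = prev[i], prev[i + half]
            row.append(l if a[l] <= a[r] else r)
        table.append(row)
        k += 1
    return table

def range_min_index(a, table, i, j):
    """Index of the minimum of a[i..j] (inclusive), in O(1) time."""
    k = (j - i + 1).bit_length() - 1
    l, r = table[k][i], table[k][j - (1 << k) + 1]
    return l if a[l] <= a[r] else r

a = [3, 1, 4, 1, 5, 9, 2, 6]
t = build_sparse_table(a)
assert a[range_min_index(a, t, 2, 6)] == 1   # minimum of 4, 1, 5, 9, 2
```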

Simplex range searching

In all the considered geometric range search problems, the range space on Rd\mathbb R^d is fixed, and dd is considered to be a small, fixed constant.

Simplex range searching appears to be an important case to which many other geometric range search problems can often be reduced. Below, O~()\tilde O(\cdot) hides a factor (logn)O(1)(\log n)^{O(1)}.

Informal statement. Let PRdP\subset \mathbb R^d be an nn-point set and mm be a parameter with nmndn\le m \le n^d. The simplex range search problem and the halfspace range problem can be solved with preprocessing time and space O~(m)\tilde O(m) and query time O~(n/m1/d)\tilde O(n/m^{1/d}). Moreover, the query time is optimal, given the preprocessing time and space mm.

For example, on one extreme, for m=nm=n we get preprocessing time O~(n)\tilde O(n) and query time O~(n11/d)\tilde O(n^{1-1/d}), which is necessary. On the other extreme, for m=ndm=n^d we get preprocessing time O~(nd)\tilde O(n^d) and query time O~(1)\tilde O(1).

Logarithmic query time

Consider the halfspace range problem. Recall that for an nn-point set in Rd\mathbb R^d, we have O(nd)O(n^d) different sets PFP\cap F, for different half-spaces FF. The solution is to precompute the answers for all possible PFP\cap F, and then answer queries in time O(1)O(1). However, to implement this, we also need to be able to find PFP\cap F, given FF.

This can be reduced (by duality) to the point location problem:

Given a subdivision of Rd\mathbb R^d into convex cells, construct a data structure that allows to quickly find the cell containing any given point.

For partitions of Rd\mathbb R^d given by hyperplane arrangements, there is a solution with O(nd)O(n^d) preprocessing and O(logn)O(\log n) query time (Chazelle 1993). This relies on cuttings.

Recall:

Theorem (Chazelle, Friedman) Let HH be a set of nn hyperplanes in Rd\mathbb R^d and let rr be a parameter, with 1<rn1< r\le n. There is a (1/r)-cutting of Rd\mathbb R^d into O(rd)O(r^d) simplices with disjoint interiors, covering Rd\mathbb R^d, such that no simplex is intersected by more than n/rn/r of the hyperplanes.

Linear space

Consider the halfspace range problem in two dimensions. That is, the ranges are half-planes in R2\mathbb R^2.

Early solutions

Early solutions are based on the ham-sandwich theorem, and on partition trees.

Suppose for simplicity that the set PP has size nn which is a power of 44. Find a line \ell that divides PP into two sets of size n/2n/2. By the ham-sandwich theorem, there is a line \ell' such that each of the four regions A1,,A4A_1,\ldots,A_4 formed by \ell and \ell' contains n/4n/4 points.

In each of those regions AiA_i, proceed similarly, finding four regions AijA_{ij}, each with n/42n/4^2 elements of PP, and so on.

This way, we get a rooted tree TT of branching 44, which can be used to solve the halfplane range problem. The observation is that for any halfplane HH and any node of our tree TT, at least one of the four associated regions is not cut by the boundary line of HH, and so is either contained in HH, or disjoint from HH.

Therefore, we report \emptyset or the precomputed set APA\cap P for this region AA, and proceed recursively with the remaining three regions.

From this we get a query time

T(n)C+3T(n/4),T(n)\le C+3T(n/4),

which resolves to T(n)=nlog43n0.79T(n)=n^{\log_4 3}\approx n^{0.79}.

Spanning trees with low crossing numbers

The work of Haussler, Welzl, and Chazelle has led to a huge improvement over earlier approaches based on ideas similar to the one above.

Recall the result from the previous lecture/tutorial:

Fact Let F\cal F be a set system on an nn-element domain DD, with πF(m)cmd\pi_{\cal F}(m)\le c\cdot m^d for some constants c,d>1c,d>1. Then there exists a path on DD whose crossing number is at most O(n11/d)O(n^{1-1/d}).

In our case (for halfplanes), we have πF(m)O(m2)\pi_{\cal F}(m)\le O(m^2), so we get a path p1p2pnp_1-p_2-\ldots-p_n on PP with crossing number O(n)O(\sqrt n). Moreover, such a path can be effectively computed (our construction was randomized, this can be derandomized).

Therefore, a given halfplane HH partitions our point set PP into O(n)O(\sqrt n) sets P1,,PkP_1,\ldots,P_k, where each part is either contained in HH or is disjoint from HH, and forms an interval in our path. Thus we have reduced the problem to O(n)O(\sqrt n) interval problems, which can be solved jointly in time O~(n)\tilde O(\sqrt n).

Note there is a caveat in the above solution: given a halfplane HH we need to be able to efficiently compute the edges of the path that are cut by HH. This can be solved efficiently for d=2d=2 and d=3d=3 (Chazelle, Welzl 1992), but for higher dd, this seems as difficult as the original problem.
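The following sketch illustrates only the reduction to intervals: given the points listed in path order, a halfplane (written as ax+by \le c) splits the path into maximal runs of consecutive vertices, each of which becomes an interval query. The runs are found here by a linear scan, which is exactly the step that the caveat above is about; all names are illustration choices.

```python
from itertools import groupby

def halfplane_count_via_path(path_points, a, b, c):
    """Count points p=(x,y) on the path with a*x + b*y <= c.

    If the path has low crossing number with respect to halfplanes, the
    answer decomposes into few runs of consecutive path vertices.  Here
    the runs are found by a linear scan, only to illustrate the
    decomposition into interval queries.
    """
    inside = [a * x + b * y <= c for (x, y) in path_points]
    runs = []          # maximal runs of consecutive vertices inside the halfplane
    i = 0
    for val, grp in groupby(inside):
        length = len(list(grp))
        if val:
            runs.append((i, i + length - 1))
        i += length
    return sum(r - l + 1 for (l, r) in runs), runs

path = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 0)]
cnt, runs = halfplane_count_via_path(path, 0, 1, 1)   # halfplane y <= 1
print(cnt, runs)   # 3 [(0, 0), (2, 2), (4, 4)]
```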

Simplicial partitions

For higher dd, the solution was given by Matousek (1992), based on cuttings. See Matousek, Geometric range searching, p. 20-21.

Using simplicial partitions, we can construct a partition tree similarly as earlier. Now the recursion is:

T(n)f(r)+κT(2n/r),T(n)\le f(r)+\kappa\cdot T(2n/r),

where κ=O(r11/d)\kappa=O(r^{1-1/d}) is the crossing number of the simplicial partitions and f(r)f(r) is the computation cost spent in each node of the tree, with f(r)O(r)f(r)\le O(r).

We must find an optimal value of rr. If we take a large constant for rr, we get:

T(n)O(n11/d+or(1)).T(n)\le O(n^{1-1/d+o_r(1)}).

By choosing rr as a function of nn appropriately, we get

T(n)O~(n11/d)T(n)\le \tilde O(n^{1-1/d})

and pre-processing time O(nlogn)O(n\log n).

Szemeredi Regularity Lemma

Let GG be a graph and A,BV(G)A,B\subset V(G). Let E(A,B)E(A,B) be the set of edges with one endpoint in AA and one endpoint in BB, and define the density between AA and BB as

d(A,B)=E(A,B)AB.d(A,B)=\frac{|E(A,B)|}{|A||B|}.

Say that the pair (A,B)(A,B) (where AA and BB are not necessarily disjoint) is ε\varepsilon-regular if for every AAA'\subset A and BBB'\subset B, if AεA|A'|\ge \varepsilon |A| and BεB|B'|\ge \varepsilon |B|, then

d(A,B)d(A,B)ε.|d(A',B')-d(A,B)|\le \varepsilon.

Say that a partition P\cal P of V(G)V(G) is ε\varepsilon-regular if

(A,B)ABn2ε,\sum_{(A,B)}\frac{|A||B|}{n^2}\le \varepsilon,

where the sum ranges over all pairs (A,B)P2(A,B)\in \cal P^2 that are not ε\varepsilon-regular (possibly A=BA=B).
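To make the definitions concrete, here is a brute-force Python check of ε-regularity for a pair of vertex sets; it enumerates all sufficiently large subsets, so it is exponential and only usable on tiny examples. The names and the half-graph example are illustration choices.

```python
import math
from itertools import combinations

def density(edges, A, B):
    """d(A,B): fraction of pairs (a,b) in A x B that are edges."""
    e = sum(1 for a in A for b in B if (a, b) in edges or (b, a) in edges)
    return e / (len(A) * len(B))

def subsets_of_size_at_least(S, m):
    for k in range(m, len(S) + 1):
        yield from combinations(S, k)

def is_eps_regular(edges, A, B, eps):
    """Brute-force check of the definition; feasible only for very small A and B."""
    A, B = list(A), list(B)
    d = density(edges, A, B)
    ma = max(1, math.ceil(eps * len(A)))
    mb = max(1, math.ceil(eps * len(B)))
    return all(abs(density(edges, Ap, Bp) - d) <= eps
               for Ap in subsets_of_size_at_least(A, ma)
               for Bp in subsets_of_size_at_least(B, mb))

# A half graph between {a0..a3} and {b0..b3}: ai ~ bj iff i <= j.
edges = {(f"a{i}", f"b{j}") for i in range(4) for j in range(4) if i <= j}
A = [f"a{i}" for i in range(4)]
B = [f"b{j}" for j in range(4)]
print(density(edges, A, B))              # 0.625
print(is_eps_regular(edges, A, B, 0.5))  # False: half graphs are far from regular
```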

Theorem (Szemeredi regularity lemma) For every ε>0\varepsilon>0 there is a number NN such that for every graph GG there is an ε\varepsilon-regular partition of V(G)V(G) with at most NN parts.

For notational convenience, we state (and prove) the result in a more general setting. We will consider graphs that are weighted in two ways:

  - every vertex a\in V carries a weight \mu(a)>0, and \mu is normalized to be a probability measure on V; for A\subseteq V we write \mu(A)=\sum_{a\in A}\mu(a),
  - every pair of vertices carries an edge weight E(a,b)\in[0,1], with E(a,b)=E(b,a); an ordinary graph corresponds to E being the 0/1 indicator of adjacency.

Given (V,E,μ)(V,E,\mu) as above, we may define:

d(A,B):=EaAEbBE(a,b)=aA,bBμ(a)μ(b)E(a,b)μ(A)μ(B).d(A,B):=\mathbb E_{a\in A}\mathbb E_{b\in B}E(a,b)=\frac{\sum_{a\in A,b\in B}\mu(a)\mu(b)E(a,b)}{\mu(A)\mu(B)}.

With this notion, we can define the notion of an ε\varepsilon-regular pair A,BA,B as previously, and say that a partition P\cal P of VV is ε\varepsilon-regular if

(A,B)μ(A)μ(B)ε,\sum_{(A,B)}{\mu(A)\mu(B)}\le \varepsilon,

where the sum ranges over pairs (A,B)P2(A,B)\in\cal P^2 that are not ε\varepsilon-regular. Thus, the probability that a pair (a,b)V2(a,b)\in V^2 belongs to an ε\varepsilon-regular pair (A,B)P2(A,B)\in\cal P^2 is at least 1ε1-\varepsilon.

In the following, fix V,EV, E and μ\mu as above. Our goal is to prove that VV has an ε\varepsilon-regular partition with Oε(1)O_\varepsilon(1) parts.

We will use the following inequality.

Fact Let s1,,sk0s_1,\ldots,s_k\ge 0 and μ1,,μk>0\mu_1,\ldots,\mu_k>0. Then

\frac{(\sum_i s_i)^2}{\sum_i \mu_i}\le \sum_i\frac{ s_i^2}{\mu_i}.

Proof Follows from the Cauchy-Schwarz inequality (iaibi)2iai2ibi2(\sum_i{a_ib_i})^2\le \sum_i a_i^2\cdot \sum_i b_i^2 by taking ai=si/μia_i=s_i/\sqrt{\mu_i} and bi=μib_i=\sqrt{\mu_i}.\square

Lemma Let A,BVA,B\subseteq V, A\cal A be a partition of AA and B\cal B be a partition of BB. Then

d(A,B)2μ(A)μ(B)AA,BBd(A,B)2μ(A)μ(B).d(A,B)^2\mu(A)\mu(B)\le \sum_{A'\in {\cal A},B'\in {\cal B}}d(A',B')^2\mu(A')\mu(B').

Proof By the fact above, we have:

A,Bd(A,B)2μ(A)μ(B)=A,B(aAbBμ(a)μ(b)E(a,b))2μ(A)μ(B)\sum_{A',B'}d(A',B')^2\mu(A')\mu(B') =\sum_{A',B'}\frac{(\sum_{a\in A'}\sum_{b\in B'}\mu(a)\mu(b)E(a,b))^2}{\mu(A')\mu(B')}\ge
(A,BaAbBμ(a)μ(b)E(a,b))2A,Bμ(A)μ(B)=(aA,bBμ(a)μ(b)E(a,b))2μ(A)μ(B)=d(A,B)2μ(A)μ(B).\ge\frac{(\sum_{A',B'}\sum_{a\in A'}\sum_{b\in B'}\mu(a)\mu(b)E(a,b))^2}{\sum_{A',B'}\mu(A')\mu(B')}=\frac{(\sum_{a\in A,b\in B}\mu(a)\mu(b)E(a,b))^2}{\mu(A)\mu(B)}=d(A,B)^2\mu(A)\mu(B).

\square

For a partition P\cal P of V(G)V(G) define the potential of P\cal P as

potential(P):=A,BPd(A,B)2μ(A)μ(B).potential({\cal P}):=\sum_{A,B\in \cal P}d(A,B)^2\mu(A)\mu(B) .

Lemma If Q\cal Q is a refinement of P\cal P then potential(P)potential(Q)potential(\cal P)\le potential(\cal Q).

Proof Follows from the previous lemma, by considering all pairs A,BPA,B\in \cal P and adding up the inequalities (where for A\cal A and B\cal B we take the restrictions of Q\cal Q to AA and BB respectively).\square

Lemma We have:

0potential(P)1.0\le potential(\cal P)\le 1.

Proof Because A,BPμ(A)μ(B)=1\sum_{A,B\in \cal P}\mu(A)\mu(B)=1 and 0d(A,B)10\le d(A,B)\le 1 for all A,BA,B.\square
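A small numerical sketch of these notions (the uniform vertex measure and 0/1 edge weights are choices of this example): it computes the weighted density and the potential of a partition, and illustrates on a complete bipartite graph that refining a partition does not decrease the potential.

```python
from itertools import product

def mu_of(mu, A):
    return sum(mu[a] for a in A)

def d(E, mu, A, B):
    """Weighted density: sum of mu(a)mu(b)E(a,b) over A x B, divided by mu(A)mu(B)."""
    num = sum(mu[a] * mu[b] * E(a, b) for a, b in product(A, B))
    return num / (mu_of(mu, A) * mu_of(mu, B))

def potential(E, mu, P):
    """Sum of d(A,B)^2 mu(A) mu(B) over ordered pairs of parts of the partition P."""
    return sum(d(E, mu, A, B) ** 2 * mu_of(mu, A) * mu_of(mu, B)
               for A, B in product(P, P))

V = list(range(8))
mu = {v: 1 / len(V) for v in V}                        # uniform probability measure
E = lambda u, v: 1 if (u < 4) != (v < 4) else 0        # complete bipartite between {0..3} and {4..7}

coarse = [V]
fine = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(potential(E, mu, coarse))   # 0.25
print(potential(E, mu, fine))     # 0.5 -- refining never decreases the potential
```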

Lemma If a pair (A,B)(A,B) is not ε\varepsilon-regular then there is a partition A=A1A2A=A_1\uplus A_2 and B=B1B2B=B_1\uplus B_2 such that:

i,j{1,2}d(Ai,Bj)2μ(Ai)μ(Bj)μ(A)μ(B)d(A,B)2+ε4\sum_{i,j\in\{1,2\}}\frac{d(A_i,B_j)^2\mu(A_i)\mu(B_j)}{\mu(A)\mu(B)}\ge d(A,B)^2+\varepsilon^4

Proof Since (A,B)(A,B) is not ε\varepsilon-regular, there are sets A1AA_1\subset A, B1BB_1\subset B such that μ(A1)εμ(A)\mu(A_1)\ge \varepsilon \mu(A) and μ(B1)εμ(B)\mu(B_1)\ge \varepsilon\mu(B) and

d(A1,B1)d(A,B)>ε.|d(A_1,B_1)-d(A,B)|>\varepsilon.

Set A_2:=A-A_1, B_2:=B-B_1, and note that \delta:=d(A,B)=\sum_{i,j}\frac{\mu(A_i)\mu(B_j)}{\mu(A)\mu(B)}d(A_i,B_j). Then we have:

ε4μ(A1)μ(A)μ(B1)μ(B)(d(A1,B1)δ)2\varepsilon^4\le \frac{\mu(A_1)}{\mu(A)}\frac{\mu(B_1)}{\mu(B)}\cdot (d(A_1,B_1)-\delta)^2\le
i,jμ(Ai)μ(Bj)μ(A)μ(B)(d(Ai,Bj)δ)2=\le \sum_{i,j}\frac{\mu(A_i)\mu(B_j)}{\mu(A)\mu(B)}\cdot(d(A_i,B_j)-\delta)^2=
=i,jμ(Ai)μ(Bj)μ(A)μ(B)(d(Ai,Bj)22δd(Ai,Bj)+δ2)==\sum_{i,j}\frac{\mu(A_i)\mu(B_j)}{\mu(A)\mu(B)}\left(d(A_i,B_j)^2-2\delta d(A_i,B_j)+\delta^2\right)=
=i,jμ(Ai)μ(Bj)μ(A)μ(B)d(Ai,Bj)22δi,jμ(Ai)μ(Bj)μ(A)μ(B)d(Ai,Bj)δ+i,jμ(Ai)μ(Bj)μ(A)μ(B)1δ2==\sum_{i,j}\frac{\mu(A_i)\mu(B_j)}{\mu(A)\mu(B)}d(A_i,B_j)^2-2\delta \underbrace{\sum_{i,j}\frac{\mu(A_i)\mu(B_j)}{\mu(A)\mu(B)}d(A_i,B_j)}_\delta+\underbrace{\sum_{i,j}\frac{\mu(A_i)\mu(B_j)}{\mu(A)\mu(B)}}_{1}\delta^2=
=i,jμ(Ai)μ(Bj)μ(A)μ(B)d(Ai,Bj)2δ2.=\sum_{i,j}\frac{\mu(A_i)\mu(B_j)}{\mu(A)\mu(B)}d(A_i,B_j)^2-\delta^2.

This proves the lemma.\square

Lemma Suppose P\cal P is not ε\varepsilon-regular. Then there is a refinement Q\cal Q of P\cal P such that QP4P|\cal Q|\le |\cal P|\cdot 4^{|\cal P|} and

potential(Q)potential(P)+ε5.potential(\cal Q)\ge potential(\cal P)+\varepsilon^5.

Proof Let

I={{A,B}:A,BP,(A,B) not ε-regular}I=\{\{A,B\}:A,B\in\cal P, (A,B)\text{ not }\varepsilon\text{-regular}\}
be the set of irregular pairs in P\cal P. For each {A,B}I\{A,B\}\in I consider the partition A=A1ABA2ABA=A_1^{AB}\uplus A_2^{AB} and B=B1ABB2ABB=B_1^{AB}\uplus B_2^{AB} from the previous lemma, so that

\sum_{i,j\in\{1,2\}}\mu(A_i^{AB})\mu(B_j^{AB})d(A_i^{AB},B_j^{AB})^2\ge (d(A,B)^2+\varepsilon^4)\mu(A)\mu(B).

Consider the refinement \cal Q of \cal P, which is the partition into the cells of the Venn diagram of the family of sets {\cal P}\cup \bigcup_{\{A,B\}\in I}\{A_1^{AB},A_2^{AB},B_1^{AB},B_2^{AB}\}. Then every A\in\cal P is refined into at most 2^{2|\cal P|} parts, so |{\cal Q}|\le |{\cal P}|\cdot 2^{2|\cal P|}. For every ordered pair (A,B)\in{\cal P}^2 with \{A,B\}\in I we have:

\sum_{\substack{A',B'\in \cal Q\\A'\subseteq A,B'\subseteq B}}\mu(A')\mu(B')d(A',B')^2\ge (d(A,B)^2+\varepsilon^4)\mu(A)\mu(B),

and for A,BPA,B\in \cal P such that {A,B}I\{A,B\}\notin I we have

\sum_{\substack{A',B'\in \cal Q\\A'\subseteq A,B'\subseteq B}}\mu(A')\mu(B')d(A',B')^2\ge (d(A,B)^2+0)\cdot\mu(A)\mu(B).

Summing those inequalities for all A,BPA,B\in\cal P, we get:

\sum_{A',B'\in\cal Q}\mu(A')\mu(B')d(A',B')^2\ge \sum_{A,B\in\cal P}d(A,B)^2\mu(A)\mu(B)+\sum_{\substack{(A,B)\in\cal P^2\\ \{A,B\}\in I}}\varepsilon^4\mu(A)\mu(B)
=potential({\cal P})+\varepsilon^4\sum_{\substack{(A,B)\in\cal P^2\\ \{A,B\}\in I}}\mu(A)\mu(B).

The sum in the last expression is \ge \varepsilon: it ranges over all ordered pairs (A,B)\in\cal P^2 that are not \varepsilon-regular, and \cal P is not \varepsilon-regular. Therefore we get potential(\cal Q)\ge potential(\cal P)+\varepsilon^5.\square

Proof of Szemeredi regularity lemma. Fix \varepsilon>0 and let G be a graph. Let \cal P_0 be the partition of V(G) with one part, and for i\ge 1, if \cal P_{i-1} is not already \varepsilon-regular, let \cal P_i be the refinement of \cal P_{i-1} obtained from the previous lemma. Then |\cal P_{i}|\le f(i) for some fixed function f (which is bounded by a tower of 4's of height polynomial in i), and potential(\cal P_i)\ge i\cdot \varepsilon^5. Since potential(\cal P_i)\le 1 for all i, the sequence \cal P_0,\cal P_1,\ldots is finite, with at most 1/\varepsilon^5 refinement steps. Its last element is therefore an \varepsilon-regular partition (otherwise the sequence could be extended), of size at most f(1/\varepsilon^5). \square

A stronger statement

We now formulate and prove a slightly stronger statement, which is convenient in applications: in addition to regularity, one can prescribe a lower bound on the number of parts and require all parts, apart from one, to have equal size.

Theorem (variant of Szemeredi regularity lemma) For every ε>0\varepsilon>0 and mNm\in\N there is a number N=N(ε,m)N=N(\varepsilon,m) such that for every graph GG there is an ε\varepsilon-regular partition P\cal P of V(G)V(G) with mPNm\le |{\cal P}|\le N and such that all parts of P\cal P, apart from one, have equal size.

The proof is very similar to the previous proof, but in each step of the refinement process we additionally reorganize the obtained partition to obtain a partition into parts of equal size (apart from one).

Regularity for graphs of bounded VC-dimension

There are a couple of improvements of the regularity lemma that one could wish to have:

  1. improve the bound on the number NN of parts, which in the proof above is a tower of exponentials of height polynomial in 1ε\frac{1}{\varepsilon}
  2. remove irregular pairs
  3. have all pairs (apart from a small set of irregular pairs) with density ε\le \varepsilon or 1ε\ge 1-\varepsilon
  4. have all pairs (apart from a small set of irregular pairs) with density 00 or 11.

It was proven by Gowers that the function NN cannot be improved in general (it needs to be a tower of exponentials of height polynomial in 1ε\frac{1}{\varepsilon}). Also, we will see that one cannot get rid of the irregular pairs. We will see that for graphs of bounded VC-dimension, we can get both the first and third improvement. We will also see that for stable graphs, we can furthermore get the second improvement.

Say that a pair A,BA,B of sets of vertices is ε\varepsilon-homogeneous if d(A,B)εd(A,B)\le \varepsilon or d(A,B)1εd(A,B)\ge 1-\varepsilon.

Lemma If A,BA,B is ε3\varepsilon^3-homogeneous then A,BA,B is ε\varepsilon-regular.

Proof Exercise.

We will prove the following.

Theorem Fix reals c,d1c,d\ge 1 and 0<ε<1/40<\varepsilon<1/4. Let GG be a graph with πN(G)(m)cmd\pi_{{\cal N}(G)}(m)\le c m^d for all mNm\in\N. There is a partition P\cal P of V(G)V(G) of size O(1/ε2d+1)O(1/\varepsilon^{2d+1}), with all parts of equal size (±1\pm 1), such that at most εP2\varepsilon |\cal P|^2 pairs (A,B)P2(A,B)\in\cal P^2 are not ε\varepsilon-homogeneous.

For a pair A,BA,B of subsets of V(G)V(G), let TABT_{AB} be the set of triples (a,b,b)A×B×B(a,b,b')\in A\times B\times B such that abE(G)ab\in E(G) and abE(G)ab'\notin E(G).

Lemma For a pair A,BA,B of subsets of V(G)V(G) with A=B=m|A|=|B|=m that is not ε\varepsilon-homogeneous, we have that

TAB+TBAε(1ε)m3.|T_{AB}|+ |T_{BA}|\ge \varepsilon(1-\varepsilon)\cdot m^3.

Proof Let εA=TAB/m3\varepsilon_A=|T_{AB}|/m^3 and εB=TBA/m3\varepsilon_B=|T_{BA}|/m^3. Our goal is to prove that εA+εBε(1ε)\varepsilon_A+\varepsilon_B\ge \varepsilon(1-\varepsilon).

Pick a,a'\in A and b,b'\in B independently and uniformly at random. Since the pair (A,B) is not \varepsilon-homogeneous, the density p:=d(A,B) satisfies \varepsilon<p<1-\varepsilon, so

2\varepsilon(1-\varepsilon)\le 2p(1-p)=\mathbb P[ab\in E\text{ xor }a'b'\in E].

Note that the event ab\in E\text{ xor }a'b'\in E implies that at least one of the two following events holds:

  - ab\in E xor ab'\in E,
  - ab'\in E xor a'b'\in E.

These events have probabilities at most 2\varepsilon_A and 2\varepsilon_B, respectively (each triple in T_{AB}, resp. T_{BA}, is counted twice, once for each order of the differing pair). Hence, by the union bound, we have that 2\varepsilon(1-\varepsilon)\le \mathbb P[ab\in E\text{ xor }a'b'\in E]\le 2(\varepsilon_A+\varepsilon_B), which gives the required conclusion.\square

Proof of the theorem

Let

δ:=ε2n16.\delta:=\frac{\varepsilon^2n}{16}.

Claim. There is a partition Q\cal Q of V(G)V(G) such that QO(nδ)d|{\cal Q}|\le O(\frac n\delta)^d and N(u)N(v)2δ|N(u)\triangle N(v)|\le 2\delta for all u,vV(G)u,v\in V(G) that belong to the same part of Q\cal Q.

Namely, greedily find v1,v2,V(G)v_1,v_2,\ldots\in V(G) such that N(vi)N(vj)δ|N(v_i)\triangle N(v_j)|\ge \delta for all iji\ne j, until no longer possible. Haussler's packing theorem says that this will stop after at most O(nδ)dO(\frac n \delta)^d steps. For every vV(G)v\in V(G) there is then some ii such that N(v)N(vi)δ|N(v)\triangle N(v_i)|\le \delta. Therefore, there is a partition A1,A2,A_1,A_2,\ldots of V(G)V(G) of size at most O(nδ)dO(\frac n \delta)^d such that any two vertices u,vu,v in the same part satisfy N(u)N(v)2δ|N(u)\triangle N(v)|\le 2\delta, by the triangle inequality. This proves the claim.
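The greedy construction from the claim is easy to implement. In the following sketch the graph is given as a dictionary mapping each vertex to its set of neighbours; the names and the example graph are illustration choices, and the bound on the number of classes is exactly the one provided by Haussler's packing theorem.

```python
def neighbourhood_partition(adj, delta):
    """Greedy packing partition from the claim above.

    Centres are chosen greedily to be pairwise at neighbourhood symmetric
    difference >= delta; by maximality, every vertex is within symmetric
    difference < delta of some centre, so two vertices in the same class
    are at symmetric difference at most 2*delta (triangle inequality).
    """
    centres = []
    for v in adj:
        if all(len(adj[v] ^ adj[c]) >= delta for c in centres):
            centres.append(v)
    classes = {c: [] for c in centres}
    for v in adj:
        c = next(c for c in centres if len(adj[v] ^ adj[c]) < delta)
        classes[c].append(v)
    return list(classes.values())

# Example: a half graph with parts a_0..a_7 and b_0..b_7, where a_i ~ b_j iff i <= j.
n = 8
adj = {("a", i): {("b", j) for j in range(n) if i <= j} for i in range(n)}
adj.update({("b", j): {("a", i) for i in range(n) if i <= j} for j in range(n)})
for part in neighbourhood_partition(adj, delta=3):
    print(sorted(part))
```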

Let

K:=16Qε.K:=\frac{16|\cal Q|}{\varepsilon}.
Let \cal P be a partition of V(G) into K parts, each of size n/K, such that each part of \cal P, apart from at most |\cal Q| of them, is contained in some part of \cal Q. This can be achieved by greedily chopping each part of \cal Q into parts of size n/K, and rearranging the leftovers into parts of size n/K. Let \cal R denote those parts of \cal P that are not contained in any part of \cal Q, so that |\cal R|\le |\cal Q|.

Let

I=\{(A,B): A,B\in \cal P-\cal R,\ (A,B) \text{ is not }\varepsilon\text{-homogeneous}\}

We will prove that

I23εK2.|I|\le \frac 23 \varepsilon \cdot K^2.
This will prove the theorem, since:

  - the pairs (A,B)\in{\cal P}^2 with A\in\cal R or B\in\cal R number at most 2|{\cal R}|K\le 2|{\cal Q}|K=\frac{\varepsilon}{8}K^2, so in total at most \frac 23\varepsilon K^2+\frac 18\varepsilon K^2\le\varepsilon K^2=\varepsilon|{\cal P}|^2 pairs are not \varepsilon-homogeneous,
  - the number of parts is K=\frac{16|{\cal Q}|}{\varepsilon}\le \frac{16}{\varepsilon}\cdot O(\frac{n}{\delta})^d=\frac{16}{\varepsilon}\cdot O(\frac{16}{\varepsilon^2})^d=O(1/\varepsilon^{2d+1}).

Proof of the bound on |I|. Let T=\bigcup\{T_{AB}:(A,B)\in I\}; this union is disjoint, since the parts of \cal P are disjoint. Recall that if v,v' lie in the same part of \cal P-\cal R, then |N(v)\triangle N(v')|\le 2\delta. Moreover, there are K parts, each of size n/K. For a triple (a,b,b')\in T the vertices b,b' lie in the same part of \cal P-\cal R and a\in N(b)\triangle N(b'), so |T|\le 2\delta\cdot (n/K)^2\cdot K=\frac {1}8\frac {\varepsilon^2n^3}{K}.

On the other hand, by the lemma we proved earlier, |T_{AB}|+|T_{BA}|\ge \varepsilon(1-\varepsilon)(\frac{n}{K})^3 for every (A,B)\in I; note that (A,B)\in I if and only if (B,A)\in I. Hence,

\frac {1}8\frac {\varepsilon^2n^3}{K}\ge |T|= \sum_{(A,B)\in I}|T_{AB}|=\frac 12\sum_{(A,B)\in I}\left(|T_{AB}|+|T_{BA}|\right)\ge \frac{|I|}{2}\cdot \varepsilon(1-\varepsilon)\left(\frac{n}{K}\right)^3.

Rearranging, and using \varepsilon<1/4 (so that 1-\varepsilon>3/4), we get |I|\le \frac{\varepsilon K^2}{4(1-\varepsilon)}\le \frac 13 \varepsilon K^2\le \frac 2 3 \varepsilon K^2. \square

Stable regularity

A ladder in a graph GG of length kk consists of two sequences a1,,aka_1,\ldots,a_k and b1,,bkb_1,\ldots,b_k of vertices of GG, such that

aibjE(G)    ij.a_ib_j\in E(G)\iff i\le j.

Note that we pose no requirements on the adjacencies between aia_i and aja_j, nor between bib_i and bjb_j. The ladder index of a graph GG is the maximal length of a ladder in GG.
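The definition is easy to test by brute force on small graphs. The following sketch (with names chosen for illustration) checks the ladder condition and computes the ladder index by exhaustive search over sequences of distinct vertices; this is exponential and only meant to illustrate the definition.

```python
from itertools import permutations

def is_ladder(adj, a_seq, b_seq):
    """Check the ladder condition: a_i is adjacent to b_j iff i <= j."""
    k = len(a_seq)
    return all((b_seq[j] in adj[a_seq[i]]) == (i <= j)
               for i in range(k) for j in range(k))

def ladder_index(adj, max_k):
    """Largest k <= max_k admitting a ladder (brute force over distinct-vertex sequences)."""
    V = list(adj)
    best = 0
    for k in range(1, max_k + 1):
        if any(is_ladder(adj, a, b)
               for a in permutations(V, k) for b in permutations(V, k)):
            best = k
        else:
            break   # a ladder of length k contains ladders of every smaller length
    return best

# A half graph on {1,2,3} x {4,5,6}: i ~ j iff i <= j - 3.
adj = {v: set() for v in range(1, 7)}
for i in range(1, 4):
    for j in range(4, 7):
        if i <= j - 3:
            adj[i].add(j)
            adj[j].add(i)
print(ladder_index(adj, max_k=3))   # 3
```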

Our goal is to prove that graphs GG with ladder index bounded by a fixed constant \ell enjoy a particularly well-behaved regularity lemma.

Theorem (Malliaris, Shelah, 2011) For every \varepsilon>0 and \ell\in\N there is a number N=N(\varepsilon,\ell) such that every sufficiently large graph G of ladder index at most \ell has a partition into parts of equal size (\pm 1) into at most N parts, such that:

  - every pair (A,B) of parts is \varepsilon-regular, with density d(A,B)\le \varepsilon or d(A,B)\ge 1-\varepsilon; in particular, there are no irregular pairs.

We first develop a key tool for working with graphs of bounded ladder index.

Branching index

A tree configuration of depth dd in GG is a complete binary tree TT of depth dd with inner nodes labeled by distinct vertices of GG and leaves labeled by distinct vertices of GG, with the following property. For any inner node uu and its descendant leaf vv,

uvE(G)    v is a right descendant of u.uv\in E(G)\iff v \text{ is a right descendant of }u.

Here, when writing uvE(G)uv\in E(G) we implicitly refer to the corresponding vertices of GG. Therefore, uu is adjacent (in GG) to all its right descendant leaves, and non-adjacent to all its left descendant leaves.

The largest depth dd of a tree configuration in GG is the branching index of GG. The branching index and the ladder index can be bounded in terms of each other.

Lemma If GG contains a ladder of length \ell then GG has branching index at least log2()\log_2(\ell).

Proof Suppose we have a ladder a_1,\ldots,a_\ell, b_1,\ldots,b_\ell. Construct a balanced binary tree whose leaves, from left to right, are labeled with b_1,\ldots,b_\ell, and label every inner node by a_i, where b_i is the left-most leaf that is a descendant of the right child of that node. This is a tree configuration of depth at least \log_2(\ell). \square

Proposition If GG has a tree configuration of depth 2k+122^{k+1}-2 then GG contains a ladder of length kk.

A set A of inner nodes of T contains a tree configuration of depth a if there is a subset A'\subseteq A such that A', with the ancestor relation inherited from T, is isomorphic to the set of inner nodes of a complete binary tree of depth a. Note that if this is the case, then G has a tree configuration of depth a whose inner nodes are the elements of A' (and whose leaves are suitably chosen leaves of T).

Lemma If TT is a tree configuration of depth a+ba+b and the inner nodes of TT are partitioned into two sets ABA\uplus B, then either AA contains a tree configuration of depth aa, or BB contains a tree configuration of depth bb.

Proof of Proposition

A configuration CC_\ell is:

(CC_\ell):

a1a2aq1aqa1ab1b2bq1bqb1bT\begin{matrix} a_1&a_2&\ldots& a_{q-1}&\cdot & a_q&\ldots&a_{\ell-1}&a_\ell\\ b_1&b_2&\ldots& b_{q-1}&\cdot & b_q&\ldots&b_{\ell-1}&b_\ell\\ && & &T && & & \end{matrix}

such that a1,,aa_1,\ldots,a_\ell, b1,,bb_1,\ldots,b_\ell form a ladder of length \ell, TT is a tree configuration of depth 2k+122^{k+1-\ell}-2, and 0q0\le q\le \ell is a position, such that:

We prove that there is a configuration CC_\ell, by induction on =0,1,,k\ell=0,1,\ldots,k. We have a configuration C0C_0 by assumption. The existence of a configuration CkC_k yields the conclusion of the proposition.

Suppose we have a configuration CC_\ell as above; we prove that then we have a configuration C+1C_{\ell+1}.

Let bb be the root of TT, and bb' be its right child. Fix a leaf aa of TT which is a right descendant of bb'. Let T0T_0 denote the subtree of TT rooted at the left child of bb'. Let NN denote the set of inner nodes of T0T_0 that are neighbors of aa in GG, and let NN' denote the set of inner nodes of T0T_0 that are non-neighbors of aa in GG.

By the previous lemma we have that either NN or NN' contains a tree configuration of depth 2k22^{k-\ell}-2. This is because 2(2k2)=2k+14=height(T)2=height(T0)2(2^{k-\ell}-2)=2^{k-\ell+1}-4=\textit{height}(T)-2=\textit{height}(T_0).

Suppose first that NN contains a tree configuration of depth 2k22^{k-\ell}-2. We can then find a tree configuration TT' of depth 2k22^{k-\ell}-2, such that

a1a2aq1aaqa1ab1b2bq1bbqb1bT\begin{matrix} a_1&a_2&\ldots& a_{q-1}&a&\cdot & a_q&\ldots&a_{\ell-1}&a_\ell\\ b_1&b_2&\ldots& b_{q-1}&b'&\cdot & b_q&\ldots&b_{\ell-1}&b_\ell\\ && & &&T' && & & \end{matrix}

is a configuration C+1C_{\ell+1}.

Otherwise, by the previous lemma, we have that the set NN' of inner nodes of T0T_0 that are non-neighbors of aa contains a tree configuration TT' of depth 2k22^{k-\ell}-2. Then we have the following configuration C+1C_{\ell+1}:

a1a2aq1aaqa1ab1b2bq1bbqb1bT\begin{matrix} a_1&a_2&\ldots& a_{q-1}&\cdot&a & a_q&\ldots&a_{\ell-1}&a_\ell\\ b_1&b_2&\ldots& b_{q-1}&\cdot&b & b_q&\ldots&b_{\ell-1}&b_\ell\\ && & &T'& && & & \end{matrix}

\square

Regularity for stable graphs

We follow the proof from here.

Fix 12>ε>0\frac 1 2>\varepsilon>0. A set AA of vertices of a graph GG is ε\varepsilon-good if for every vertex vv of GG, either N(v)AεA|N(v)\cap A|\le \varepsilon |A|, or N(v)A(1ε)A|N(v)\cap A|\ge (1-\varepsilon) |A|. Thus, every vertex vv has an `opinion' about AA, which is defined as

opinion(v,A)={0if N(v)AεA1if N(v)A(1ε)A.opinion(v,A)=\begin{cases}0 &\text{if }|N(v)\cap A|\le \varepsilon |A|\\ 1 &\text{if }|N(v)\cap A|\ge (1-\varepsilon) |A|. \end{cases}

For example, every singleton set is clearly ε\varepsilon-good, but it is not obvious that large ε\varepsilon-good sets exist.

We say that a set AA is ε\varepsilon-excellent if for every ε\varepsilon-good set BB, we have that all but at most εA\varepsilon|A| vertices of AA have the same opinion about BB. That is, there is a set XA,BAX_{A,B}\subset A with XA,BεA|X_{A,B}|\le \varepsilon |A| such that opinion(v,B)opinion(v,B) is equal to some constant c{0,1}c\in\{0,1\}, for all vAXA,Bv\in A-X_{A,B}. We then define opinion(A,B){0,1}opinion(A,B)\in\{0,1\} as cc.

Note that every ε\varepsilon-excellent set is ε\varepsilon-good (take BB to be singletons).
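The notions of opinion and ε-goodness translate directly into code. Checking ε-excellence exactly would require quantifying over all ε-good sets, so the following sketch (names chosen for illustration) only computes opinions and tests goodness, on a small half graph.

```python
def opinion(adj, v, A, eps):
    """Opinion of v about A: 0 if few neighbours in A, 1 if almost all, None otherwise."""
    k = len(adj[v] & set(A))
    if k <= eps * len(A):
        return 0
    if k >= (1 - eps) * len(A):
        return 1
    return None

def is_good(adj, A, eps):
    """A is eps-good if every vertex of the graph has an opinion about A."""
    return all(opinion(adj, v, A, eps) is not None for v in adj)

# Half graph on {1,2,3} x {4,5,6}: i ~ j iff i <= j - 3.
adj = {v: set() for v in range(1, 7)}
for i in range(1, 4):
    for j in range(4, 7):
        if i <= j - 3:
            adj[i].add(j)
            adj[j].add(i)

print(is_good(adj, [4, 5, 6], eps=0.4))                        # True
print([opinion(adj, v, [4, 5, 6], 0.4) for v in sorted(adj)])  # [1, 1, 0, 0, 0, 0]
```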

A stronger version of the Malliaris-Shelah theorem is the following.

Theorem (Malliaris, Shelah, 2011) For every \varepsilon>0 and \ell\in\N there is a number N=N(\varepsilon,\ell) such that every sufficiently large graph G of ladder index at most \ell has a partition into parts of equal size (\pm 1) into at most N parts, such that:

  - every part of the partition is \varepsilon-excellent.

This statement implies the previous one, as we have the following lemma.

Lemma For every 0<ε<120<\varepsilon<\frac 1 2 there exists δ=2ϵ\delta=\sqrt{2\epsilon} such that every pair (A,B)(A,B) of δ\delta-excellent sets is ε\varepsilon-regular, with edge density ε\le\varepsilon or (1ε)\ge (1-\varepsilon).

Now, the crux of the proof of the theorem is the following lemma, stating that large \varepsilon-excellent sets can be found inside any set: every set A contains an \varepsilon-excellent subset A' whose size is at least a constant fraction of |A|.

Lemma Suppose dd is the maximal depth of a tree configuration in GG, and ε<12d\varepsilon<\frac 1 {2^d}. For every AV(G)A\subset V(G) there exists an ε\varepsilon-excellent subset AAA'\subset A with Aεd1A|A'|\ge \varepsilon^{d-1}|A|.

Proof Level by level, for each node v of the complete binary tree of depth d+1, we attempt to label v by two sets, A_v and B_v, where:

  - for the root r we set A_r:=A,
  - if A_v is not \varepsilon-excellent, we pick an \varepsilon-good set B_v witnessing this, and for the two children v_0 (left) and v_1 (right) of v we let A_{v_0} and A_{v_1} be the sets of vertices of A_v whose opinion about B_v is 0 and 1, respectively; both have size greater than \varepsilon|A_v|,
  - if A_v is \varepsilon-excellent, the process stops and we output A':=A_v.

Note that |A_{v_i}|\ge \varepsilon |A_v| by construction. Therefore, |A_v|\ge |A|\cdot \varepsilon^s for nodes at distance s from the root; in particular, if the process stops at some inner node, the set it outputs is \varepsilon-excellent and has the required size.

If we can find such a labelling up to the last level, we get a contradiction as follows. First, label each leaf v by any vertex a_v\in A_v.

Consider an inner node u and a descendant leaf v, labelled by a_v. Since B_u is \varepsilon-good, a_v has an opinion about B_u; by the construction of the sets A_w, this opinion is 1 if v descends from the right child of u and 0 if it descends from the left child. Let X_{uv}\subset B_u denote the set of vertices of B_u whose adjacency to a_v disagrees with this opinion, so that |X_{uv}|\le \varepsilon|B_u|. Let X_u=\bigcup_{v}X_{uv}, the union ranging over all leaf descendants v of u. Then |X_u|<|B_u|, because \varepsilon<\frac 1 {2^d} and u has at most 2^d leaf descendants. Label u by any element of B_u-X_u. We obtain a tree configuration of depth d+1, a contradiction.\square

The rest of the proof of the regularity theorem of Malliaris and Shelah roughly proceeds by starting with R=VR=V, and repeatedly (for i=1,2,...i=1,2,...) finding a maximal size subset AiA_i of RR that is ε\varepsilon-excellent, and then repeating the same with RR replaced by RAiR-A_i, until RR becomes sufficiently small. We then add RR to A1A_1, which does not spoil its ε\varepsilon-excellence, since A1A_1 is very large relative to RR. There are some additional difficulties that need to be overcome in order to get a partition into parts of equal size. See this paper for more details.

Permutation classes

Sources: articles by Wojciech Przybyszewski, part 1 and part 2.