Selected topics in combinatorics revolving around Vapnik-Chervonenkis dimension

Spring 2023

Lecturer: Szymon Toruńczyk

What is this lecture about? A big portion of the lectures will be concerned with the Vapnik-Chervonenkis dimension. This fundamental notion has numerous applications in computational learning theory, combinatorics, logic, computational geometry, and theoretical computer science. We will develop the foundations of the theory and study its applications.

Outline

  1. Set systems, Vapnik-Chervonenkis dimension, Sauer-Shelah-Perles lemma
  2. Vapnik-Chervonenkis theorem, fundamental theorem of PAC learning, ε-nets
  3. Helly-type properties, (p,q)-theorem
  4. Sample compression schemes
  5. Haussler's packing theorem
  6. Zarankiewicz's problem, Kovari-Sos-Turan theorem, incidence geometry, Szemerédi-Trotter theorem, Kovari-Sos-Turan for bounded VC-dimension
  7. Matchings with low crossing number
  8. Geometric range searching
  9. Szemerédi regularity
  10. Regularity theorem for graphs of bounded VC-dimension
  11. Regularity theorem for stable graphs
  12. Permutation classes.

Literature

Additional sources

Specific topics

VC-dimension

Set systems

Definition: A set system consists of a domain $X$ (a set) and a family $\mathcal F\subseteq P(X)$ of subsets of $X$. For $Y\subseteq X$, write $\mathcal F\cap Y$ for the set system with domain $Y$ and family $\{F\cap Y: F\in\mathcal F\}$.

Remark A set system is the same as a hypergraph.

Examples:

Usually $X=\bigcup\mathcal F$; in that case we just mention $\mathcal F$.

VC-dimension and shatter function

The goal is to define a measure of complexity of a set system. Machine learning intuition: understand which concept classes are learnable. When does Ockham's razor work well?

Definition (VC-dimension) [Vapnik, Chervonenkis, 71] Let $\mathcal F$ be a set system with domain $X$. A set $Y\subseteq X$ is shattered by $\mathcal F$ if $\mathcal F\cap Y=2^Y$. The VC-dimension of $\mathcal F$ is the maximum size $d$ of a set $Y\subseteq X$ shattered by $\mathcal F$, or $\infty$ if arbitrarily large sets are shattered.

Definition (Shatter function). Let $\mathcal F$ be a set system on $X$. Let

$$\pi_{\mathcal F}(m)=\max_{Y\subseteq X,\ |Y|\le m}|\mathcal F\cap Y|.$$
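The notions above can be checked by brute force on small systems. Below is a minimal Python sketch; the helper names (`restriction`, `vc_dimension`, `shatter_function`) and the example of intervals on a 5-element domain are ours, not from the lecture:

```python
from itertools import combinations

def restriction(family, Y):
    """All distinct traces F ∩ Y of the family on the set Y."""
    return {frozenset(F & Y) for F in family}

def shatters(family, Y):
    return len(restriction(family, Y)) == 2 ** len(Y)

def vc_dimension(family, domain):
    """Largest size of a subset of the domain shattered by the family."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(family, set(Y)) for Y in combinations(domain, k)):
            d = k
    return d

def shatter_function(family, domain, m):
    """pi_F(m); the trace count is monotone in Y, so subsets of size
    exactly min(m, |domain|) suffice."""
    k = min(m, len(domain))
    return max(len(restriction(family, set(Y)))
               for Y in combinations(domain, k))

# Intervals on a 5-element linear order: VC-dimension 2, pi(3) = 7.
domain = list(range(5))
intervals = [set(range(i, j)) for i in range(6) for j in range(i, 6)]
```

For the intervals above, `vc_dimension(intervals, domain)` is 2: two points are shattered, but no interval can keep the two outer points of a triple while excluding the middle one.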

Examples

Observation If $\mathcal F$ has infinite VC-dimension then $\pi_{\mathcal F}(m)=2^m$ for all $m$.

Definition (Dual set system). Let $\mathcal F$ be a set system with domain $X$. The dual system $\mathcal F^*$ is the system with domain $\mathcal F$ and family $\{x^*:x\in X\}$, where $x^*=\{F\in\mathcal F: x\in F\}\subseteq\mathcal F$.

Another view on set systems: bipartite graphs $(L,R,E)$, $E\subseteq L\times R$ (multiplicities are ignored). The dual is then $(R,L,E^{-1})$. In particular, $(\mathcal F^*)^*\cong\mathcal F$, up to removing twins (elements of the domain that belong to exactly the same sets).

$\pi_{\mathcal F^*}$ is denoted $\pi^*_{\mathcal F}$ and is called the dual shatter function of $\mathcal F$. $VC(\mathcal F^*)$ is denoted $VC^*(\mathcal F)$ and is called the dual VC-dimension of $\mathcal F$.

$VC^*(\mathcal F)$ is the maximal number of sets in $\mathcal F$ such that every cell in their Venn diagram is occupied by some element of the domain. $\pi^*_{\mathcal F}(m)$ is the maximal number of occupied cells in the Venn diagram of ${\le}m$ sets from $\mathcal F$.
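The dual shatter function can be computed directly by counting occupied Venn cells. A sketch (the function name is ours), reusing the intervals example:

```python
from itertools import combinations

def dual_shatter_function(family, domain, m):
    """pi*_F(m): maximal number of occupied cells in the Venn diagram of
    m sets; a cell is the signature (x in F_1, ..., x in F_m) of a point."""
    m = min(m, len(family))
    best = 0
    for Fs in combinations(family, m):
        cells = {tuple(x in F for F in Fs) for x in domain}
        best = max(best, len(cells))
    return best

domain = list(range(5))
intervals = [set(range(i, j)) for i in range(6) for j in range(i, 6)]
```

For two intervals all four cells can be occupied (e.g. $[0,1]$ and $[1,2]$ on the points $0,1,2,3$), so `dual_shatter_function(intervals, domain, 2)` is 4.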

Example If $\mathcal F$ is the family of half-planes in $\mathbb R^2$, then $\pi^*_{\mathcal F}(m)$ is the maximal number of regions into which $m$ half-planes can partition the plane. For example, $\pi^*_{\mathcal F}(2)=4$ and $\pi^*_{\mathcal F}(3)=7$.

Exercise Prove that $\pi^*_{\mathcal F}(m)=\binom{m+1}2+1$.

Exercise If $VC(\mathcal F)\le d$ then $VC^*(\mathcal F)<2^{d+1}$.

In all examples, we have either a polynomial or an exponential shatter function – nothing in between.

Sauer-Shelah-Perles Lemma. [Sauer (72), Shelah and Perles (72), Vapnik-Chervonenkis (71)] Let $\mathcal F$ be a set system on $n$ elements of VC-dimension at most $d$. Then

$$|\mathcal F|\le\binom n0+\ldots+\binom nd=:\Phi_d(n).$$

It follows that for all $m$,

$$\pi_{\mathcal F}(m)\le\Phi_d(m)\le\left(\frac{em}d\right)^d=O(m^d).$$

Note. This bound is tight. Consider the set system $\mathcal F$ of all ${\le}d$-element subsets of an infinite set $X$. Clearly $\pi_{\mathcal F}(m)=\Phi_d(m)$.
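One can also test the lemma empirically: random set systems never beat the bound. A quick Python check (the brute-force `vc_dim` helper and all parameters are our choices):

```python
import random
from itertools import combinations
from math import comb

def phi(d, n):
    """Phi_d(n) = C(n,0) + ... + C(n,d)."""
    return sum(comb(n, k) for k in range(d + 1))

def vc_dim(family, domain):
    """Largest size of a shattered subset of the domain (brute force)."""
    d = 0
    for k in range(1, len(domain) + 1):
        for Y in combinations(domain, k):
            if len({frozenset(F & set(Y)) for F in family}) == 2 ** k:
                d = k
                break
    return d

# Random families of subsets of a 7-element domain satisfy the bound.
random.seed(0)
domain = list(range(7))
for _ in range(20):
    family = [{x for x in domain if random.random() < 0.5}
              for _ in range(30)]
    distinct = len({frozenset(F) for F in family})
    assert distinct <= phi(vc_dim(family, domain), len(domain))
```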

Proof. Let $X$ be the domain of the set system, and pick any $x\in X$.

Denote $\mathcal F':=\mathcal F\cap(X-\{x\})$. How much does $\mathcal F'$ differ from $\mathcal F$? There is a natural mapping from $\mathcal F$ to $\mathcal F'$, namely $F\mapsto F-\{x\}$. The preimage of a set $G\in\mathcal F'$ under this mapping consists of exactly two sets if both $G\in\mathcal F$ and $G\cup\{x\}\in\mathcal F$, and of exactly one set otherwise.

Denote

$$\mathcal M_x:=\{G\in\mathcal F\mid x\notin G,\ G\cup\{x\}\in\mathcal F\}.$$

It follows that

$$|\mathcal F|=|\mathcal F'|+|\mathcal M_x|.$$

Note that both $\mathcal F'$ and $\mathcal M_x$ are set systems with domain $X-\{x\}$, of size $n-1$. The key observation is that $\mathcal M_x$ has VC-dimension at most $d-1$. Indeed, if $A\subseteq X-\{x\}$ is shattered in $\mathcal M_x$, then $A\cup\{x\}$ is shattered in $\mathcal F$.

By induction on $n$ and $d$ we have that:

$$|\mathcal F|=|\mathcal F'|+|\mathcal M_x|\le\binom{n-1}{\le d}+\binom{n-1}{\le(d-1)}.$$

Note that

$$\binom{n-1}{\le d}+\binom{n-1}{\le(d-1)}\le\binom n{\le d},$$

since we can surjectively map a ${\le}d$-element subset $A$ of $[n]$ to $A\in\binom{[n-1]}{\le d}$ if $n\notin A$, and to $A-\{n\}\in\binom{[n-1]}{\le(d-1)}$ if $n\in A$. $\blacksquare$
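The final counting step can be verified numerically; in fact the two sides are equal, since the map described is a bijection (Pascal's rule summed over $k\le d$). A quick check, writing `phi(d, n)` for $\binom n{\le d}$:

```python
from math import comb

def phi(d, n):
    """Phi_d(n) = C(n,0) + ... + C(n,d), with phi(-1, n) = 0."""
    return sum(comb(n, k) for k in range(d + 1)) if d >= 0 else 0

# Pascal's rule summed over k <= d gives equality, not just <=.
for n in range(1, 13):
    for d in range(n + 1):
        assert phi(d, n - 1) + phi(d - 1, n - 1) == phi(d, n)
```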

Exercise For every element $v$ of the domain, define a (partial) matching $M_v^{\mathcal F}$ on $\mathcal F$ by

$$M_v^{\mathcal F}=\{\{A,B\}:A,B\in\mathcal F,\ A\triangle B=\{v\}\}.$$

Show that if for every subset $X$ of the domain there is $v\in X$ with $|M_v^{\mathcal F|_X}|\le k$, then $|\mathcal F|\le O(k\cdot n)$, where $n$ is the size of the domain.

Set systems in logical structures

Let $M$ be a logical structure and $\phi(\bar x;\bar y)$ a formula. For a tuple $\bar b\in M^{\bar y}$ denote

$$\phi(M;\bar b):=\{\bar a\in M^{\bar x}:\phi(\bar a,\bar b)\}.$$

Consider the set system $\mathcal F_\phi$ with domain $M^{\bar x}$ and sets $\phi(M;\bar b)$, for $\bar b\in M^{\bar y}$.

Examples

$$\phi(x;y_1,y_2)\equiv y_1\le x\le y_2.$$

Then $\mathcal F_\phi$ is the set system of closed intervals on $\mathbb R$.
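For this example one can confirm on a finite grid that the VC-dimension is 2: any two points are shattered, but for three points the middle one cannot be excluded while keeping the outer two. A sketch (grid size is an arbitrary choice):

```python
from itertools import combinations

# Closed intervals restricted to 8 grid points stand in for F_phi.
points = list(range(8))
intervals = [set(range(a, b + 1)) for a in points for b in points if a <= b]

def shattered(k):
    """Is some k-element set of points shattered by the intervals?"""
    return any(
        len({frozenset(I & set(Y)) for I in intervals}) == 2 ** k
        for Y in combinations(points, k))
```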

$$\phi(x_1,x_2;y_1,y_2,y_3)\equiv(x_1-y_1)^2+(x_2-y_2)^2\le y_3^2.$$

Then $\mathcal F_\phi$ is the set system of discs in $\mathbb R^2$.

$$\phi(x_1,x_2;y_1,y_2,y_3,y_4)\equiv y_1\le x_1\le y_2\land y_3\le x_2\le y_4.$$

Then $\mathcal F_\phi$ is the set system of rectangles in $\mathbb R^2$.

A first-order formula is a formula using $\exists,\forall,\lor,\land,\neg$, where the quantifiers range over elements of the domain.

Example $\phi(x,y)\equiv\exists z.\ y=x+z^2$ is equivalent in $(\mathbb R,+,\cdot,\le,0,1)$ to $x\le y$. So $\le$ is definable in $(\mathbb R,+,\cdot)$. The relation $y=e^x$ is not definable in $(\mathbb R,+,\cdot)$.

Example The formula $\delta_1(x,y)\equiv(x=y)\lor E(x,y)$ expresses that $\mathrm{dist}(x,y)\le 1$, the formula $\delta_2(x,y)\equiv x=y\lor E(x,y)\lor\exists z.\ E(x,z)\land E(z,y)$ expresses that $\mathrm{dist}(x,y)\le 2$, and similarly we can define $\delta_r(x,y)$ expressing $\mathrm{dist}(x,y)\le r$, for any fixed number $r$. There is no first-order formula that expresses the property "$x$ and $y$ are at finite distance" in all graphs. This can be expressed by a formula of monadic second-order logic (MSO).

Theorem (Shelah, 71) Let $M$ be a logical structure. The following conditions are equivalent:

  1. every formula $\phi(x;\bar y)$, with $x$ a single variable, defines a set system $\mathcal F_\phi$ of finite VC-dimension;
  2. every formula $\phi(\bar x;\bar y)$ defines a set system $\mathcal F_\phi$ of finite VC-dimension.

If the above conditions hold, then $M$ is called NIP.

Example $(\mathbb R,+,\cdot,\le)$ is NIP. We use the following result of Tarski:

Theorem (Tarski, 40) The structure $(\mathbb R,+,\cdot,\le,0,1)$ has quantifier elimination: every first-order formula $\phi(x_1,\ldots,x_k)$ is equivalent to a formula without quantifiers.

Such formulas are just boolean combinations of sets of the form

$$\{(a_1,\ldots,a_k)\in\mathbb R^k:p(a_1,\ldots,a_k)\ge 0\}$$

or

$$\{(a_1,\ldots,a_k)\in\mathbb R^k:p(a_1,\ldots,a_k)=0\}$$

where $p(x_1,\ldots,x_k)$ is a polynomial with integer coefficients, such as $p(x_1,x_2)=x_1^2+x_2^2-5$.

Corollary (Tarski, 30) For every first-order formula $\phi(x;y_1,\ldots,y_k)$ and parameters $b_1,\ldots,b_k\in\mathbb R$, the set $\phi(\mathbb R;b_1,\ldots,b_k)\subseteq\mathbb R$ is a union of at most $k_\phi$ intervals, where $k_\phi$ depends only on $\phi$ (and not on $b_1,\ldots,b_k$).

Proof. We may assume that $\phi$ is a boolean combination of $\ell$ sets as above. Let $d$ be the maximum degree of the polynomials involved. A polynomial of degree $d$ in one variable has at most $d$ roots, so each of the $\ell$ sets is a union of at most $d+1$ intervals. A boolean combination of these sets only has breakpoints at the roots of the polynomials, so it is a union of at most $k_\phi:=\ell d+1$ intervals. $\square$

Corollary The structure $(\mathbb R,\cdot,+,\le,0,1)$ is NIP.

Proof. The family of all sets that are unions of at most $k$ intervals cannot shatter a set of $2k+1$ points. The conclusion follows from the result of Shelah. $\square$

Definition A structure $M$ with a total order $<$ and other relations/functions is o-minimal if for every first-order formula $\phi(x;y_1,\ldots,y_k)$ and parameters $b_1,\ldots,b_k$, the set $\phi(M;b_1,\ldots,b_k)$ is a union of finitely many intervals.

It is known that this implies that such a set is a union of at most $n_\phi$ intervals, for some $n_\phi$ that depends only on $\phi$.

Corollary Every o-minimal structure is NIP.

Examples of o-minimal structures

Other examples of NIP structures

Classes of structures

Let $\mathcal C$ be a class of logical structures, for example the class of all planar graphs. A formula $\phi(\bar x;\bar y)$ defines a set system $\mathcal F^M_\phi$ in each structure $M\in\mathcal C$. We say that $\phi$ has VC-dimension ${\le}d$ in $\mathcal C$ if each of the set systems $\mathcal F^M_\phi$ has VC-dimension at most $d$. The shatter function $\pi_{\phi,\mathcal C}(m)$ is defined as the maximum of $\pi_{\mathcal F^M_\phi}(m)$, over all $M\in\mathcal C$. A class $\mathcal C$ is NIP if every formula $\phi(\bar x;\bar y)$ has bounded VC-dimension on $\mathcal C$.

Example Let $\mathcal C$ be a class of graphs and $\phi(x,y)\equiv E(x,y)$. Then $\pi_{E,\mathcal C}(m)$ is the following quantity: the maximum, over all graphs $G\in\mathcal C$ and sets $A\subseteq V(G)$ with $|A|\le m$, of the number of vertices $v\in V(G)$ with pairwise different neighborhoods in $A$. If $\mathcal C$ is the class of bipartite graphs, then $\pi_{E,\mathcal C}(m)=2^m$. If $\mathcal C$ is the class of planar graphs, then $\pi_{E,\mathcal C}(m)=O(m)$.

Fact [Podewski, Ziegler, 1978] The class $\mathcal C$ of planar graphs is NIP. Moreover, for every fixed first-order formula $\phi(x;y)$, we have $\pi_{\phi,\mathcal C}(m)=O(m)$.

This holds more generally for all classes $\mathcal C$ with bounded expansion [Pilipczuk, Siebertz, Toruńczyk, 2018] and for classes of bounded twin-width [Przybyszewski, 2022]. Classes with bounded expansion include classes of bounded tree-width, classes of graphs that can be embedded in a fixed surface, classes of graphs with bounded maximum degree, and classes that exclude a fixed minor.

They are contained in nowhere dense classes. The bound $\pi_{\phi,\mathcal C}(m)=O(m^{1+\varepsilon})$ holds for every nowhere dense class $\mathcal C$ and every fixed $\varepsilon>0$ [PST, 2018].

VC-density

Definition (VC-density). The VC-density of a set system $\mathcal F$ is the infimum of all $r\in\mathbb R$ such that $\pi_{\mathcal F}(m)=O(m^r)$. Similarly, we define the VC-density of a formula $\phi(\bar x;\bar y)$ in a class $\mathcal C$ of structures.

Example Every formula $\phi(\bar x;y)$ (with $y$ a single variable) has VC-density $1$ on the class of planar graphs, and more generally, on all nowhere dense classes.

Theorem (Assouad, 1983) The VC-density of a set system $\mathcal F$ is either $0$ or a real number ${\ge}1$. For every real $\alpha\ge 1$ there is a set system $\mathcal F$ with VC-density $\alpha$.

We will construct a family with VC-density 3/23/2. See Chernikov's notes, Example 2.7. For this we use the following.

Theorem (Kovari, Sos, Turan, 1954) Every $K_{2,2}$-free bipartite graph with $n+n$ vertices has $O(n^{3/2})$ edges.

Example Fix a prime number $p$. For every power $q$ of $p$ there is a field $\mathbb F_q$ with $q$ elements. Consider the set system $\mathcal F_q$ whose domain consists of the points and the lines of the affine plane $\mathbb F_q^2$, and whose sets are the pairs $\{p,\ell\}$ such that the point $p$ lies on the line $\ell$. Then:

  1. $\pi_{\mathcal F_q}(m)=O(m^{3/2})$. Pick an $m$-element subset $A$ of the domain. The restriction $\mathcal F_q\cap A$ consists of: the empty set, at most $m$ singletons, and pairs $\{p,\ell\}$ with $p,\ell\in A$ and $p\in\ell$.

The number of pairs is the number of edges in the bipartite graph $G_A$ whose parts are the points $p\in A$ and the lines $\ell\in A$, and whose edges denote incidence. The graph $G_A$ is $K_{2,2}$-free (two distinct points lie on at most one common line), so it has at most $O(m^{3/2})$ edges. Altogether, $|\mathcal F_q\cap A|\le 1+m+O(m^{3/2})=O(m^{3/2})$.

  2. On the other hand, $|\mathcal F_q|\ge q^3$, while the domain has about $2q^2$ elements (points and lines). Therefore, $|\mathcal F_q|\ge\Omega\big((|\mathrm{points}|+|\mathrm{lines}|)^{3/2}\big)$, so the VC-density of $\mathcal F_q$ is at least $3/2$.
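The construction can be carried out explicitly for a small prime; the encoding of lines by slope/intercept plus vertical lines is our choice:

```python
# Incidences in the affine plane over F_p, p = 5: p^2 points, p^2 + p
# lines, and p^3 + p^2 incident point-line pairs.
p = 5
points = [(x, y) for x in range(p) for y in range(p)]
lines = ([('s', a, b) for a in range(p) for b in range(p)]   # y = a*x + b
         + [('v', c) for c in range(p)])                     # x = c

def incident(pt, ln):
    x, y = pt
    return y == (ln[1] * x + ln[2]) % p if ln[0] == 's' else x == ln[1]

family = [frozenset([pt, ln]) for pt in points for ln in lines
          if incident(pt, ln)]

n = len(points) + len(lines)           # size of the domain
assert len(family) == p ** 3 + p ** 2  # ~ n^{3/2} sets on n elements
```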

VC-theorem and PAC learning

Let $D$ be a fixed domain. For a multiset $S\subseteq D$ and a set $F\subseteq D$, write $Av_S(F)$ to denote the fraction of points of $S$ that belong to $F$:

$$Av_S(F)=\frac{\#\{s\in S:s\in F\}}{|S|}.$$

Definition ($\varepsilon$-approximation). Fix a probability measure $\mu$ on $D$. A multiset $S$ is an $\varepsilon$-approximation for $\mu$ on a (measurable) set $F\subseteq D$ if

$$|\mu(F)-Av_S(F)|\le\varepsilon.$$

A multiset $S$ is an $\varepsilon$-approximation for $\mu$ on a set family $\mathcal F$ with domain $D$ if $S$ is an $\varepsilon$-approximation for $\mu$ on $F$, for each $F\in\mathcal F$.

Recall the weak law of large numbers:

Theorem (Weak law of large numbers) Fix a probability measure $\mu$ on a domain $D$ and a measurable set $F\subseteq D$. For every $\varepsilon>0$ and number $n$, a randomly and independently selected sequence $S$ of $n$ elements of $D$ is an $\varepsilon$-approximation for $\mu$ on $F$ with probability at least $1-\frac1{4n\varepsilon^2}$.

The VC-theorem is a uniform version, for an entire family F\cal F of small VC-dimension. We assume here that the domain is finite. This assumption can be removed at the cost of some additional assumptions.

Theorem (VC-theorem) Fix a probability measure $\mathbb P$ on a finite domain $D$ and a family $\mathcal F$ of subsets of $D$. For every $\varepsilon>0$ and number $n$, a randomly and independently selected sequence of $n$ elements of $D$ is an $\varepsilon$-approximation for $\mathbb P$ on $\mathcal F$ with probability at least

$$1-8\pi_{\mathcal F}(n)\cdot\exp\left(-\frac{n\varepsilon^2}{32}\right).$$

For the proof, see e.g. Chernikov's lecture notes.

Corollary For every fixed $d\in\mathbb N$ and $\varepsilon>0$ there is a number $N(d,\varepsilon)$ such that if $\mathcal F$ has VC-dimension $d$ and $\mathbb P$ is a probability distribution on $D$, then there exists an $\varepsilon$-approximation for $\mathbb P$ on $\mathcal F$ of size at most $N(d,\varepsilon)$. Moreover, we can take

$$N(d,\varepsilon)=O\left(\frac d{\varepsilon^2}\log\left(\frac d\varepsilon\right)\right).$$

Proof of Corollary. By Sauer-Shelah-Perles we have $\pi_{\mathcal F}(n)\le O(n^d)$. The key point is that the failure probability in the VC-theorem,

$$O(n^d)\cdot\exp\left(-\frac{n\varepsilon^2}{32}\right),$$

converges rapidly to $0$ as $n\to\infty$. In particular, we can find $N(d,\varepsilon)$ large enough so that this number is smaller than any fixed constant $c>0$. A more careful analysis shows that $O(\frac d{\varepsilon^2}\log(\frac d\varepsilon))$ suffices. See Chernikov's lecture notes. $\square$
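A small simulation illustrates the corollary for the family of intervals on $[0,1]$ with the uniform measure; all parameters below are our choices, and the supremum over intervals is approximated by a grid of endpoints:

```python
import random

random.seed(1)
n, eps = 2000, 0.1
S = [random.random() for _ in range(n)]          # i.i.d. uniform sample

def av(a, b):
    """Fraction of sample points falling in [a, b]."""
    return sum(1 for s in S if a <= s <= b) / n

# mu([a, b]) = b - a for the uniform measure; measure the discrepancy
# over a grid of candidate intervals.
grid = [i / 50 for i in range(51)]
disc = max(abs((b - a) - av(a, b)) for a in grid for b in grid if a < b)
assert disc <= eps   # S is (with overwhelming probability) an eps-approximation
```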

Let $\mu$ be a probability distribution on (a $\sigma$-algebra on) $D$ and let $\mathcal F$ be a system of (measurable) subsets of $D$.

For ε>0\varepsilon>0, a set SDS\subset D is called a ε\varepsilon-net for μ\mu on F\cal F, if every set FFF\in\cal F with μ(F)ε\mu(F)\ge \varepsilon has a nonempty intersection with SS.

Lemma Every $\varepsilon$-approximation is an $\varepsilon$-net. Proof. For every $F\in\mathcal F$, if $S$ has an empty intersection with $F$ then $Av_S(F)=0$. If $S$ is an $\varepsilon$-approximation, this implies that $\mu(F)\le\varepsilon$. $\square$

Corollary For every fixed $d\in\mathbb N$ and $\varepsilon>0$ there is a number $N(d,\varepsilon)$ such that if $\mathcal F$ has VC-dimension $d$ and $\mathbb P$ is a probability distribution on $D$, then there exists an $\varepsilon$-net for $\mathbb P$ on $\mathcal F$ of size at most $N(d,\varepsilon)$. In fact, a random (with respect to $\mathbb P$) sample of $N(d,\varepsilon)$ elements of $D$ is an $\varepsilon$-net, with probability at least $1-\varepsilon$. Moreover, we can take

$$N(d,\varepsilon)=O\left(\frac d{\varepsilon^2}\log\left(\frac d\varepsilon\right)\right).$$

In fact, for $\varepsilon$-nets (as opposed to $\varepsilon$-approximations) one can improve the bound above and get

$$N(d,\varepsilon)=O\left(\frac d\varepsilon\log\left(\frac d\varepsilon\right)\right).$$

Although we will use this bound in the sequel, the improvement over the previous bound will not be relevant to us.

PAC learning

A concept is a function $f:D\to[0,1]$. A concept class is a family of concepts on $D$.

We restrict attention to $\{0,1\}$-valued concepts (aka sets), which corresponds to a classification problem. So from now on, concept classes are just set families.

A labeled sample is a multiset $S\subseteq D$ with a partition $S=S_+\cup S_-$. Denote by $S_N(D)$ the set of all labeled samples of size $N$. A learning map is a function $f$ that maps every $N$-sample $S\in S_N(D)$ to a generalization $f(S)\subseteq D$. Ideally, $f(S)$ should be consistent with $S$, that is, $S_+=S\cap f(S)$ and $S_-=S\setminus f(S)$, but this does not need to happen (it is even desirable, to allow errors in the labeling of $S$).

Intuition See the papaya intuition from Understanding Machine Learning.

Overfitting If we simply pick $f(S)$ to be $S_+$, the learner memorizes the sample and generalizes poorly. To avoid this, we restrict to some family $\mathcal F$ of concepts that we are willing to consider, and require that the range of $f$ be contained in $\mathcal F$.

Empirical Risk Minimization is the learning approach in which we pick $f(S)$ to be any concept $C\in\mathcal F$ that minimizes the number of errors on the training set, that is, the number of points of $S$ that belong to $S_+\triangle C$. Here $\mathcal F$ is a predefined set family (representing our knowledge of the world).

PAC learnability

Say that $\mathcal F$ is PAC-learnable if for every $\varepsilon,\delta>0$ there exist $N=N(\varepsilon,\delta)$ and a learning map $f:S_N(D)\to 2^D$ such that for every concept $C\in\mathcal F$ and probability distribution $\mu$ on $D$: if we select $N$ elements of $D$ independently with distribution $\mu$ and label them with respect to $C$, obtaining a labeled sample $S=S_+\uplus S_-$, then $\mu(f(S)\triangle C)\le\varepsilon$ with probability at least $1-\delta$.

If moreover $f$ always returns concepts in $\mathcal F$, then we say that $\mathcal F$ is properly PAC-learnable. We say that $\mathcal F$ is consistently PAC-learnable if every learning map $f:S_N(D)\to\mathcal F$ such that $f(S)$ is always a concept of $\mathcal F$ agreeing with $S$ works.

Theorem (Fundamental theorem of PAC learning). Let $\mathcal F$ be a concept class. The following conditions are equivalent:

  1. $\mathcal F$ is PAC-learnable;
  2. $\mathcal F$ has finite VC-dimension.

Remark There are some hidden assumptions in this statement. Either we should assume some measurability conditions, similar to the ones in the VC-theorem for infinite domains, or we should assume that the domain is finite. The equivalence of the two conditions in the theorem should then be read as follows. Bottom-up implication: the sample complexity $N(\varepsilon,\delta)$ can be bounded in terms of $\varepsilon$, $\delta$, and the VC-dimension of the family $\mathcal F$. Top-down implication: the VC-dimension of $\mathcal F$ can be bounded in terms of the function $N(\cdot,\cdot)$; more precisely, in terms of $N(\frac13,\frac13)$.

Proof For the bottom-up implication, we may assume for simplicity (by taking the minimum) that $\delta=\varepsilon$. Let $N=N(d,\varepsilon)$ be the number obtained from the $\varepsilon$-net Corollary.

Let the learning map $f:S_N(D)\to\mathcal F$ be defined so that $f(S_+,S_-)$ is any concept $C\in\mathcal F$ that contains $S_+$ and is disjoint from $S_-$. To prove that this learning map $f$ has the required properties, we consider the set system

$$\mathcal F\triangle C:=\{F\triangle C:F\in\mathcal F\},$$

where $\triangle$ denotes the symmetric difference of sets. It was shown in the exercises that $\mathcal F\triangle C$ has the same VC-dimension as $\mathcal F$. By the $\varepsilon$-net corollary, a random sample of $N$ elements of $D$ is an $\varepsilon$-net for the family $\mathcal F\triangle C$, with probability at least $1-\varepsilon$. This easily implies that the learning map $f$ has the required property.

The top-down implication was shown in the exercises. See also Chernikov's notes. $\square$
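The bottom-up direction can be watched in action for the class of intervals: a consistent learner that outputs the tightest interval around the positive examples generalizes well. A sketch (the target interval and sample size are our arbitrary choices):

```python
import random

random.seed(0)
C = (0.3, 0.7)   # target concept: an interval, used to label the sample
N = 500

def learn(sample):
    """Consistent learner: tightest interval around the positive points."""
    pos = [x for x, label in sample if label]
    return (min(pos), max(pos)) if pos else (0.0, 0.0)

xs = [random.random() for _ in range(N)]         # uniform mu on [0, 1]
sample = [(x, C[0] <= x <= C[1]) for x in xs]
lo, hi = learn(sample)
error = (lo - C[0]) + (C[1] - hi)  # mu(f(S) symmetric-diff C), uniform mu
assert 0 <= error <= 0.05          # small generalization error, w.h.p.
```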

Transversals and Helly-type properties

Transversals

Definition Fix a set system $\mathcal F$ on a domain $D$.

  1. A transversal of $\mathcal F$ is a set $T\subseteq D$ that intersects every set $F\in\mathcal F$. The transversal number $\tau(\mathcal F)$ is the least size of a transversal of $\mathcal F$.
  2. A packing of $\mathcal F$ is a collection of pairwise disjoint sets in $\mathcal F$. The packing number $\nu(\mathcal F)$ is the largest size of a packing of $\mathcal F$.

Clearly $\tau(\mathcal F)\ge\nu(\mathcal F)$. It is NP-hard to compute $\tau(\mathcal F)$ and $\nu(\mathcal F)$. For this reason, we sometimes use the following approximate notions.

Definition Fix a set system $\mathcal F$ on a finite domain $D$.

  1. A fractional transversal is a function $f:D\to[0,1]$ such that for every $F\in\mathcal F$,

$$\sum_{x\in F}f(x)\ge 1.$$

Its total weight is $\sum_{x\in D}f(x)$. The fractional transversal number $\tau^*(\mathcal F)$ is the infimum of the total weights of fractional transversals.

  2. A fractional packing is a function $g:\mathcal F\to[0,1]$ such that for every $x\in D$,

$$\sum_{F\in\mathcal F:\ x\in F}g(F)\le 1.$$

Its total weight is $\sum_{F\in\mathcal F}g(F)$. The fractional packing number $\nu^*(\mathcal F)$ is the supremum of the total weights of fractional packings.

The following lemma is straightforward.

Lemma For every set system $\mathcal F$ on a finite domain $D$ we have:

$$\tau(\mathcal F)\ge\tau^*(\mathcal F)\ge\nu^*(\mathcal F)\ge\nu(\mathcal F).$$

The following fact is a consequence (or restatement) of the duality theorem for linear programs. It is an easy consequence of Farkas' lemma.

Fact For every set system $\mathcal F$ on a finite domain $D$ we have

$$\tau^*(\mathcal F)=\nu^*(\mathcal F).$$

Moreover, this number is rational, computable in polynomial time, and is attained both by a fractional transversal $f:D\to[0,1]$ with rational values and by a fractional packing $g:\mathcal F\to[0,1]$ with rational values.
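A worked example: the triangle, viewed as the set system $\{\{0,1\},\{1,2\},\{0,2\}\}$ on $\{0,1,2\}$, has $\tau=2$ and $\nu=1$, while the constant-$\frac12$ weightings witness $\tau^*=\nu^*=3/2$. A brute-force check:

```python
from itertools import combinations

D = [0, 1, 2]
F = [{0, 1}, {1, 2}, {0, 2}]   # the triangle, viewed as a set system

# Integral transversal number: smallest hitting set.
tau = min(k for k in range(len(D) + 1)
          for T in combinations(D, k)
          if all(set(T) & S for S in F))

# Integral packing number: most pairwise disjoint members.
nu = max(k for k in range(len(F) + 1)
         for P in combinations(range(len(F)), k)
         if all(not (F[i] & F[j]) for i, j in combinations(P, 2)))

# f = 1/2 everywhere is a fractional transversal, g = 1/2 a fractional
# packing; both have total weight 3/2, so tau* = nu* = 3/2 here.
f = {x: 0.5 for x in D}
g = {i: 0.5 for i in range(len(F))}
assert all(sum(f[x] for x in S) >= 1 for S in F)
assert all(sum(g[i] for i, S in enumerate(F) if x in S) <= 1 for x in D)
assert nu <= sum(g.values()) == sum(f.values()) <= tau   # 1 <= 3/2 <= 2
```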

Corollary For every set system $\mathcal F$ on a finite domain $D$ with $\mathrm{VCdim}(\mathcal F)=d$, we have:

$$\tau(\mathcal F)\le O(d)\cdot\tau^*(\mathcal F)\log\tau^*(\mathcal F).$$

Proof Follows from the existence of $\varepsilon$-nets of size $O(d)\frac1\varepsilon\log(\frac1\varepsilon)$. Namely, pick a fractional transversal $f:D\to[0,1]$ of total weight $r$, and normalize it to get a probability measure $\mu$ with $\mu(v)=f(v)/r$. For a set $F\in\mathcal F$ we have $\sum_{v\in F}f(v)\ge 1$ since $f$ is a fractional transversal, so $\mu(F)\ge 1/r$.

Let $\varepsilon=1/r$. By the $\varepsilon$-net theorem there is an $\varepsilon$-net $T$ for $\mu$ on $\mathcal F$ of size $O(d)\cdot r\log r$. Since $\mu(F)\ge 1/r=\varepsilon$ for all $F\in\mathcal F$ and $T$ is an $\varepsilon$-net, it follows that $T$ intersects every $F\in\mathcal F$. $\square$

Helly-type theorems

Fact (Helly's Theorem) Fix $d\ge 1$.

Let $\mathcal F$ be a finite family of convex sets in $\mathbb R^d$. If every $d+1$ sets in $\mathcal F$ have a point in common, then all sets in $\mathcal F$ have a point in common.

Definition A family $\mathcal G$ has the $k$-Helly property if for every finite subfamily $\mathcal F\subseteq\mathcal G$: if every $k$ sets of $\mathcal F$ have a point in common, then all sets of $\mathcal F$ have a point in common.

For instance, the family of convex subsets of $\mathbb R^d$ has the $(d+1)$-Helly property.
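In dimension 1 the 2-Helly property of intervals is easy to test exhaustively: pairwise intersection already forces a common point (the largest left endpoint works). A randomized check (helper names are ours):

```python
import random

def pairwise_intersect(ivs):
    return all(not (b1 < a2 or b2 < a1)
               for i, (a1, b1) in enumerate(ivs)
               for (a2, b2) in ivs[i + 1:])

def have_common_point(ivs):
    # Candidate common point: the largest left endpoint.
    return max(a for a, _ in ivs) <= min(b for _, b in ivs)

# Helly in dimension 1: pairwise intersection iff a common point exists.
random.seed(2)
for _ in range(1000):
    ivs = [tuple(sorted((random.random(), random.random())))
           for _ in range(5)]
    assert pairwise_intersect(ivs) == have_common_point(ivs)
```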

Families of bounded VC-dimension do not necessarily have the $k$-Helly property, for any finite $k$. For example, for a finite set $V$, the family of all sets of the form $V-\{a\}$, for $a\in V$, has VC-dimension $1$; yet when $|V|>k$, every $k$ of its sets have a point in common while the whole family has empty intersection.

However, families of bounded VC-dimension satisfy a related property.

Say that a family $\mathcal F$ has the $(p,q)$-property if among every $p$ sets of $\mathcal F$, some $q$ have a point in common.

Note that the hypothesis of the $k$-Helly property (every $k$ sets have a common point) is exactly the $(k,k)$-property, and the conclusion in Helly's theorem is that $\mathcal F$ has a transversal of size $1$, that is, $\tau(\mathcal F)\le 1$. The following is a generalization.

Fact (Alon, Kleitman). Let $p,q,d$ be integers with $p\ge q\ge d+1$. Then there is a number $N=N(p,q,d)$ such that for every finite family $\mathcal F$ of convex subsets of $\mathbb R^d$: if $\mathcal F$ has the $(p,q)$-property, then $\tau(\mathcal F)\le N$.

This statement generalizes to families of bounded VC-dimension; as observed by Matousek, the generalization can be derived from the proof of Alon and Kleitman.

Fact (Alon-Kleitman, Matousek). Let $p,q,d$ be integers with $p\ge q\ge d+1$. Then there is a number $N=N(p,q,d)$ such that for every finite family $\mathcal F$ with $\mathrm{VCdim}^*(\mathcal F)<d+1$: if $\mathcal F$ has the $(p,q)$-property, then $\tau(\mathcal F)\le N$.

Recall that $\mathrm{VCdim}^*(\mathcal F)$ is the dual VC-dimension of $\mathcal F$, defined as the VC-dimension of $\mathcal F^*$. Equivalently, it is the maximal number $k$ of sets in $\mathcal F$ such that every cell in the Venn diagram of those sets is nonempty (occupied by some element of the domain).

We reprove a duality result which is a corollary of the (p,q)(p,q)-theorem. We present a self-contained proof.

Let $E\subseteq A\times B$ be a binary relation. For $a\in A$ denote $E(a):=\{b\in B\mid(a,b)\in E\}$, and for $b\in B$ denote $E^*(b):=\{a\in A\mid(a,b)\in E\}$. Define the VC-dimension of a binary relation $E$ as the maximum of the VC-dimensions of the two set systems

$$(B,\{E(a)\mid a\in A\})\qquad\text{and}\qquad(A,\{E^*(b)\mid b\in B\}).$$

Here's the duality result.

Theorem (Corollary of the $(p,q)$-theorem) For every $d$ there is a number $k\le O(d)$ with the following property. Let $E\subseteq A\times B$ be a binary relation with $\mathrm{VCdim}(E)\le d$, with $A,B$ finite. Then at least one of the following two cases holds:

  1. there is a multiset $B'\subseteq B$ with $|B'|\le k$ such that $B'\cap E(a)\neq\emptyset$ holds for all $a\in A$;
  2. there is a multiset $A'\subseteq A$ with $|A'|\le k$ such that $A'\not\subseteq E^*(b)$ holds for all $b\in B$.

We first state a corollary of von Neumann's minimax theorem (also of Farkas' lemma, the Hahn-Banach theorem, or of the strong duality theorem for linear programming).

Theorem (Minimax Theorem) Let $E\subseteq A\times B$ be a binary relation with $A,B$ finite, and let $\alpha\in\mathbb R$. Then exactly one of the following two cases holds:

  1. there is some probability distribution $\nu$ on $B$ such that $\nu(E(a))\ge\alpha$ holds for all $a\in A$;
  2. there is some probability distribution $\mu$ on $A$ such that $\mu(E^*(b))<\alpha$ holds for all $b\in B$.

Proof of the Corollary of the $(p,q)$-theorem Set $\alpha:=1/2$ and $\varepsilon:=1/3$; the point is that $0<\alpha\pm\varepsilon<1$. Fix $d\in\mathbb N$, and let $k$ be the number from the VC-theorem, with $k\le O(\frac d{\varepsilon^2}\log(\frac1\varepsilon))=O(d)$. By the Minimax Theorem applied to $\alpha=1/2$, one of the two cases below holds.

Case 1: There is some probability distribution $\nu$ on $B$ such that $\nu(E(a))\ge 1/2$ holds for all $a\in A$.

By the VC-theorem applied with $\varepsilon=1/3$, there is a multiset $B'\subseteq B$ with $|B'|\le k$ such that

$$|\mathrm{Av}_{B'}(E(a))-\nu(E(a))|\le 1/3\qquad\text{for all }a\in A.$$

Pick $a\in A$. Since $\nu(E(a))\ge 1/2$, we have $\mathrm{Av}_{B'}(E(a))>0$, so $B'\cap E(a)$ is nonempty. Therefore, the first condition in the statement is satisfied.

Case 2: There is some probability distribution $\mu$ on $A$ such that $\mu(E^*(b))<1/2$ holds for all $b\in B$.

By the VC-theorem again, there is a multiset $A'\subseteq A$ with $|A'|\le k$ such that

$$|\mathrm{Av}_{A'}(E^*(b))-\mu(E^*(b))|\le 1/3\qquad\text{for all }b\in B.$$

Pick $b\in B$. Since $\mu(E^*(b))<1/2$, we have $\mathrm{Av}_{A'}(E^*(b))<1$, so $A'$ is not contained in $E^*(b)$. Therefore, the second condition in the statement is satisfied. $\blacksquare$

Sample compression schemes

This is based on the original paper by Moran and Yehudayoff.

Fix a set family $\mathcal F$ on a finite domain $D$. Let $S_d(\mathcal F)$ denote the set of labeled samples of size $d$, that is, pairs $S:=(S_+,S_-)$ of multisets of elements of $D$ with $|S_+|+|S_-|=d$, such that there exists some $F\in\mathcal F$ containing $S_+$ and disjoint from $S_-$. For a labeled sample $S$ and a set $Z\subseteq D$, let $S[Z]$ denote the sample $(S_+\cap Z,S_-\cap Z)$. If $S$ and $T$ are two samples, we say that $T$ is a sub-sample of $S$ if $T_+\subseteq S_+$ and $T_-\subseteq S_-$.

Let $S_\omega(\mathcal F)$ denote the union of all $S_d(\mathcal F)$, for $d\ge 0$.

Fix a family $\mathcal F$ of VC-dimension $d$, for some fixed $d$.

A sample compression scheme for $\mathcal F$, of size $k$, consists of two mappings: a compression mapping and a reconstruction mapping, defined as follows.

Compression

The compression map maps each $Y:=(Y_+,Y_-)\in S_\omega(\mathcal F)$ to a pair $(Z,w)$, where $Z$ is a sub-sample of $Y$ of size $k$ and $w\in\{0,1\}^k$ is a bit-sequence describing some auxiliary information.

Reconstruction

A reconstruction function maps a given pair $(Z,w)$, where $Z\in S_k(\mathcal F)$ and $w\in\{0,1\}^k$, to a concept $C\subseteq D$.

We require that the composition of the two mappings (start with $Y\in S_\omega(\mathcal F)$, compress it to some $(Z,w)$, reconstruct some concept $C$) returns a concept $C\subseteq D$ that contains $Y_+$ and is disjoint from $Y_-$.

Recall that $\mathrm{VCdim}^*(\mathcal F)$ is the dual VC-dimension of $\mathcal F$.

Theorem Let $\mathcal F$ be a set system on a finite domain, and let $d=\mathrm{VCdim}(\mathcal F)$ and $d^*=\mathrm{VCdim}^*(\mathcal F)$. Then $\mathcal F$ has a sample compression scheme of size $O(d\cdot d^*)$.

Proof By the VC-theorem, there are $s=O(d)$ and a learning map $f:S_s(\mathcal F)\to\mathcal F$ such that for every $F\in\mathcal F$ and probability distribution $\mu$ on $D$ there is $Z\subseteq\mathrm{supp}(\mu):=\{v\in D\mid\mu(v)>0\}$ with $|Z|\le s$ such that

$$\mu(F\triangle f(F\cap Z,Z-F))\le 1/3.$$

Let (Y+,Y)Sω(F)(Y_+,Y_-)\in S_\omega(\cal F), and let Y=Y+YY=Y_+\cup Y_-.

The key lemma is the following.

Lemma There are $T\le O(d^*)$ sets $Z^1,\ldots,Z^T\subseteq Y_+\cup Y_-$, each of size at most $s$, such that for every $x\in Y_+\cup Y_-$ we have:

$$x\in Y_+\qquad\iff\qquad\#\{t\in\{1,\ldots,T\}\mid x\in f(Y[Z^t])\}>T/2.$$

With this lemma, obtaining the compression scheme is straightforward, as we now describe.

Compression We assume that an arbitrary total order on $D$ is fixed. To compress $(Y_+,Y_-)$, we map it to $(Z,w)$, where $Z$ is the union of the sets from the statement of the lemma, and $w$ is a bit-sequence of length $T$, whose $i$-th bit indicates whether the $i$-th element of $Z$ (in the fixed order on $D$) belongs to $Z^i$. (This needs to be fixed slightly.)

Decompression Given $(Z,w)$, we first compute the sets $Z^1,\ldots,Z^T$, using $Z$ and the bit-sequence $w$ (each $Z^t$ inherits its labels from $Z$). We then define a concept $C$ as the set of all $x\in D$ such that

$$\#\{t\in\{1,\ldots,T\}\mid x\in f(Z[Z^t])\}>T/2.$$

It follows immediately from the lemma that this is a sample compression scheme of size $O(d\cdot d^*)$.

We now prove the lemma.

Proof of lemma Denote

$$\mathcal G:=\mathcal G_Y=\{f(Y_+\cap Z,Y_-\cap Z)\mid Z\subseteq Y,\ |Z|\le s\}\subseteq\mathcal F.$$

For a concept CGC\in \cal G and element xY+Yx\in Y_+\cup Y_-, say that CC is correct on xx if xC    xY+x\in C\iff x\in Y_+ holds.

By definition of ff, for every probability distribution μ\mu on YY there is GGG\in \cal G such that

μ({xYG is correct on x})2/3.\mu(\{x\in Y\mid \text{$G$ is correct on $x$}\})\ge 2/3.

By the Minimax theorem, there is a probability distribution ν\nu on G\cal G such that for every xYx\in Y we have

\nu(\{G\in {\cal G}\mid \text{$G$ is correct on $x$}\})\ge 2/3.

By the VC-theorem applied to the dual set system, there is a $1/8$-approximation for $\nu$ on ${\cal G}^*$: a multiset ${\cal H}\subseteq \cal G$ of size $T\le O(d^*)$ such that for all $x\in Y$ we have:

#{HHH is correct on x}#Hν({GGG is correct on x})182318>12.\frac{\#\{H\in{\cal H}\mid \text{$H$ is correct on $x$} \}}{\#\cal H}\ge \nu(\{G\in{\cal G}\mid \text{$G$ is correct on $x$}\})-\frac 1 8\ge \frac 23-\frac 18>\frac 12.

Taking $Z^1,\ldots,Z^T$ to be sets such that $f(Y[Z^1]),\ldots,f(Y[Z^T])$ enumerate the elements of $\cal H$ (every element of ${\cal H}\subseteq\cal G$ has this form by the definition of $\cal G$), this proves the lemma and concludes the proof of the theorem.$\blacksquare$

Haussler's packing theorem

Fix a domain $D$ and a probability distribution $\mathbb P$ on $D$. For two sets $X,Y\subseteq D$, write $\textup{dist}_{\mathbb P}(X,Y)$ to denote the measure $\mathbb P[X\triangle Y]$ of the symmetric difference of $X$ and $Y$. This defines a pseudometric on $2^D$ (two sets whose symmetric difference has measure zero are at distance $0$).
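For a finite domain this distance is immediate to compute; a minimal sketch (the distribution is represented as a dict from domain elements to probabilities):

```python
def dist(P, X, Y):
    # Measure of the symmetric difference of X and Y under the
    # distribution P (a dict: domain element -> probability).
    return sum(p for v, p in P.items() if (v in X) != (v in Y))

P = {v: 0.25 for v in "abcd"}           # uniform distribution on {a,b,c,d}
print(dist(P, {"a", "b"}, {"b", "c"}))  # symmetric difference {a, c} -> 0.5
```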

Fix $\varepsilon>0$. We are interested in the problem of packing a maximal number of pairwise disjoint (closed) balls of radius $\varepsilon$ in this metric. Equivalently, of finding the maximal size of a $2\varepsilon$-separated subset ${\cal S}\subseteq \cal F$, that is, a set $\cal S$ whose elements have pairwise distance larger than $2\varepsilon$.

Example. Suppose we are packing discs of radius 0<ε<10<\varepsilon<1 with center in the n×nn\times n square in R2\mathbb R^2. For a disc DD with center in [0,n]×[0,n][0,n]\times [0,n], the area of D([0,n]×[0,n])D\cap ([0,n]\times [0,n]) is at least πε2/4=Ω(ε2)\pi\varepsilon^2/4=\Omega(\varepsilon^2). Thus, we can pack at most O((n/ε)2)O((n/\varepsilon)^2) disjoint discs.
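The area argument can be checked numerically. A small sketch: centers on a grid of spacing $2\varepsilon$ give interior-disjoint discs, while the quarter-disc area bound caps how many discs can fit.

```python
import math

def grid_packing(n, eps):
    # Centers at (2*eps*i, 2*eps*j) inside [0, n]^2 give discs of
    # radius eps with pairwise-disjoint interiors.
    k = int(n // (2 * eps)) + 1
    return k * k

def area_upper_bound(n, eps):
    # Each disc with center in the square keeps area >= pi*eps^2/4 inside
    # the square, so at most n^2 / (pi*eps^2/4) = O((n/eps)^2) discs fit.
    return 4 * n * n / (math.pi * eps * eps)

n, eps = 10, 0.25
print(grid_packing(n, eps), math.ceil(area_upper_bound(n, eps)))
```

Both quantities grow like $(n/\varepsilon)^2$; the gap between them is only the constant factor.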

Haussler's packing theorem says that for set systems $\cal F$ of VC-dimension $d$, the behaviour is similar to that of ball packings in $d$-dimensional Euclidean space.

Theorem Let d>1d>1 and CC be constants, and let F\cal F be a set system on a finite domain DD whose primal shatter function satisfies πF(m)Cmd\pi_{\cal F}(m)\le Cm^d for all mNm\in\N. Let 0<ε<10<\varepsilon<1 and fix a probability distribution on DD. If F\cal F is ε\varepsilon-separated then FO(1/εd)|{\cal F}|\le O(1/\varepsilon^d).

Corollary Let $\cal F$ be a set system on a finite domain $D$ with $n=|D|$ elements, and $d=\textup{VCdim}(\cal F)$. Let $k$ be a positive integer. If $|A\triangle B|\ge k$ for all distinct $A,B\in\cal F$ then $|{\cal F}|\le O((n/k)^d)$.

This follows from the theorem applied to ε=k/n\varepsilon=k/n and the uniform distribution on DD, by the Sauer-Shelah-Perles lemma. Conversely, for k=1k=1, the Corollary recovers the SSP lemma, although with worse hidden constants.

Our proof of Haussler's theorem follows a recent proof by Kupavskii and Zhivotovskiy.

We will prove the following lemma, which quickly yields the theorem.

Lemma (\ast). Fix 0<δ<10<\delta<1 and ε>0\varepsilon>0. Let F\cal F be an ε\varepsilon-separated set system on DD, and let μ\mu be a probability distribution on DD. Fix m=2VCdim(F)εδm=\frac {2\text{VCdim}(\cal F)}{\varepsilon\delta} and let SS be a sample of mm independently chosen elements of DD according to the distribution μ\mu. Then

#FES[#FS]1δ.\#{\cal F}\le \frac{\mathbb E_S[\#\cal F|_S]}{1-\delta}.

In particular, if πF(m)Cmd\pi_{\cal F}(m)\le Cm^d for all mm, we have that

#FπF(m)1δ11δCmd=C11δ(2VCdim(F)εδ)d=C1(1δ)δd(2VCdim(F))dεd\#{\cal F}\le \frac{\pi_{\cal F}(m)}{1-\delta}\le \frac{1}{1-\delta}\cdot Cm^d=C\frac{1}{1-\delta}\left(\frac{2\text{VCdim}(\cal F)}{\varepsilon\delta}\right)^d=C\frac{1}{(1-\delta)\delta^d}\cdot \frac {(2\text{VCdim}(\cal F))^d}{\varepsilon^{d}}

holds for all 0<δ<10<\delta<1. The minimum, over 0<δ<10<\delta<1 of 1(1δ)δd\frac{1}{(1-\delta)\delta^d} is attained for δ=d/(d+1)\delta=d/(d+1), and is at most e(d+1)e(d+1). Hence we get
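The optimization over $\delta$ is elementary calculus; a quick numeric sanity check of the claimed minimizer $\delta=d/(d+1)$ and the bound $e(d+1)$:

```python
import math

def g(delta, d):
    # The factor 1 / ((1 - delta) * delta^d) appearing in the bound above.
    return 1.0 / ((1.0 - delta) * delta ** d)

for d in [1, 2, 5, 10]:
    delta_star = d / (d + 1)                 # claimed minimizer
    val = g(delta_star, d)                   # equals (d+1) * (1 + 1/d)**d
    grid_min = min(g(t / 1000, d) for t in range(1, 1000))
    assert grid_min >= val - 1e-9            # no grid point beats delta*
    assert val <= math.e * (d + 1)           # value is at most e(d+1)
    print(d, round(val, 3), round(math.e * (d + 1), 3))
```

The second assertion holds because $(1+1/d)^d\le e$ for all $d\ge 1$.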

#FCe(d+1)(2VCdim(F)ε)dO(1/εd),\#{\cal F}\le Ce(d+1) \left(\frac{2\text{VCdim}(\cal F)}{\varepsilon}\right)^d\le O(1/\varepsilon^d),

which proves the theorem. It therefore remains to prove the lemma.

For a set SDS\subseteq D and FFF\in\cal F denote by FSF|_S the labeled sample FS:=(S+,S)=(FS,SF)F|_S:=(S_+,S_-)=(F\cap S,S-F).

Recall that, given $\varepsilon>0$, the VC-theorem implies the existence of a learning function $f\colon S_m(\cal F)\to 2^D$, where $m=O(d)\frac {1}{\varepsilon}\log\frac{1}{\varepsilon}$, such that for every $F\in\cal F$ and random sample $S$ of $m$ elements of $D$ chosen independently according to a given probability distribution $\mu$, with high probability over the sample we have

\mu[f(F|_S)\triangle F]<\varepsilon.

Using this, it is easy (and instructive) to prove Lemma ($\ast$) with a weaker bound, where instead of $m=\frac {2\text{VCdim}(\cal F)}{\varepsilon\delta}$ we use $m=O(d)\frac 1{\varepsilon}\log \frac 1\varepsilon$. This would yield the bound

#FO(1εlog1ε)d,\#{\cal F}\le O\left(\frac 1{\varepsilon}\log \frac 1\varepsilon\right)^d,

which was obtained originally by Dudley. [To obtain Dudley's bound, proceed as in the proof of Lemma ($\ast$) presented below, but instead of the learning function obtained in the proposition below, use the learning function obtained from the VC-theorem.]

In order to obtain the improved bound as stated in Lemma $(\ast)$, we need a learning function with a number of sample elements of the form $m=O_\delta(\frac 1 \epsilon)$, rather than $m=O_{\delta}(\frac 1\epsilon\log \frac 1\epsilon)$. This may come at the cost of a worse dependency on $\delta$, which does not concern us, since we consider $\delta$ to be fixed, say $\delta=\frac 1 2$. This is achieved by the following proposition.

Proposition Let F\cal F be a set system on a domain DD. Fix m=2VCdim(F)εδm=\frac{2\text{VCdim}(\cal F)}{\varepsilon\delta}. There is a "learning function" f ⁣:Sm(F)2Df\colon S_m(\cal F)\to 2^D such that the following holds. For every FFF\in\cal F and random sample SS of mm elements of DD chosen independently with a given probability distribution μ\mu, we have that

μ[f(FS)F]<ε2\mathbb \mu[f(F|_S)\triangle F]<\frac\varepsilon 2

holds with probability at least 1δ1-\delta over the sample.

The following lemma is key. Given a set system F\cal F, consider its Hamming graph HFH_{\cal F}, with vertices F\cal F and edges connecting two sets F,GFF,G\in \cal F if and only if FG=1|F\triangle G|=1. The edge density of a graph (V,E)(V,E) is E/V|E|/|V|.

Lemma The Hamming graph HFH_{\cal F} of F\cal F has edge density at most VCdim(F)\textup{VCdim}(\cal F).

The proof of the lemma is by repeating the proof of the Sauer-Shelah-Perles lemma and observing that it also yields this conclusion. See also Geometric Discrepancy by Matousek, Lemma 5.15.

Since for every subfamily ${\cal F}'\subseteq \cal F$ the graph $H_{{\cal F}'}$ is exactly the subgraph of $H_{\cal F}$ induced by ${\cal F}'$, and $\textup{VCdim}({\cal F}')\le \textup{VCdim}(\cal F)$, the same bound holds for every induced subgraph of $H_{\cal F}$.
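As a sanity check (not part of the proof), one can verify the density bound on a concrete family, here all intervals over a 6-element domain, whose VC-dimension is 2:

```python
from itertools import combinations

def vc_dim(family, domain):
    # Brute-force VC dimension: largest k such that some k-subset of the
    # domain is shattered by the family.
    best = 0
    for k in range(1, len(domain) + 1):
        if any(len({F & frozenset(Y) for F in family}) == 2 ** k
               for Y in combinations(domain, k)):
            best = k
        else:
            break
    return best

domain = list(range(6))
# All intervals [i, j) over the domain (the empty set arises when i == j):
family = list({frozenset(range(i, j)) for i in range(7) for j in range(i, 7)})

# Edges of the Hamming graph: pairs of sets with symmetric difference 1.
edges = [(F, G) for F, G in combinations(family, 2) if len(F ^ G) == 1]
density = len(edges) / len(family)
print(density, vc_dim(family, domain))  # density <= VC-dimension = 2
```

Here the family has 22 sets and 36 Hamming edges, so the density is about 1.64, below the VC-dimension 2 as the lemma predicts.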

Fact Let GG be a graph such that every induced subgraph HH of GG has edge density at most dd. Then there is an orientation of the edges of GG such that every node has outdegree at most dd.

Remark This is very easy to prove with 2d2d instead of dd: repeatedly remove a vertex vv of degree at most 2d2d, and orient its edges towards the remaining vertices. This can be improved to dd using Hall's marriage theorem (see Alon, Tarsi '92).
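The easy $2d$ version of the remark can be sketched directly; `orient` below assumes the density hypothesis, which guarantees that a vertex of degree at most $2d$ always exists among the remaining vertices:

```python
def orient(vertices, edges, d):
    # Repeatedly remove a vertex of degree <= 2*d (one exists, since edge
    # density <= d forces average degree <= 2*d in every induced subgraph)
    # and orient its remaining edges away from it.
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    oriented, remaining = [], set(vertices)
    while remaining:
        v = next(u for u in remaining if len(adj[u]) <= 2 * d)
        for w in adj[v]:
            oriented.append((v, w))  # orient v -> w
            adj[w].discard(v)
        del adj[v]
        remaining.discard(v)
    return oriented

# A 4-cycle has edge density at most 1 in every induced subgraph, so with
# d = 1 every vertex ends up with outdegree at most 2:
out = orient(range(4), [(0, 1), (1, 2), (2, 3), (3, 0)], 1)
```

Each edge is oriented exactly once, at the moment its earlier-removed endpoint is deleted; the outdegree of a vertex is its degree at removal time, hence at most $2d$.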

Corollary There is an orientation of the edges of HFH_{\cal F} such that every node has outdegree at most VCdim(F)\textup{VCdim}(\cal F).

Proof of Proposition Let d0=VCdim(F)d_0=\text{VCdim}(\cal F). Fix a probability distribution μ\mu on DD. We will show that there is a function f ⁣:Sm(F)2Df\colon S_m(\cal F)\to 2^D such that

ES[μ(f(FS)F)]<d0m\mathbb E_S\left[ \mu(f(F|_S)\triangle F)\right]<\frac {d_0} m

where the expectation is with respect to SS.

Markov's inequality then gives the conclusion:

PS[μ(f(FS)F)ε2]ES[μ(f(FS)F)]ε/2<2d0mε=δ.\mathbb P_S\left[\mu(f(F|_S)\triangle F)\ge \frac \varepsilon 2\right]\le \frac {\mathbb E_S[\mu(f(F|_S)\triangle F)]}{\varepsilon/2}<\frac {2d_0}{m\varepsilon}=\delta.

We proceed to define the learning function $f$. Fix a set $S=\{s_1,\ldots,s_m\}$. Suppose that we are given the labelling $F|_S$