Examples

In this lecture we give some example computational problems. These are not well-known computational problems, but rather examples of what we could run into while analyzing data. Still, they illustrate useful techniques, such as cumulative sums and bitmasks.

Example Problem 1: counting points in many overlapping rectangles

INPUT:
Arrays x and y of size n of integers, arrays x0, y0, x1, y1 of size k.
We know that 0≤x[i],x0[i],x1[i]<xM and 0≤y[i],y0[i],y1[i]<yM.
OUTPUT:
An array res of size k, such that res[i] is the number of j such that x0[i]≤x[j]<x1[i] and y0[i]≤y[j]<y1[i].

Geometrically, you are given n points and k rectangles, and you have to count the number of points in each rectangle.

Our "intended" application: you are writing an eye-tracking system, and you analyze which parts of the image people are looking at for the most time. The image is of size 1000x1000 (xM=yM=1000), there are n=1000000 points (multiple people looking at our picture for some time), and k=1000000 queries (we ask about each rectangle of some size). However, we state the problem in the abstract form, which could be used in many similar applications.

Solution 1: a Python program such as
res = [sum(x0[i]<=x[j]<x1[i] and y0[i]<=y[j]<y1[i] for j in range(n)) for i in range(k)]
Or a C++ program such as: (untested)
vector<int> res(k, 0);
for(int i=0; i<k; i++)
  for(int j=0; j<n; j++)
    if(x0[i]<=x[j] && x[j]<x1[i] && y0[i]<=y[j] && y[j]<y1[i])
      res[i]++;
We will analyze the time complexity of the C++ program. The last two lines run in time O(1) (a constant number of comparisons and additions). The last three lines run in time n*O(1), which is O(n). The last four lines run in k*O(n), which is O(kn). The first line needs to fill a vector with k zeros, which is O(k). The total running time is O(k) + O(kn) = O(kn). Our program uses memory O(1) (assuming that we do not count the size of the input and output into our memory complexity).

The parts of the Python program correspond to the parts of the C++ program -- it is a bit harder to see the individual steps and where memory is used, but the complexity is the same.

For the size of data given in our "intended application", we get k*n = 1000000000000 elementary operations -- it would need quite some time to finish.

Solution 2: We could try counting how many times each point appears in the data: (untested)
vector<vector<int>> counts(yM, vector<int> (xM, 0));
for(int i=0; i<n; i++) counts[y[i]][x[i]]++;
vector<int> res;
for(int i=0; i<k; i++) {
  int sum = 0;
  for(int y=y0[i]; y<y1[i]; y++)
  for(int x=x0[i]; x<x1[i]; x++)
    sum += counts[y][x];
  res.push_back(sum);
  }
The first line runs in time O(xM*yM), the second in O(n), and the counting loop runs in time O(k*xM*yM). The total time complexity is O(k*xM*yM+n), and the memory complexity is O(xM*yM). For our data, it is roughly the same as Solution 1, though the constant hidden in the O notation might be a bit lower.

Solution 3: However, we can use the counting solution given above to create a much better solution.

We will show the idea on a one-dimensional variant of our problem: we are given arrays c[n], x0[k], and x1[k]; for each i, compute res[i], the sum of c[j] for x0[i]≤j<x1[i]. This can be done in time O(n*k) as above, or in time O(n+k) in the following way: compute the vector cumsum of size n+1, where cumsum[i] is the sum of c[j] for j<i. It is easy to show that cumsum can be computed in O(n), and once we have computed cumsum, each res[i] can be computed in O(1), using the formula res[i] = cumsum[x1[i]] - cumsum[x0[i]]!
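
For instance, a minimal C++ sketch of this one-dimensional computation (untested; c, x0, x1, n, k are as in the variant above):
vector<int> cumsum(n+1, 0);
for(int j=0; j<n; j++)
  cumsum[j+1] = cumsum[j] + c[j]; // cumsum[j+1] = c[0] + ... + c[j]
vector<int> res(k);
for(int i=0; i<k; i++)
  res[i] = cumsum[x1[i]] - cumsum[x0[i]]; // sum of c[j] for x0[i] <= j < x1[i]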

Our problem is solved in the same way, but in two dimensions: (untested)
vector<vector<int>> counts(yM+1, vector<int> (xM+1, 0));
for(int i=0; i<n; i++) counts[y[i]+1][x[i]+1]++;
// compute cumulative sums in each row
for(int y=0; y<=yM; y++) {
  int cs = 0;
  for(int x=0; x<=xM; x++) {
    int& c = counts[y][x];
    cs += c; c = cs;
    }
  }
// compute cumulative sums in each column
for(int x=0; x<=xM; x++) {
  int cs = 0;
  for(int y=0; y<=yM; y++) {
    int& c = counts[y][x];
    cs += c; c = cs;
    }
  }
vector<int> res;
for(int i=0; i<k; i++)
  res.push_back(
    counts[y1[i]][x1[i]] + counts[y0[i]][x0[i]]
  - counts[y1[i]][x0[i]] - counts[y0[i]][x1[i]]);
This works in time O(xM*yM+k+n); thus, for our "intended application", it is much faster than the previous algorithms. The memory complexity is still O(xM*yM).

This technique is called cumulative sums (also known as prefix sums). In Python, cumulative sums can be computed with the function cumsum from the numpy library: (untested)
import numpy as np
counts = np.zeros([yM+1, xM+1], dtype=int)
for i in range(n):
  counts[y[i]+1][x[i]+1]+=1
np.cumsum(counts, axis=0, out=counts)
np.cumsum(counts, axis=1, out=counts)
res = [counts[y1[i]][x1[i]] + counts[y0[i]][x0[i]] - counts[y0[i]][x1[i]] - counts[y1[i]][x0[i]] for i in range(k)]

Example Problem 2: correlations

INPUT:
An array data of size n times k of 0/1 values
OUTPUT:
An array res of size k times k, such that res[i][j] is the number of rows r of data such that r[i] and r[j] are both 1

Our intended application: we have information about a group of n=100000000 people, each of whom may or may not have each of k=15 binary properties (young/old, male/female, rich/poor, etc.). We want to see whether these properties are correlated, and as an intermediate step we need to compute the above.

As above, there are several solutions:

Solution 1: "Brute force" -- runs in time O(n*k*k). Could work, although it would take some time.

Solution 2: Use the counting technique. Note that there are 2^k = 32768 possible sets of properties that a given person could have. We enumerate these sets with numbers: a given row r corresponds to the set whose index is the sum of 2^i*r[i] for all i. For each possible set index s, compute count[s], the number of people who have that set. Then, for each pair (i,j), sum the appropriate count[s] of sets s which include both i and j. This runs in time O(n+2^k*k*k), which is much faster. The details are left for the reader.
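
Nonetheless, one possible C++ sketch of this idea (untested; as above, we assume data is a vector<vector<int>> with n rows and k columns):
// count[s] = number of people whose set of properties has index s
vector<int> count(1<<k, 0);
for(int r=0; r<n; r++) {
  int s = 0;
  for(int i=0; i<k; i++)
    if(data[r][i]==1) s += (1<<i);
  count[s]++;
  }
vector<vector<int>> res(k, vector<int>(k, 0));
for(int i=0; i<k; i++)
  for(int j=0; j<k; j++)
    for(int s=0; s<(1<<k); s++)
      if((s & (1<<i)) && (s & (1<<j)))
        res[i][j] += count[s];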

Solution 3: The solution above can be improved to O(n+2^k*k), using a technique similar to cumulative sums. Again, the details are left for the reader. Solutions 2 and 3 run significantly faster than Solution 1 (however, to be honest -- if we needed the result just for one instance, implementing them would probably take us much more time than we would save).
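
One possible reading of the hint, as a C++ sketch (untested; this is our interpretation): after computing count[s] as in Solution 2, transform the array so that count[s] becomes the number of people whose set of properties contains all the properties in s -- a "cumulative sum over supersets"; then res[i][j] is simply the transformed value at the two-element set {i, j}. This replaces the final triple loop of the previous sketch:
// for each bit b, add count[s | (1<<b)] into count[s];
// afterwards, count[s] = number of people whose property set contains all of s
for(int b=0; b<k; b++)
  for(int s=0; s<(1<<k); s++)
    if(!(s & (1<<b)))
      count[s] += count[s | (1<<b)];
vector<vector<int>> res(k, vector<int>(k, 0));
for(int i=0; i<k; i++)
  for(int j=0; j<k; j++)
    res[i][j] = count[(1<<i) | (1<<j)];
The transform runs in time O(2^k*k), and reading off all the pairs afterwards takes only O(k*k) more.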

Corollary: It is sometimes said that algorithms with exponential running time are bad. Not necessarily -- they can serve us very well if the data is small, or when the particular aspect of our data that causes the exponential growth is small!