On the Bootstrap for Persistence Diagrams and Landscapes

Persistent homology probes topological properties from point clouds and functions. By looking at multiple scales simultaneously, one can record the births and deaths of topological features as the scale varies. In this paper we use a statistical technique, the empirical bootstrap, to separate topological signal from topological noise. In particular, we derive confidence sets for persistence diagrams and confidence bands for persistence landscapes.


Introduction
Persistent homology is a method for studying homology at multiple scales simultaneously. Given a manifold $X$ embedded in a metric space $Y$, we consider a probability density function $p : Y \to \mathbb{R}$, defined over $Y$ but concentrated around $X$; that is, the density is positive in a small neighborhood of $X$ and very small elsewhere in $Y$. For the right scale parameter $t$, the superlevel set $p^{-1}([t, \infty))$ captures the homology of $X$. The problem, however, is that $t$ is not known a priori. Persistent homology quantifies the topological changes of the superlevel sets with a multiset of points in the extended plane; we call this multiset the persistence diagram, and denote it by $\mathcal{P}$.
Another way to represent the information contained in a persistence diagram is with the landscape function $L : \mathbb{R} \to \mathbb{R}$, which can be thought of as a functional summary of $\mathcal{P}$; we define these concepts in Section 1.1.
Computationally, it may be difficult to compute $\mathcal{P}$ or $L$ directly. Instead, we assume that $p$ corresponds to a probability distribution $P$, from which we can sample. Given a sample of size $n$, we create an estimate $p_n$ of the probability density function using a kernel density estimate. As $n$ increases, $p_n$ approaches the true probability density. Given $n$ large enough, we compute the persistence diagram $\mathcal{P}_n$ and the landscape $L_n$ corresponding to $p_n$.
Sometimes knowing the estimate of a persistence diagram or landscape is not enough. The bigger question is: How close is the estimated persistence diagram or landscape to the true one? We answer this question by constructing a confidence set for persistence diagrams and a confidence band for persistence landscapes.
A $(1 - \alpha)$-confidence interval for a parameter $\theta$ is an interval $[a, b]$ such that the probability $\mathbb{P}(\theta \in [a, b])$ is at least $1 - \alpha$. In our setting, we wish to find a confidence set for a persistence diagram $\mathcal{P}$. To do so, we compute an estimated diagram $\hat{\mathcal{P}}$ and an interval $[0, c]$ such that the bottleneck distance between $\hat{\mathcal{P}}$ and $\mathcal{P}$ is contained in $[0, c]$ with probability at least $1 - \alpha$. That is, we find a metric ball, centered at $\hat{\mathcal{P}}$, that contains $\mathcal{P}$ with high probability.
In this paper, we present the bootstrap, a method for computing confidence intervals, and we apply it to persistence diagrams and landscapes. After briefly reviewing the necessary concepts from computational topology, we give the general technique of bootstrapping in statistics in Section 1.2. In Section 2, we apply the bootstrap to persistence diagrams and landscapes, providing a few examples of these confidence intervals. We conclude in Section 2.3 with a discussion of our ongoing research and open questions.

Background
Before presenting our results, we review the necessary definitions and theorems from persistent homology. Then, we present the bootstrap. Due to space constraints, we cover the basics and provide references for a more detailed description.

Persistence Diagrams and Landscapes
Let $Y$ be a metric space; for example, let $Y$ be a compact subspace of $\mathbb{R}^D$. Suppose we have a probability density function $p : Y \to \mathbb{R}$ concentrated in a neighborhood of a manifold $X \subseteq Y$. Persistent homology monitors the evolution of the generators of the homology groups of $p^{-1}([t, \infty))$, the superlevel sets of $p$, and assigns to each generator of these groups a birth time (or scale) $b$ and a death time $d$. The persistence diagram $\mathcal{P}$ records each pair $(b, d)$ as the point $\left(\frac{b+d}{2}, \frac{b-d}{2}\right)$; that is, the x-coordinate is the mid-life of the homological feature and the y-coordinate is the half-life, or half of the persistence, of the feature. We refer the reader to Edelsbrunner and Harer [2010] for a more complete introduction to persistent homology.
Let $\mathcal{D}_T$ be the space of positive, countable, $T$-bounded persistence diagrams; that is, for each point $(x, y) = \left(\frac{b+d}{2}, \frac{b-d}{2}\right) \in \mathcal{P}$, we have $0 \le d \le b \le T$, and there are countably many points for which $y > 0$. We note here that each point on the line $y = 0$ is included in the persistence diagram $\mathcal{P}$ with infinite multiplicity. Letting $W_\infty(\mathcal{P}_1, \mathcal{P}_2)$ denote the bottleneck distance between diagrams $\mathcal{P}_1$ and $\mathcal{P}_2$, the space $(\mathcal{D}_T, W_\infty)$ is a metric space. We then have the following stability result from Cohen-Steiner et al. [2007], generalized in Chazal et al. [2012]:

Theorem 1.1 (Stability Theorem). Let $M$ be finitely triangulable. Let $f, g : M \to \mathbb{R}$ be two continuous functions. Then, the corresponding persistence diagrams $\mathcal{P}_f$ and $\mathcal{P}_g$ are well defined, and $W_\infty(\mathcal{P}_f, \mathcal{P}_g) \le \|f - g\|_\infty$.

Bubenik [2012] introduced another representation called the persistence landscape, which is in one-to-one correspondence with persistence diagrams. A persistence landscape is a continuous, piecewise linear function $L : \mathbb{Z}^+ \times \mathbb{R} \to \mathbb{R}$. To define the persistence landscape function, we replace each persistence point $p = (x, y) = \left(\frac{b+d}{2}, \frac{b-d}{2}\right)$ with the triangle function
\[
t_p(z) =
\begin{cases}
z - x + y & \text{if } z \in [x - y, x], \\
x + y - z & \text{if } z \in (x, x + y], \\
0 & \text{otherwise.}
\end{cases}
\]
Notice that $p$ is itself on the graph of $t_p(z)$. We obtain an arrangement of curves by overlaying the graphs of the functions $\{t_p(z)\}_{p \in \mathcal{P}}$; see Figure 1. The persistence landscape is defined formally as a walk through this arrangement:
\[
L_{\mathcal{P}}(k, z) = \operatorname*{kmax}_{p \in \mathcal{P}} \; t_p(z), \qquad (1)
\]
where kmax is the $k$th maximum value in the set; in particular, 1max is the usual maximum function. Observe that $L_{\mathcal{P}}(k, z)$ is 1-Lipschitz. For ease of exposition, we will focus on $k = 1$ in this paper, using $L(z) = L_{\mathcal{P}}(1, z)$. However, the ideas we present in Section 2.2 hold for $k > 1$. Our definition of the persistence landscape is equivalent to the original definition given in Bubenik [2012].
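To make the walk through the arrangement of triangle functions concrete, the following is a minimal Python sketch of evaluating a landscape at a single value $z$. The function names (`to_mid_half`, `triangle`, `landscape`) and the toy diagram are our own illustrative choices, not part of any library, and the sketch assumes a finite diagram whose points are already given as (birth, death) pairs with death at most birth.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) = (mid-life, half-life)


def to_mid_half(birth: float, death: float) -> Point:
    """Map a (birth, death) pair with death <= birth to (mid-life, half-life)."""
    return ((birth + death) / 2.0, (birth - death) / 2.0)


def triangle(p: Point, z: float) -> float:
    """Triangle function t_p(z): equals y at z = x, has slopes +1 and -1,
    and vanishes outside [x - y, x + y]."""
    x, y = p
    return max(0.0, y - abs(z - x))


def landscape(diagram: List[Point], k: int, z: float) -> float:
    """k-th largest value of t_p(z) over the diagram (0 if fewer than k points)."""
    values = sorted((triangle(p, z) for p in diagram), reverse=True)
    return values[k - 1] if k <= len(values) else 0.0


# Hypothetical toy diagram with two features.
diagram = [to_mid_half(b, d) for b, d in [(4.0, 0.0), (3.0, 1.0)]]
print(landscape(diagram, 1, 2.0))  # → 2.0
print(landscape(diagram, 2, 2.0))  # → 1.0
```

Returning 0 when fewer than $k$ points are present mirrors the convention that points with zero half-life belong to every diagram with infinite multiplicity.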

The Standard Bootstrap
Introduced in Efron [1979], the bootstrap is a general method for estimating standard errors and for computing confidence intervals. We focus on the latter in this paper, but refer the interested reader to Efron et al. [2001], Davison and Hinkley [1997], and Van der Vaart [2000] for more details on the versatility of the bootstrap. Let $X_1, \dots, X_n$ be independent and identically distributed random variables taking values in the measure space $(\mathbb{X}, \mathcal{X}, P)$. Suppose we are interested in estimating the real-valued parameter $\theta$ corresponding to the distribution $P$ of the observations. We estimate $\theta$ using the statistic $\hat{\theta} = g(X_1, \dots, X_n)$, which is some function of the data. For example, $\theta$ and $\hat{\theta}$ could be the population mean and the sample mean, respectively. The distribution of the difference $\hat{\theta} - \theta$ contains all the information that we need to construct a confidence interval of level $1 - \alpha$ for $\theta$; that is, an interval $[a, b]$ depending on the data $X_1, \dots, X_n$ such that $\mathbb{P}(\theta \in [a, b]) \ge 1 - \alpha$. If we knew the cumulative distribution function $F$ of $\hat{\theta} - \theta$, then we could compute the quantiles $F^{-1}(1 - \alpha/2)$ and $F^{-1}(\alpha/2)$. Setting $a = \hat{\theta} - F^{-1}(1 - \alpha/2)$ and $b = \hat{\theta} - F^{-1}(\alpha/2)$, we would immediately obtain a $(1 - \alpha)$-confidence interval for $\theta$:
\[
\mathbb{P}(a \le \theta \le b) = \mathbb{P}\left(F^{-1}(\alpha/2) \le \hat{\theta} - \theta \le F^{-1}(1 - \alpha/2)\right) = 1 - \alpha,
\]
where the last equality holds when $F$ is continuous. Unfortunately, the distribution of $\hat{\theta} - \theta$ depends on the unknown distribution $P$.
In the first step of the bootstrap procedure, we approximate the unknown $P$ with the empirical measure $P_n$ that puts mass $1/n$ at each $X_i$ in the sample. Let $X_1^*, \dots, X_n^*$ be a sample of size $n$ from $P_n$. Equivalently, we can think of drawing $X_1^*, \dots, X_n^*$ from $X_1, \dots, X_n$ with replacement. We estimate the distribution $F(r)$ with the distribution $\hat{F}(r) = P_n(\hat{\theta}^* - \hat{\theta} \le r)$, where $\hat{\theta}^* = g(X_1^*, \dots, X_n^*)$.
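The distribution $\hat{F}$ is still not analytically computable, but it can be approximated by simulation: for large $B$, obtain $B$ different values of $\hat{\theta}^*$ and approximate $\hat{F}(r)$, and hence $F(r)$, with $\tilde{F}(r) = \frac{1}{B} \sum_{j=1}^{B} I(\hat{\theta}_j^* - \hat{\theta} \le r)$. Since the quantiles of $\tilde{F}$ approximate the quantiles of $F$, we define the estimated confidence interval as
\[
C_n = \left[\hat{\theta} - \tilde{F}^{-1}(1 - \alpha/2), \; \hat{\theta} - \tilde{F}^{-1}(\alpha/2)\right]. \qquad (2)
\]
In summary, the standard bootstrap procedure is:
1. Compute the estimate $\hat{\theta} = g(X_1, \dots, X_n)$.
2. Draw $X_1^*, \dots, X_n^*$ from $P_n$ and compute $\hat{\theta}^* = g(X_1^*, \dots, X_n^*)$.
3. Repeat the previous step $B$ times to obtain $\hat{\theta}_1^*, \dots, \hat{\theta}_B^*$.
4. Compute the quantiles of $\tilde{F}$ and construct the confidence interval $C_n$.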
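The procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names (`empirical_quantile`, `bootstrap_interval`, `mean`), the default of 2000 resamples, and the Gaussian toy data are our own assumptions, and the statistic $g$ is taken to be the sample mean purely as an example.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def empirical_quantile(sorted_values: Sequence[float], prob: float) -> float:
    """Smallest value v such that at least a fraction `prob` of the values are <= v."""
    n = len(sorted_values)
    index = min(n - 1, max(0, math.ceil(prob * n) - 1))
    return sorted_values[index]


def bootstrap_interval(
    data: Sequence[float],
    g: Callable[[Sequence[float]], float],
    alpha: float = 0.05,
    B: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Bootstrap confidence interval of level 1 - alpha for the parameter estimated by g."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = g(data)  # estimate computed from the original sample
    diffs: List[float] = []
    for _ in range(B):
        # Sampling with replacement is the same as drawing from the empirical measure.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        diffs.append(g(resample) - theta_hat)
    diffs.sort()
    upper_q = empirical_quantile(diffs, 1 - alpha / 2)
    lower_q = empirical_quantile(diffs, alpha / 2)
    return theta_hat - upper_q, theta_hat - lower_q


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Hypothetical data: 200 draws from a normal distribution with mean 5.
data_rng = random.Random(1)
data = [data_rng.gauss(5.0, 2.0) for _ in range(200)]
lower, upper = bootstrap_interval(data, mean)
print(lower, mean(data), upper)
```

Note that the upper quantile of the bootstrap differences is subtracted to obtain the lower endpoint, and vice versa, exactly as in the definition of the interval.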
There are two sources of error in the bootstrap procedure. We first approximate $F$ with $\hat{F}$, and then we estimate $\hat{F}$ by simulation. The second error can be made arbitrarily small by choosing $B$ large enough. Therefore, this error is usually ignored in the theory of the bootstrap.
Formally, one has to show that $\sup_r |\hat{F}(r) - F(r)| \xrightarrow{P} 0$, which implies that the confidence interval $C_n$, defined in (2), is asymptotically consistent at level $1 - \alpha$; that is, $\liminf_{n \to \infty} \mathbb{P}(\theta \in C_n) \ge 1 - \alpha$.

The Bootstrap Empirical Process
When a random variable is a function rather than a real value, the bootstrap procedure described above can be used to construct a confidence interval for the function evaluated at a particular element of the domain. Instead, we use the bootstrap empirical process, which can be used to find a confidence band for a function h(t); that is, we find a pair of functions a(t) and b(t) such that the probability that h(t) ∈ [a(t), b(t)] for all t is at least 1 − α. We describe this technique below, but refer the reader to Van der Vaart and Wellner [1996] and Kosorok [2008] for more details.
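An empirical process is a stochastic process based on a random sample. Let $X_1, \dots, X_n$ be independent and identically distributed random variables taking values in the measure space $(\mathbb{X}, \mathcal{X}, P)$. For a measurable function $f : \mathbb{X} \to \mathbb{R}$, we denote $Pf = \int f \, dP$ and $P_n f = \int f \, dP_n = n^{-1} \sum_{i=1}^{n} f(X_i)$. By the law of large numbers, $P_n f$ converges almost surely to $Pf$. Given a class $\mathcal{F}$ of measurable functions, we define the empirical process $\mathbb{G}_n$ indexed by $\mathcal{F}$ as
\[
\{\mathbb{G}_n f\}_{f \in \mathcal{F}} = \left\{\sqrt{n}\,(P_n f - P f)\right\}_{f \in \mathcal{F}}.
\]
For example, if $\mathcal{F} = \{I_{(-\infty, t]} : t \in \mathbb{R}\}$, then $\{P_n f\}_{f \in \mathcal{F}}$ is the empirical distribution function seen as a stochastic process indexed by $t$.

Definition 1.3. A class $\mathcal{F}$ of measurable functions is called $P$-Donsker if the empirical process $\{\mathbb{G}_n f\}_{f \in \mathcal{F}}$ converges in distribution to a limit process in the space $\ell^\infty(\mathcal{F})$ of bounded real-valued functions on $\mathcal{F}$. The limit process is a Gaussian process $\mathbb{G}$ with zero mean and covariance function $E[\mathbb{G}f \, \mathbb{G}g] := Pfg - Pf \, Pg$; this process is known as a Brownian bridge.

Let $P_n^* f = n^{-1} \sum_{i=1}^{n} f(X_i^*)$, where $\{X_1^*, \dots, X_n^*\}$ is a bootstrap sample from $P_n$, the measure that puts mass $1/n$ on each element of the sample $\{X_1, \dots, X_n\}$. The bootstrap empirical process $\mathbb{G}_n^*$ indexed by $\mathcal{F}$ is defined as
\[
\{\mathbb{G}_n^* f\}_{f \in \mathcal{F}} = \left\{\sqrt{n}\,(P_n^* f - P_n f)\right\}_{f \in \mathcal{F}}.
\]

Theorem 1.4 (Theorem 2.4 in Giné and Zinn [1990]). $\mathcal{F}$ is $P$-Donsker if and only if $\mathbb{G}_n^*$ converges in distribution to $\mathbb{G}$ in $\ell^\infty(\mathcal{F})$.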
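In words, Theorem 1.4 states that $\mathcal{F}$ is $P$-Donsker if and only if the bootstrap empirical process converges in distribution to the limit process $\mathbb{G}$ given in Definition 1.3. Suppose we are interested in constructing a confidence band of level $1 - \alpha$ for $\{Pf\}_{f \in \mathcal{F}}$, where $\mathcal{F}$ is $P$-Donsker. Let $\hat{\theta} = \sup_{f \in \mathcal{F}} |\mathbb{G}_n f|$. We proceed as follows:
1. Draw $X_1^*, \dots, X_n^*$ from $P_n$ and compute $\hat{\theta}^* = \sup_{f \in \mathcal{F}} |\mathbb{G}_n^* f|$.
2. Repeat the previous step $B$ times to obtain $\hat{\theta}_1^*, \dots, \hat{\theta}_B^*$.
3. Compute $q_\alpha = \inf\left\{q : \frac{1}{B} \sum_{j=1}^{B} I(\hat{\theta}_j^* \ge q) \le \alpha\right\}$.
4. For each $f \in \mathcal{F}$, define the confidence band $C_n(f) = \left[P_n f - \frac{q_\alpha}{\sqrt{n}}, \; P_n f + \frac{q_\alpha}{\sqrt{n}}\right]$.
A consequence of Theorem 1.4 is that, for large $n$ and $B$, the interval $[0, q_\alpha]$ has coverage $1 - \alpha$ for $\hat{\theta}$ and the band $\{C_n(f)\}_{f \in \mathcal{F}}$ has coverage $1 - \alpha$ for $\{Pf\}_{f \in \mathcal{F}}$.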
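As an illustration, the following Python sketch builds such a band for the simplest function class, the indicators of half-lines, for which $P_n f$ is the empirical distribution function. The names `ecdf_on_grid`, `upper_quantile`, and `ecdf_band`, the uniform toy sample, and the choice $B = 500$ are assumptions made for this example. Because the empirical distribution functions of the sample and of any resample are step functions that jump only at sample points, the supremum is computed exactly by evaluating at the sample points.

```python
import bisect
import math
import random
from typing import List, Sequence, Tuple


def ecdf_on_grid(sample: Sequence[float], grid: Sequence[float]) -> List[float]:
    """Empirical distribution function of `sample`, evaluated at each t in `grid`."""
    ordered = sorted(sample)
    return [bisect.bisect_right(ordered, t) / len(ordered) for t in grid]


def upper_quantile(values: Sequence[float], alpha: float) -> float:
    """q_alpha = inf{q : (1/B) * #{j : values[j] >= q} <= alpha}."""
    ordered = sorted(values)
    allowed = math.floor(alpha * len(ordered))  # number of values allowed to be >= q
    return ordered[max(0, len(ordered) - allowed - 1)]


def ecdf_band(
    sample: Sequence[float], alpha: float = 0.05, B: int = 500, seed: int = 0
) -> Tuple[float, List[Tuple[float, float, float]]]:
    """Return q_alpha and a list of (t, lower, upper) triples forming the band."""
    rng = random.Random(seed)
    n = len(sample)
    grid = sorted(set(sample))  # both step functions only jump at sample points
    f_n = ecdf_on_grid(sample, grid)
    stats = []
    for _ in range(B):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        f_star = ecdf_on_grid(resample, grid)
        # sup over the class of the bootstrap empirical process
        stats.append(math.sqrt(n) * max(abs(a - b) for a, b in zip(f_star, f_n)))
    q_alpha = upper_quantile(stats, alpha)
    half_width = q_alpha / math.sqrt(n)
    band = [(t, v - half_width, v + half_width) for t, v in zip(grid, f_n)]
    return q_alpha, band


# Hypothetical data: 300 uniform draws on [0, 1].
sample_rng = random.Random(2)
sample = [sample_rng.random() for _ in range(300)]
q_alpha, band = ecdf_band(sample)
print(q_alpha, band[0], band[-1])
```

The band has the same half-width $q_\alpha / \sqrt{n}$ at every $t$, which is what distinguishes a simultaneous band from a collection of pointwise intervals.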

Applications of the Bootstrap
In this section, we apply the bootstrap from the previous section to persistence diagrams, as well as to persistence landscapes.

Persistence Diagrams
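Let $X_1, \dots, X_n$ be a sample from the distribution $P$, supported on a smooth manifold $\mathbb{M} \subset \mathbb{R}^D$. Let
\[
p_h(x) = \int_{\mathbb{M}} \frac{1}{h^D} K\!\left(\frac{x - u}{h}\right) dP(u),
\]
where $K$ is a function satisfying $\int K(u)\,du = 1$ and $K(u) \ge 0$ for all $u$; thus $p_h$ is a probability density function. The function $K$ is called a kernel and the parameter $h > 0$ is its bandwidth. Then $p_h$ is the density of the probability measure $P_h$, which is the convolution of $P$ with the rescaled kernel.

Our target of inference in this section is $\mathcal{P}_h$, the persistence diagram of the superlevel sets of $p_h$. The standard estimator for $p_h$ is the kernel density estimator
\[
\hat{p}_h(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^D} K\!\left(\frac{x - X_i}{h}\right);
\]
notice that if the $X_i$ are fixed, then $\hat{p}_h$ is a probability density function. Let $\hat{\mathcal{P}}_h$ be the corresponding persistence diagram. We wish to find a confidence set for $\mathcal{P}_h$, i.e., an interval $[0, c_n]$ such that $\liminf_{n \to \infty} \mathbb{P}(W_\infty(\hat{\mathcal{P}}_h, \mathcal{P}_h) \in [0, c_n]) \ge 1 - \alpha$. From Theorem 1.1 (Stability), $W_\infty(\hat{\mathcal{P}}_h, \mathcal{P}_h) \le \|\hat{p}_h - p_h\|_\infty$, so it suffices to find $c_n$ such that $\liminf_{n \to \infty} \mathbb{P}(\|\hat{p}_h - p_h\|_\infty \le c_n) \ge 1 - \alpha$. To find $c_n$, we use the bootstrap.
Let $\mathcal{F} = \{f_x : x \in \mathbb{R}^D\}$, where $f_x(u) = \frac{1}{h^D} K\!\left(\frac{x - u}{h}\right)$. Using the notation of Section 1.3, it follows that $P f_x = p_h(x)$ and $P_n f_x = \hat{p}_h(x)$, so that $\sup_{f \in \mathcal{F}} |\mathbb{G}_n f| = \sqrt{n}\,\|\hat{p}_h - p_h\|_\infty$. The approximated $1 - \alpha$ quantile $q_\alpha$ can be obtained through simulation, i.e., $q_\alpha = \inf\left\{q : \frac{1}{B} \sum_{j=1}^{B} I\left(\sqrt{n}\,\|\hat{p}_h^{\,j} - \hat{p}_h\|_\infty \ge q\right) \le \alpha\right\}$, where $\hat{p}_h^{\,j}$ denotes the kernel density estimate corresponding to the $j$th bootstrap sample. The following result holds under suitable regularity conditions on the kernel $K$ for which $\mathcal{F}$ is Donsker; see Giné and Guillou [2002].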
Theorem 2.1 (Lemma 15 in Balakrishnan et al. [2013]). We have that
\[
\liminf_{n \to \infty} \mathbb{P}\left(\sqrt{n}\,\|\hat{p}_h - p_h\|_\infty \le q_\alpha\right) \ge 1 - \alpha.
\]
By the Stability Theorem, we conclude:
\[
\liminf_{n \to \infty} \mathbb{P}\left(W_\infty(\hat{\mathcal{P}}_h, \mathcal{P}_h) \in \left[0, \frac{q_\alpha}{\sqrt{n}}\right]\right) \ge 1 - \alpha.
\]

Example 2.2 (Torus). We embed the torus $S^1 \times S^1$ in $\mathbb{R}^3$ and use the rejection sampling algorithm of Diaconis et al. [2012] ($R = 1.5$, $r = 0.8$) to sample 10,000 points uniformly from the torus. Then, we compute the persistence diagram $\hat{\mathcal{P}}_h$ using the Gaussian kernel with bandwidth $h = 0.25$ and use the bootstrap to construct the 95% confidence interval $[0, 0.01]$ for $W_\infty(\hat{\mathcal{P}}_h, \mathcal{P}_h)$; see Figure 2. Notice that the confidence set correctly captures the topology of the torus. That is, only the points representing real features of the torus are significantly far from the horizontal axis.
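The bootstrap step for the kernel density estimator can be sketched as follows. This is a simplified illustration under stated assumptions, not the computation behind Figure 2: the data are one-dimensional, the kernel is Gaussian, the sup-norm is approximated by a maximum over a finite grid, and the names `gaussian_kde` and `bottleneck_radius` as well as the mixture toy data are hypothetical. The sketch returns the radius $q_\alpha / \sqrt{n}$ only; computing the persistence diagram itself would require a topology library and is omitted.

```python
import math
import random
from typing import List, Sequence


def gaussian_kde(sample: Sequence[float], grid: Sequence[float], h: float) -> List[float]:
    """Kernel density estimate with a Gaussian kernel, evaluated on a 1-D grid."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
        for x in grid
    ]


def bottleneck_radius(
    sample: Sequence[float],
    grid: Sequence[float],
    h: float,
    alpha: float = 0.05,
    B: int = 100,
    seed: int = 0,
) -> float:
    """Bootstrap estimate of q_alpha / sqrt(n), with the sup-norm taken over `grid`."""
    rng = random.Random(seed)
    n = len(sample)
    p_hat = gaussian_kde(sample, grid, h)
    stats = []
    for _ in range(B):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        p_star = gaussian_kde(resample, grid, h)
        stats.append(math.sqrt(n) * max(abs(a - b) for a, b in zip(p_star, p_hat)))
    stats.sort()
    allowed = math.floor(alpha * B)  # number of bootstrap values allowed to be >= q
    q_alpha = stats[max(0, B - allowed - 1)]
    return q_alpha / math.sqrt(n)


# Hypothetical data: a two-component Gaussian mixture on the real line.
data_rng = random.Random(3)
sample = [data_rng.gauss(-1.5, 0.5) if data_rng.random() < 0.5 else data_rng.gauss(1.5, 0.5)
          for _ in range(150)]
grid = [-4.0 + 8.0 * i / 59 for i in range(60)]
c_n = bottleneck_radius(sample, grid, h=0.3)
print(c_n)
```

A finer grid gives a better approximation of the sup-norm at a proportionally higher cost, since each bootstrap replicate re-evaluates the density estimate at every grid point.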

Landscapes
Let the diagrams $\mathcal{P}_1, \dots, \mathcal{P}_n$ be a sample from the distribution $P$ over the space of persistence diagrams $\mathcal{D}_T$. Thus, by definition, we have $x + y \le T < \infty$ and $0 \le y \le T/2$ for all $(x, y) \in \cup_i \mathcal{P}_i$. Let $L_1, \dots, L_n$ be the landscape functions corresponding to $\mathcal{P}_1, \dots, \mathcal{P}_n$. That is, $L_i(t) = L_{\mathcal{P}_i}(1, t)$, as defined in (1). We define the mean landscape $\mu(t) = E_P[L_i(t)]$, and the empirical mean landscape $\bar{L}_n(t) = \frac{1}{n} \sum_{i=1}^{n} L_i(t)$. In this section, we show that the process $\sqrt{n}\,(\bar{L}_n(t) - \mu(t))$ converges to a Gaussian process, so that we may use the procedure given in Section 1.3.
Let $\mathcal{F} = \{f_t : 0 \le t \le T\}$, where $f_t : \mathcal{D}_T \to \mathbb{R}$ is defined by $f_t(\mathcal{P}) = L_{\mathcal{P}}(1, t)$. We note here that $f_t(\mathcal{P}) = 0$ if $t \notin (0, T)$. We can now write $\sqrt{n}\,(\bar{L}_n(t) - \mu(t))$ as an empirical process indexed by $t \in [0, T]$:
\[
\mathbb{G}_n(t) := \mathbb{G}_n f_t = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left(f_t(\mathcal{P}_i) - \mu(t)\right) = \sqrt{n}\,(P_n f_t - P f_t).
\]
We note that the constant function $F(\mathcal{P}) = T/2$ is a measurable envelope for $\mathcal{F}$.
Given a probability measure $Q$ over $\mathcal{F}$, let $\|f - g\|_{Q,2} = \left(\int |f - g|^2 \, dQ\right)^{1/2}$ and let $N(\mathcal{F}, L_2(Q), \varepsilon)$ be the covering number of $\mathcal{F}$, that is, the size of the smallest $\varepsilon$-net in this metric.
Lemma 2.3 (Theorem 2.5 in Kosorok [2008]). Let $\mathcal{F}$ be a class of measurable functions satisfying $\int_0^1 \sqrt{\log \sup_Q N(\mathcal{F}, L_2(Q), \varepsilon \|F\|_{Q,2})} \, d\varepsilon < \infty$, where $F$ is a measurable envelope of $\mathcal{F}$ and the supremum is taken over all finitely discrete probability measures $Q$ with $\|F\|_{Q,2} > 0$. If $P F^2 < \infty$, then $\mathcal{F}$ is $P$-Donsker.
Theorem 2.4 (Weak Convergence of Landscapes). Let $\mathbb{G}$ be a Brownian bridge with covariance function $\kappa(t, u) = \int f_t(\mathcal{P}) f_u(\mathcal{P}) \, dP(\mathcal{P}) - \int f_t(\mathcal{P}) \, dP(\mathcal{P}) \int f_u(\mathcal{P}) \, dP(\mathcal{P})$. Then, $\mathbb{G}_n$ converges in distribution to $\mathbb{G}$.
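Proof. Since persistence landscapes are 1-Lipschitz, we have $\|f_t - f_u\|_{Q,2} \le |t - u|$ for all $t, u \in [0, T]$ and every probability measure $Q$. Hence the functions $f_t$ indexed by a regular grid on $[0, T]$ with spacing $\varepsilon \|F\|_{Q,2} = \varepsilon T / 2$ form an $\varepsilon \|F\|_{Q,2}$-net of $\mathcal{F}$, so that $\sup_Q N(\mathcal{F}, L_2(Q), \varepsilon \|F\|_{Q,2}) \le 2/\varepsilon + 1$ and the integral in Lemma 2.3 is finite. Since $P F^2 = T^2/4 < \infty$, Lemma 2.3 implies that $\mathcal{F}$ is $P$-Donsker.

Now that we have shown that $\mathbb{G}_n$ converges to a Gaussian process, we can follow the procedure outlined in Section 1.3. Let $P_n$ be the empirical measure that puts mass $1/n$ at each diagram $\mathcal{P}_i$. We draw $\mathcal{P}_1^*, \dots, \mathcal{P}_n^*$ from $P_n$ and construct the corresponding landscapes $L_1^*, \dots, L_n^*$. Let $\bar{L}_n^*$ be their empirical mean and $\hat{\theta}^* = \sup_{t \in \mathbb{R}} |\sqrt{n}\,(\bar{L}_n^*(t) - \bar{L}_n(t))|$. Repeating this $B$ times, we obtain $\hat{\theta}_1^*, \dots, \hat{\theta}_B^*$, and we compute the quantile $q_\alpha$.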
Example 2.6 (Circles). Given the nine circles of radii 0.4 and 0.3, shown in Figure 3, we obtain a sample $X_1, \dots, X_{100}$ as follows: for each point, first choose a circle $C_i$ uniformly at random, then sample a point from $C_i$. Let $\mathcal{P}$ be the (Betti 1) persistence diagram corresponding to the Rips filtration for the sample, and $L$ be the landscape corresponding to $\mathcal{P}$. We repeat this 50 times to obtain diagrams $\mathcal{P}_1, \dots, \mathcal{P}_{50}$ and landscapes $L_1, \dots, L_{50}$. Then, we use the bootstrap procedure to obtain the quantile $q_\alpha = 0.234$. Together with $\bar{L}_{50}$, this gives us an approximated 95% confidence band for $\mu(t) = E_P(L_i(t))$. On the right of Figure 3 we show the empirical mean landscape $\bar{L}_{50}$ with the 95% confidence band for $\mu(t)$.
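The landscape bootstrap can be sketched in Python as follows. This is an illustrative sketch, not the code used for Figure 3: the diagrams are randomly generated placeholders rather than Rips diagrams, the supremum over $t$ is approximated on a finite grid, and the names `landscape_on_grid`, `mean_curve`, and `landscape_band` are hypothetical. The band is taken to be the empirical mean landscape plus or minus $q_\alpha / \sqrt{n}$.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (mid-life, half-life)


def landscape_on_grid(diagram: Sequence[Point], grid: Sequence[float]) -> List[float]:
    """First landscape L(z) = max over points of max(0, y - |z - x|), on a grid."""
    return [
        max((max(0.0, y - abs(z - x)) for x, y in diagram), default=0.0)
        for z in grid
    ]


def mean_curve(curves: Sequence[Sequence[float]]) -> List[float]:
    """Pointwise average of curves sampled on a common grid."""
    n = len(curves)
    return [sum(column) / n for column in zip(*curves)]


def landscape_band(
    diagrams: Sequence[Sequence[Point]],
    grid: Sequence[float],
    alpha: float = 0.05,
    B: int = 200,
    seed: int = 0,
) -> Tuple[List[float], List[Tuple[float, float]], float]:
    """Return the mean landscape, the (lower, upper) band, and q_alpha."""
    rng = random.Random(seed)
    n = len(diagrams)
    curves = [landscape_on_grid(d, grid) for d in diagrams]
    mean = mean_curve(curves)
    stats = []
    for _ in range(B):
        # Resampling diagrams is the same as resampling their landscapes.
        resampled = [curves[rng.randrange(n)] for _ in range(n)]
        star = mean_curve(resampled)
        stats.append(math.sqrt(n) * max(abs(a - b) for a, b in zip(star, mean)))
    stats.sort()
    q_alpha = stats[max(0, B - math.floor(alpha * B) - 1)]
    half = q_alpha / math.sqrt(n)
    return mean, [(m - half, m + half) for m in mean], q_alpha


# Hypothetical diagrams: 30 diagrams of 4 random points each, bounded by T = 2.
diag_rng = random.Random(4)
diagrams = [
    [(diag_rng.uniform(0.5, 1.5), diag_rng.uniform(0.0, 0.5)) for _ in range(4)]
    for _ in range(30)
]
grid = [2.0 * i / 49 for i in range(50)]
mean, band, q_alpha = landscape_band(diagrams, grid)
print(q_alpha, band[25])
```

Precomputing each landscape on the grid once keeps every bootstrap replicate to a simple average of stored curves, so the cost of the resampling loop does not depend on the size of the diagrams.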

Discussion
In this paper, we have described the bootstrap as it applies to persistence diagrams and landscapes. The purpose of this paper was to introduce the bootstrap and the bootstrap empirical process to topologists. In a related paper (Balakrishnan et al. [2013]), aimed towards a statistical audience, we derive the convergence rates for the technique presented in Section 2.1, as well as present three other methods for computing confidence sets for persistence diagrams.
The persistence landscape can be thought of as a summary function of a persistence diagram. The bootstrap method that we presented in Section 2.2 trivially generalizes to handle all landscapes L(k, t). Furthermore, we need not limit the scope of this method to landscape functions. In a future paper, we plan to investigate other meaningful summary functions as well as the convergence rates for the techniques presented in Section 2.2.
We have demonstrated how the bootstrap works for two examples, given in Figure 2 and Figure 3. Part of our ongoing research is investigating applications for these confidence intervals; in particular, we are applying them to real (rather than simulated) data sets. One can use the confidence intervals for hypothesis testing, but an open question is how to determine the power of such a test.