This site is devoted to mathematics and its applications. Created and run by Peter Saveliev.

The topology of data

From Mathematics Is A Science
Jump to navigationJump to search

Project by Joseph Snyder, supervised by Peter Saveliev

Introduction

Suppose we have conducted 1000 experiments with a set of 100 various measurements in each. Then each experiment is a string of 100 numbers, or simply a vector of dimension 100. The result is a collection of disconnected 1000 points, called the point cloud, in a 100-dimensional vector space.

It is impossible to visualize this data as any representation that one can see is limited to dimension 3. Yet we still need to answer a few simple topological questions about the object behind the point cloud:

  • Is it one piece or more?
  • Is there a tunnel?
  • Or a void?
  • And what about possible 100-dimensional topological features?

Through clustering (and related approaches) statistics answers the first question. This is a common topological approach to the problem. For a point cloud in a euclidean space, suppose we are given a threshold r so that any two points within r from each other are to be considered "close". Then each pair of such points is connected by an edge. If three points are “close”, we add a face, etc. The result is a simplicial complex that approximates the manifold M behind the point cloud. More: Topological data analysis.

Abstract

The recent advancements in data collection have lead to a tremendous growth of multidimensional data. With more advanced data comes the difficult task of analyzing the data to determine its significant features as the data is impossible to visualize. My research is focused on improving computational methods of analysis of data presented via point clouds. Working with the program jPlex developed by the computational topology group at Stanford, we have developed a method for determining persistent relative homology of these point clouds and use it to compute the dimension of the datasets.

Objectives

The overall goal is to develop a computer program that determines the topological features of multidimensional geometric figures that represent data via point clouds. Working with JPlex, we have achieved this goal by:


Background

Betti Numbers: B0 = 1 (parts), B1 = 2 (tunnels), B2 = 1 (voids). Notice that the two tunnels are captured by the red and blue cycles. The cycles are not able to be contracted to a single point and as such capture the tunnels. The interior of the tire is the void we are considering in B2.

We will compute the spacial properties that are preserved under topological transformations, i.e. you can bend, stretch, or shrink but not tear or glue. We determine topological features: connected components, tunnels, voids, etc of the simplicial complex. This information is given by the Betti numbers.

The Vietoris-Rips complex is a simplicial complex constructed from a point cloud by making each set of k points within a given distance ϵ form a (k-1)-simplex:

  • 2 points within ϵ form an edge,
  • 3 points, every pair which is within ϵ, form a triangle,
  • 4 points, each pair of which is within ϵ, form a tetrahedron.

The question becomes: How do we know which value of ϵ to use? The answer is to vary ϵ and construct a simplicial complex for each value of ϵ. Then, we display barcodes that represent the lifespans of the topological features. The length of the lifespan is also known as persistence. It measures the significance of the topological feature.

Method

Drawing inspiration from [2], we set out to investigate the program Jplex. We are working with the most up to date version of jPlex. The software is available to download from CompTop [1] at Stanford University.

Initially, we confirmed the accuracy of JPlex by analyzing known data sets to determine their persistent Betti numbers. Once we understood how JPlex worked [1], we formulated a method for calculating persistent relative homology. We developed a JAVA based program that can be run in the JPlex window that constructs a distance matrix from the point cloud. In order to compute the relative homology we set all points farther than a chosen distance from the center to a single point. For example, a line turns into a circle, a plane into a sphere, etc (see Quotient spaces). With persistent Betti numbers and persistent relative homology, we determine the dimension of the data set. In order to speed up this computation we condensed non-essential data when analyzing point clouds for persistent relative homology.

Copy plex.jar and the necessary files to your home directory such as My Documents. In a command window, change the folder to the one of jPlex, such as:

 >cd documents/jplex

Then:

 java -cp plex.jar JPlex

(watch the caps). jPlex window opens:

 >plex

Then type:

 >pdata = Plex. EuclideanArrayData("fullsquaredata.txt");
 >parameter = 25;

That's the radius.

 >bg("distancematrix.bsh");

Or "distancematrix3.bsh" etc. Then

 >bg("distances.bsh");
 >pdatarelative = Plex.DistanceData(distances);
 >ripsrelative = Plex.RipsStream(0.01,3,25,pdatarelative);
 >intervals = Plex.Persistence().ComputeInterval(ripsrelative);
 >Plex.Plot(intervals,"ripsrelative",25);

Persistent Betti Numbers

Using JAVA, we developed data sets that are well understood, made them noisy, and applied JPlex to these sets. A few examples that were analyzed include:

 N-Spheres with N ranging from 1 to 3
 Lines
 Squares
 Planes

Overall, as long as the data was somewhat recognizable, the expected persistent Betti numbers were discovered by JPlex.

Dimension of a Data Set

Below is an example of how we determine the dimension of a point cloud using our program and tools already included in JPlex.

A plot of 3-dimensional point cloud.jpg

Barcodes of Betti numbers.jpg

In the above barcodes, there is one feature in dimension 0 that has a much longer lifespan than other dimension 0 features. In dimension 1, no single feature persists significantly.

Barcodes of the relative Betti numbers.jpg

Analyzing the above barcodes, we see that in dimension 0, one feature persists longer that other dimension 0 features. In dimension 1, no single feature has a particularly long lifespan compared to other dimension 1 features. In dimension 2, one features has a significant lifespan.

By analyzing the barcodes of relative Betti numbers, we determine the dimension of our data set. In this example, the persistent Betti numbers are:

 B0 = 1, B1 = 0, B2 = 0, …

Thus the data set constitutes a single part, and has no other topological features. Next, the persistent relative Betti numbers are:

 B0 = 1, B1 = 0, B2 = 1, B3 = 0, …

It follows that the data set is actually a two-dimensional plane.

Issues with JPlex

The method we have developed still has flaws; however, with further research one can improve upon this method. Listed below are a few areas of concern:

  • JPlex only allows analysis up to dimension 7.
  • The current exterior program counts points as non-essential only if there are two points in a row that are outside of the given distance parameter.
  • JPlex has memory issues with relatively small data sets when analyzing higher dimensions (dimension four and up).

Conclusions

  • By computing persistent Betti numbers and persistent relative homology, one can better understand the topology of data.
  • Using a distance matrix, it is fairly simple to determine relative persistent homology with JPlex.
  • Using JPlex in companion with our program, we determine the dimension of a data set.
  • Further research is required to demonstrate the advantages of this method over PCA and other traditional methods.


Proposed experiment

Create a digital geometric figure with no symmetries, create the dataset of all photos (or just the projections) from all possible directions, run the algorithm -- it's supposed to discover the rotation group SO(3): $$H_1(SO(3);{\bf Z})={\bf Z}_2.$$

References

  1. The tutorial is written by Henry Adams, a graduate student at Stanford University and is available for both the Standalone version [2] and the MATLAB version [3] of jPlex.
  2. Topology and Data by G. Carlsson
  3. Topological data analysis
  4. Introductory algebraic topology: course

Appendix

See also: