The Elusive Shape of Big Data
Mentor: John Arlo Caine, Assistant Professor of Math, California State Polytechnic University Pomona
Experiments tightly control variables and so yield low-dimensional data by design. The growth of information technology, however, has produced “big data”: complex, high-dimensional data sets that resist traditional analysis techniques. The nascent field of computational topology promises to fill this gap, but its progress is limited by inefficient algorithms. We report on this bottleneck and on our new implementation of tidy set, an algorithm for the efficient computation of homology for clique complexes. A clique complex is built from geometric building blocks called simplices in order to approximate a given data set, and the homology of this complex then reveals the fundamental shape of the data. To speed up the computation of homology, tidy set draws on category theory to reduce a clique complex to a bare-bones simplicial set without changing its homology. This is done by selectively deleting and collapsing simplices. The resulting drop in the time and memory required to compute homology allows homology analysis to be applied to much larger data sets. Future work would extend these ideas to reduce, in the same way, the time and memory required to compute homology for a family of clique complexes parameterized by a distance $\varepsilon$, thus enabling wider use of the more sophisticated persistent homology.
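To make the objects above concrete, the following is a minimal Python sketch of the unreduced baseline: it builds a clique complex from a point cloud at a fixed distance $\varepsilon$ and computes its Betti numbers by brute-force rank computation. It is not the tidy set algorithm and performs no deletion or collapsing of simplices. The function names (`clique_complex`, `rank_gf2`, `betti_numbers`), the sample points, the choice of $\mathbb{Z}/2$ coefficients, and the cutoff at dimension 2 are all assumptions made for illustration.

```python
from itertools import combinations
from math import cos, sin, pi, dist

def clique_complex(points, eps, max_dim=2):
    """Simplices (sorted vertex tuples) of the clique complex, grouped by dimension."""
    n = len(points)
    # Neighborhood graph: join two points when they lie within eps of each other.
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if dist(points[i], points[j]) <= eps:
            adj[i].add(j)
            adj[j].add(i)
    simplices = {0: [(i,) for i in range(n)]}
    for d in range(1, max_dim + 1):
        simplices[d] = []
        for s in simplices[d - 1]:
            # Extend a clique by any larger vertex adjacent to all of its vertices.
            common = set.intersection(*(adj[v] for v in s))
            simplices[d].extend(s + (w,) for w in sorted(common) if w > s[-1])
    return simplices

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are stored as integer bitmasks."""
    pivots = {}  # leading bit position -> reduced row with that leading bit
    for r in rows:
        while r:
            lead = r.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = r
                break
            r ^= pivots[lead]
    return len(pivots)

def betti_numbers(simplices, top_dim):
    """Betti numbers b_0..b_top_dim with Z/2 coefficients (assumed for simplicity)."""
    index = {d: {s: k for k, s in enumerate(ss)} for d, ss in simplices.items()}

    def boundary_rank(d):
        if d == 0 or d not in simplices:
            return 0
        rows = []
        for s in simplices[d]:
            row = 0
            for i in range(len(s)):
                face = s[:i] + s[i + 1:]  # drop the i-th vertex
                row |= 1 << index[d - 1][face]
            rows.append(row)
        return rank_gf2(rows)

    # b_d = dim C_d - rank(boundary_d) - rank(boundary_{d+1})
    return [len(simplices[d]) - boundary_rank(d) - boundary_rank(d + 1)
            for d in range(top_dim + 1)]

# Hypothetical data set: 8 evenly spaced points on the unit circle.
circle = [(cos(2 * pi * k / 8), sin(2 * pi * k / 8)) for k in range(8)]
loop = betti_numbers(clique_complex(circle, eps=1.0), top_dim=1)
filled = betti_numbers(clique_complex(circle, eps=3.0), top_dim=1)
print(loop)    # [1, 1]: one connected component, one loop
print(filled)  # [1, 0]: every pair is joined, so the loop is filled in
```

Even in this small example the number of simplices grows quickly with $\varepsilon$ (8 edges and no triangles at $\varepsilon = 1$, against 28 edges and 56 triangles at $\varepsilon = 3$), which illustrates the cost that a reduction such as tidy set is meant to avoid.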