
Scaling hierarchical agglomerative clustering to trillion-edge graphs


In a line of work beginning in 2021, we carefully investigated the complexity and parallel scalability of HAC from a sparse graph-theoretic perspective. We observed that, in practice, only m << n² of the most similar entries in the similarity matrix were actually relevant for computing high-quality clusters. Focusing on this sparse setting, we set out to determine whether we could design HAC algorithms that exploit sparsity to achieve two key goals (a small sketch of such a sparse similarity graph follows the list below):

  1. a number of computation steps proportional to the number of similarities (m), rather than n², and
  2. high parallelism and scalability.
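
To make the sparse setting concrete, here is a minimal sketch (ours, not code from the papers) that builds a sparse similarity graph by keeping only the k most similar neighbors of each point, so the graph has roughly n·k edges instead of the n² entries of the dense similarity matrix. The point set, the choice of k, and the use of cosine similarity are illustrative assumptions.

```python
# Minimal sketch: keep only the k most similar neighbors of each point, so
# the graph has about n*k edges instead of the n^2 entries of the dense matrix.
import numpy as np

def sparse_knn_similarity_graph(points, k=10):
    """Return a dict {u: [(v, similarity), ...]} with k edges per point."""
    # Cosine similarity between all pairs (brute force for illustration only;
    # a real pipeline would use an approximate nearest-neighbor index).
    normed = points / np.linalg.norm(points, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-similarities
    graph = {}
    for u in range(len(points)):
        top_k = np.argpartition(-sims[u], k)[:k]  # indices of the k largest
        graph[u] = [(int(v), float(sims[u, v])) for v in top_k]
    return graph

# Example: 1,000 points in 16 dimensions -> about 10,000 directed edges,
# versus 1,000,000 entries in the full similarity matrix.
graph = sparse_knn_similarity_graph(np.random.rand(1000, 16), k=10)
```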

We started by carefully exploring the first goal, since achieving it would mean that HAC could be solved much faster on sparse graphs, presenting a path to scaling to extremely large datasets. We showed that this is possible by presenting an algorithm for the problem on sparse graphs that runs in sub-quadratic work (specifically, O(n√m) work, up to poly-logarithmic factors). We obtained the algorithm by designing a careful accounting scheme that combines the classic nearest-neighbor chain algorithm for HAC with a dynamic edge-orientation algorithm.
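
For reference, here is a minimal sketch of the classic nearest-neighbor chain algorithm that this result builds on, written against a dense similarity matrix for readability. It is not the sub-quadratic graph-based variant, whose efficiency comes from answering the nearest-neighbor queries cheaply via dynamic edge orientation; the size-weighted average-linkage update and the -inf sentinel for removed clusters are our own illustrative choices.

```python
# Minimal sketch of the nearest-neighbor chain algorithm (dense, O(n^2) space).
import numpy as np

def nn_chain_hac(sim):
    """sim: symmetric n x n similarity matrix. Returns the merge sequence."""
    sim = np.array(sim, dtype=float)
    n = len(sim)
    np.fill_diagonal(sim, -np.inf)          # no self-merges
    active = set(range(n))
    size = {i: 1 for i in range(n)}
    merges, chain = [], []
    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))
        top = chain[-1]
        # Most similar active cluster to the top of the chain.
        nbr = max((c for c in active if c != top), key=lambda c: sim[top, c])
        if len(chain) > 1 and nbr == chain[-2]:
            # Reciprocal nearest neighbors: safe to merge for reducible
            # linkages such as average linkage.
            a, b = chain.pop(), chain.pop()
            merges.append((a, b, float(sim[a, b])))
            # Average-linkage update: size-weighted mean of the similarities.
            w = (sim[a] * size[a] + sim[b] * size[b]) / (size[a] + size[b])
            sim[a], sim[:, a] = w, w
            sim[a, a] = -np.inf
            size[a] += size[b]
            active.remove(b)
            sim[b], sim[:, b] = -np.inf, -np.inf
        else:
            chain.append(nbr)
    return merges

# Example: cluster 50 random points using their cosine similarities.
pts = np.random.rand(50, 8)
normed = pts / np.linalg.norm(pts, axis=1, keepdims=True)
print(len(nn_chain_hac(normed @ normed.T)))   # 49 merges for 50 points
```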

However, the sub-quadratic work algorithm requires maintaining a complicated dynamic edge-orientation data structure and is not easy to implement. We therefore complemented our exact algorithm with a simple algorithm for approximate average-linkage HAC, which runs in nearly-linear work, i.e., O(m + n) up to poly-logarithmic factors, and is a natural relaxation of the greedy algorithm. Let ε be an accuracy parameter. More formally, a (1+ε)-approximation of HAC is a sequence of cluster merges where the similarity of each merge is within a factor of (1+ε) of the highest-similarity edge in the graph at that time (i.e., the merge that the greedy exact algorithm would perform).
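
The merge rule is easy to state in code. The sketch below (our illustration, with a made-up edge set) lists every edge that a (1+ε)-approximate step is allowed to merge, rather than only the single best edge that the exact greedy algorithm would pick.

```python
# Minimal sketch of the (1+eps)-approximate merge rule: any edge whose
# similarity is within a (1+eps) factor of the current maximum is acceptable.
def acceptable_merges(edge_sims, eps=0.1):
    """edge_sims: dict {(u, v): similarity}. Returns the edges that a
    (1+eps)-approximate HAC step may merge next."""
    best = max(edge_sims.values())
    return [e for e, s in edge_sims.items() if s * (1 + eps) >= best]

# Example: with eps = 0.1, the 0.95 edge may be merged even though a 1.0 edge
# exists, since 0.95 * 1.1 >= 1.0.
print(acceptable_merges({("a", "b"): 1.0, ("c", "d"): 0.95, ("e", "f"): 0.5}))
```

Allowing any of these merges, instead of exactly the best one, is the slack that the nearly-linear-work algorithm, and later the parallel algorithm, exploit.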

Experimentally, this notion of approximation (say, for ε = 0.1) incurs a minimal quality loss over the exact algorithm on the same graph. Moreover, the approximate algorithm yielded speedups of over 75× over the quadratic-work algorithms and could scale to inputs with tens of millions of points. However, our implementations were slow for inputs with more than a few hundred million edges, as they were entirely sequential.

The next step was to design a parallel algorithm with the same work and a provably low number of sequential dependencies (formally, low depth, i.e., a short critical path). In a pair of papers from NeurIPS 2022 and ICALP 2024, we studied how to obtain good parallel algorithms for HAC. First, we confirmed the common wisdom that exact average-linkage HAC is hard to parallelize due to the sequential dependencies between successive greedy merges. Formally, we showed that the problem is as hard to parallelize as any other problem solvable in polynomial time. Thus, barring a breakthrough in complexity theory, exact average-linkage HAC is unlikely to admit fast parallel algorithms.

On the algorithmic side, we developed a parallel approximate HAC algorithm, called ParHAC, which we show is highly scalable and runs in near-linear work and poly-logarithmic depth. ParHAC works by grouping the edges into O(log n) weight classes and processing each class using a carefully designed low-depth symmetry-breaking algorithm. ParHAC enabled the clustering of the WebDataCommons hyperlink graph, one of the largest publicly available graph datasets with over 100 billion edges, in just a few hours on a moderately sized multicore machine.
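
To illustrate the first ingredient, the sketch below (ours, not ParHAC's code) groups edges into geometric weight classes, so that all edges in a class have similarities within a (1+ε) factor of one another and can be considered for merging in the same parallel round. The toy edge set and the assumption that similarities come from a polynomially bounded range (which is what gives O(log n) classes) are our own; the low-depth symmetry-breaking step applied within each class is omitted.

```python
# Minimal sketch of edge bucketing into geometric weight classes
# [w_max/(1+eps)^(i+1), w_max/(1+eps)^i), heaviest class first.
import math
from collections import defaultdict

def weight_classes(edge_sims, eps=0.1):
    """edge_sims: dict {(u, v): similarity > 0}. Returns classes from
    heaviest to lightest; edges in one class differ by at most (1+eps)."""
    w_max = max(edge_sims.values())
    buckets = defaultdict(list)
    for edge, s in edge_sims.items():
        i = math.floor(math.log(w_max / s, 1 + eps))
        buckets[i].append(edge)
    return [buckets[i] for i in sorted(buckets)]

# Example: the two near-maximal edges land in the same (first) class.
print(weight_classes({("a", "b"): 1.0, ("c", "d"): 0.95, ("e", "f"): 0.5}))
```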
