Research Article

Austin J Proteomics Bioinform & Genomics. 2015;2(1): 1008.

An Efficient Hierarchical Clustering Algorithm for Large Datasets

Olga Tanaseichuk1,2*, Alireza Hadj Khodabakshi1, Dimitri Petrov1, Jianwei Che1, Tao Jiang2, Bin Zhou1, Andrey Santrosyan1 and Yingyao Zhou1

1Genomics Institute of the Novartis Research Foundation, 10675 John Jay Hopkins Drive, San Diego, CA 92121, USA

2Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA

*Corresponding author: Tanaseichuk O, Genomics Institute of the Novartis Research Foundation, 10675 John Jay Hopkins Drive, San Diego, CA 92121, USA

Received: September 24, 2014; Accepted: February 23, 2015; Published: February 25, 2015

Abstract

Hierarchical clustering is a widely adopted unsupervised learning algorithm for discovering intrinsic groups embedded within a dataset. Standard implementations of the exact algorithm for hierarchical clustering require $O(n^2)$ time and $O(n^2)$ memory and thus are unsuitable for processing datasets containing more than 20 000 objects. In this study, we present a hybrid hierarchical clustering algorithm requiring approximately $O(n\sqrt{n})$ time and $O(n\sqrt{n})$ memory while still preserving the most desirable properties of the exact algorithm. The algorithm was capable of clustering one million compounds within a few hours on a single processor. The clustering program is freely available to the research community at https://carrier.gnf.org/publications/cluster.

Keywords: Hybrid hierarchical clustering; Hierarchical clustering; k-means clustering; Large datasets

Introduction

Clustering is a popular unsupervised learning technique used to identify groups of objects within a given dataset, where intra-group objects tend to be more similar than inter-group objects. There are many different clustering algorithms [1], with applications in bio-/cheminformatics and other data mining fields [2,3], including studies on protein families [4], functional genomics [2], chemical scaffolds [5], etc. In particular, clustering algorithms have been widely adopted in the bioinformatics field after Treeview [6], a user-friendly visualization program, was made available following early studies on gene expression datasets.

Among all clustering methods, hierarchical clustering and k-means clustering are arguably the two most popular algorithms, owing to the simplicity of interpreting their results. In the cheminformatics field, Ward's clustering [7] and Jarvis-Patrick clustering [8] are the corresponding algorithms, similar in spirit to hierarchical clustering and k-means clustering, respectively. Although there is no definitive answer as to which algorithm is more accurate, hierarchical clustering has been applied more often in bio-/cheminformatics research because of its deterministic behavior and the flexibility of flattening the resultant tree at different cutoff levels.

However, applying hierarchical clustering to large datasets is rather challenging. First, compared to the linear complexity of the k-means algorithm, the most popular average-linkage hierarchical clustering requires $O(n^2)$ time; we even observed $O(n^3)$-time implementations in some popular bioinformatics tools [9]. Second, it requires $O(n^2)$ memory [10], which limits the number of input data points to ~20 000 for a typical desktop computer. In bioinformatics research, functional genomics profiling data approaching this limit are routinely generated for the human genome [11]. In cheminformatics research, modern drug discovery applies ultrahigh-throughput screening (uHTS) to several million compounds in one experiment [12]. Two problems arise from uHTS. First, to expand the screening compound collection, vendor catalogs of millions of compounds ideally should be hierarchically clustered and prioritized for acquisition; in practice, however, informaticians resort to a greedy algorithm such as Sphere Exclusion [13], which relies on a predetermined similarity threshold. Second, instead of analyzing all compound profiles across a panel of screening assays, hierarchical clustering analyses have usually been compromised and restricted to the ~20 000 top screening hits due to memory limitations. Therefore, there exists a significant need for a hierarchical clustering algorithm that handles large datasets.
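To make the memory bound concrete, consider a back-of-the-envelope estimate (our illustration, not a figure quoted from [10]): a dense double-precision distance matrix for n = 20 000 objects occupies

$n^{2} \times 8~\text{bytes} = (2 \times 10^{4})^{2} \times 8~\text{bytes} \approx 3.2~\text{GB},$

or about half of that (~1.6 GB) if only the condensed upper triangle of n(n-1)/2 pairwise distances is stored, which already strains the memory of a typical desktop computer.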

Approximating hierarchical clustering in subquadratic time and memory has been attempted previously [14-18]. However, these methods either rely on embedding into spaces that are not biologically sensible, or they produce very low-resolution hierarchical structures. Our goal is to produce hierarchical results at the same resolution as the exact hierarchical method, although with less accuracy, while maintaining biologically/chemically meaningful distance metrics. For a dataset of over 20 000 objects, we are limited by both the $O(n^2)$ memory and the $O(n^2)$ time requirements; therefore, a reasonable approximation needs to be introduced. We observe that if an exact hierarchical tree has been constructed, one can set a similarity cutoff such that the tree branches above the cutoff are distant enough from each other and represent the coarse clusters of the dataset, while the branches and leaves below the cutoff represent hierarchical structures within small-scale local vicinities. For a large dataset, we are often initially interested in a “zoomed out” view of the coarse clusters, and then “zoom in” to neighborhoods of interest for a finer view of the intra-group structures. These two views are often considered the most beneficial properties of hierarchical clustering. For example, in the aforementioned compound acquisition problem, one would cherry-pick vendor compounds from the coarse neighborhoods if only a small number of compounds can be selected for purchase. When budget and logistics permit, one could then lower the cutoff to pick more compounds within interesting coarse clusters.
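As a concrete illustration of this zoom-out/zoom-in usage, the following minimal sketch cuts an exact average-linkage tree at two distance cutoffs using SciPy's standard routines (the toy dataset and cutoff values are our assumptions, not taken from this study):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(200, 8)             # toy dataset, small enough for the exact method
tree = linkage(X, method='average')    # exact average-linkage tree
# High cutoff: "zoomed out" view of distant, coarse clusters.
coarse = fcluster(tree, t=1.2, criterion='distance')
# Lower cutoff: "zoomed in" view of the fine intra-group structure.
fine = fcluster(tree, t=0.6, criterion='distance')
print(len(set(coarse)), 'coarse clusters;', len(set(fine)), 'fine clusters')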

To capture both distant and close views of the hierarchical structure of a large dataset, we propose a hybrid hierarchical clustering algorithm (Figure 1). Initially, the n objects are clustered by a k-means clustering algorithm, where k is chosen to be reasonably large, into roughly k coarse neighborhoods. We then apply the exact hierarchical clustering algorithm to the k centroids to build a coarse tree, and to the objects within each of the k clusters to build k detailed trees. By replacing the k centroids in the coarse tree with the corresponding detailed trees, this two-step hybrid algorithm assembles a complete tree of n objects that can be cut, i.e., zoomed in and zoomed out, at various levels. The number k can be selected by the user and controls the cutoff reflecting the average similarities of objects within each coarse neighborhood. Practically, we cannot reliably distinguish data points positioned closer than the magnitude of the intrinsic noise of the data; therefore, such data points can be treated as one aggregated data object without losing meaningful interpretation of the dataset. If the value of k is large enough, the size of individual k-means clusters approaches the intrinsic noise and the k-means clustering step retains most of the essential information; thus the tree resulting from the hybrid algorithm can be considered as accurate as the one resulting from the exact agglomerative hierarchical clustering (AHC) algorithm. If optimized for clustering speed, $k \sim \sqrt{n}$ can be chosen to yield an approximate running time of $O(n\sqrt{n})$ and storage of $O(n\sqrt{n})$, as discussed later in detail.
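A minimal sketch of the two-step procedure is shown below, assuming NumPy, SciPy and scikit-learn are available; the function hybrid_tree and the outlined tree-splicing step are our illustrations, not the authors' released implementation:

import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage

def hybrid_tree(X, k=None):
    # Step 1: partition the n objects into roughly k coarse neighborhoods.
    n = len(X)
    if k is None:
        k = max(2, int(np.sqrt(n)))   # k ~ sqrt(n), the speed-optimal choice (see text)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Step 2a: exact average-linkage clustering of the k centroids (coarse tree).
    coarse = linkage(km.cluster_centers_, method='average')
    # Step 2b: exact average-linkage tree within each coarse neighborhood.
    detailed = {}
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        if len(members) > 1:
            detailed[c] = (members, linkage(X[members], method='average'))
    # Assembly (outlined only): replacing the leaf for centroid c in the coarse
    # tree with the root of detailed[c] yields a complete tree over all n objects.
    return coarse, detailed

X = np.random.rand(10000, 16)   # 10^4 points; a dense distance matrix would already need ~0.8 GB
coarse_tree, detailed_trees = hybrid_tree(X)

With $k \sim \sqrt{n}$, each linkage call above operates on only about $\sqrt{n}$ objects, which is what reduces the overall cost to approximately $O(n\sqrt{n})$.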