MC-UPGMA - Accurate hierarchical clustering for huge data

The MC-UPGMA (Memory Constrained UPGMA) framework allows accurate hierarchical clustering of huge datasets, i.e. datasets too large to fit into memory for conventional clustering algorithms. Here we supply a C++ implementation of the multi-round algorithm described in [1], which we used to cluster the entire UniRef90 BLAST data reported therein (about 1.5 billion similarities among 1.8 million non-redundant protein sequences).
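
For context, the conventional in-memory UPGMA (average linkage) procedure that the framework is designed to scale beyond can be sketched roughly as follows. This is an illustrative sketch only and is not taken from the MC-UPGMA package: the names `Merge` and `upgma` are hypothetical, the input is assumed to be a full symmetric matrix of dissimilarities held in memory, and the multi-round, memory-constrained scheme of [1] is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical record of one merge: clusters `a` and `b` were joined at
// average distance `height`, producing a new cluster with id `merged`.
struct Merge {
    std::size_t a;
    std::size_t b;
    double height;
    std::size_t merged;
};

// Naive in-memory UPGMA over a full symmetric distance matrix (assumption:
// the whole matrix fits in memory, which is exactly what fails for huge data).
// Leaves have ids 0..n-1; the k-th merge (k = 0, 1, ...) creates id n+k.
std::vector<Merge> upgma(std::vector<std::vector<double>> dist) {
    const std::size_t n = dist.size();
    for (const auto& row : dist) assert(row.size() == n);  // must be square

    std::vector<std::size_t> id(n), size(n, 1);
    std::vector<bool> active(n, true);
    for (std::size_t i = 0; i < n; ++i) id[i] = i;

    std::vector<Merge> merges;
    for (std::size_t step = 0; step + 1 < n; ++step) {
        // Find the closest pair of currently active clusters.
        double best = std::numeric_limits<double>::infinity();
        std::size_t bi = 0, bj = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (!active[i]) continue;
            for (std::size_t j = i + 1; j < n; ++j) {
                if (active[j] && dist[i][j] < best) {
                    best = dist[i][j];
                    bi = i;
                    bj = j;
                }
            }
        }

        // Average linkage: the distance from the merged cluster to any other
        // cluster k is the size-weighted mean of the two old distances.
        for (std::size_t k = 0; k < n; ++k) {
            if (!active[k] || k == bi || k == bj) continue;
            const double d = (size[bi] * dist[bi][k] + size[bj] * dist[bj][k]) /
                             static_cast<double>(size[bi] + size[bj]);
            dist[bi][k] = dist[k][bi] = d;
        }

        // The merged cluster reuses slot bi; slot bj is retired.
        merges.push_back({id[bi], id[bj], best, n + step});
        id[bi] = n + step;
        size[bi] += size[bj];
        active[bj] = false;
    }
    return merges;
}
```

Calling `upgma` on an n-by-n matrix returns the n-1 merges in order, which together describe the tree. The sketch needs memory quadratic in the number of items for the matrix alone, which illustrates why a dataset with millions of sequences cannot be clustered this way.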

The original clustering code for this package was contributed by Elon Portugaly, while the rest of the MC-UPGMA coding was done by Yaniv Loewenstein. The code is distributed under the GNU General Public License (GPL), and is intended for (but not restricted to) academic use. We hope that it will be instrumental in your research, but please note that the code is provided on an "as is" basis and comes with no promise of support whatsoever.

Version 1.0.0

First release of the multi-round MC-UPGMA code. Download here.

[1] - Loewenstein Y, Portugaly E, Fromer M, Linial M. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics 2008 24: i41-i49; Presented at ISMB 2008, Toronto.