No news is good news. MCL is very stable and used in many places. The latest release is mcl-14-137. A feature added in that release is the ability to subcluster a given clustering. This is done by giving mcl the name of a cluster file (as output by mcl) with the new -icl option. Edges in the input graph that cross clusters are removed before the MCL cluster algorithm is applied to compute the subclustering.
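The pruning step can be sketched as follows (a minimal illustration of the idea, not mcl's actual code): keep only those edges whose endpoints fall in the same cluster of the given clustering.

```python
# Sketch of the edge pruning behind the -icl option (my reading of the
# release note above, not mcl's source): edges whose endpoints lie in
# different clusters are dropped, after which MCL is run on the rest.

def restrict_to_clusters(edges, clustering):
    """Keep only edges whose endpoints share a cluster.

    edges: iterable of (u, v, weight) tuples.
    clustering: dict mapping node -> cluster id, as one would read it
    from an mcl cluster file.
    """
    return [(u, v, w) for (u, v, w) in edges
            if clustering.get(u) == clustering.get(v)]

edges = [("a", "b", 1.0), ("b", "c", 0.5), ("c", "d", 2.0)]
clustering = {"a": 0, "b": 0, "c": 1, "d": 1}
print(restrict_to_clusters(edges, clustering))
# [('a', 'b', 1.0), ('c', 'd', 2.0)]  -- the b-c edge crosses clusters
```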
This paper, Improving the Quality of Protein Similarity Network Clustering Algorithms using the Network Edge Weight Distribution, by Apeltsin et al. in Bioinformatics, discusses the effect of edge weight thresholding on the performance of clustering algorithms. It is an independent comparison of a number of clustering algorithms, including MCL.
Distances between clusterings
The split/join distance (a distance between partitions, comparable to the (adjusted) Rand index, the variation of information, and the Mirkin distance) was compared to sixteen other distances and indexes in this paper from 2009: Adapting the right measures for K-means clustering, Wu et al., Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 877-886. The context is that of finding appropriate validation measures for K-means clustering when using a gold standard (an a priori desired partitioning). They recommend the use of both the normalised split/join distance (which they call, alas, the 'van Dongen criterion') and the normalised variation of information distance by Meila. These two are what I have always recommended as well. They also recommend the Rand index (which is equivalent to the Mirkin distance) as a 'complementary measure', but my cynical side thinks this is just to accommodate the common acceptance of this index. The Rand index is influenced by cluster size in dramatic and unfortunate ways, as explained in my answer to this stackexchange question.
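For concreteness, the split/join distance can be computed in a few lines. This is a minimal sketch; the normalisation by 2n follows one common convention, so consult the papers for the exact variant used there.

```python
# Split/join distance between two partitions of the same n elements:
# d(P, Q) = 2n - sum over A in P of max_B |A & B|
#              - sum over B in Q of max_A |A & B|.
# It is 0 for identical partitions and at most 2n - 2 in general.

def split_join(P, Q):
    """P, Q: partitions given as lists of sets over the same element set."""
    n = sum(len(A) for A in P)
    pq = sum(max(len(A & B) for B in Q) for A in P)
    qp = sum(max(len(A & B) for A in P) for B in Q)
    return 2 * n - pq - qp

P = [{1, 2, 3}, {4, 5}]
Q = [{1, 2}, {3, 4, 5}]
print(split_join(P, Q))            # 2
print(split_join(P, Q) / (2 * 5))  # normalised: 0.2
```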
Pinkhas Nisanov has implemented MCL for Perl using the Perl Data Language (PDL) in Algorithm::MCL. Use of PDL implies that the underlying matrix operations use natively compiled code rather than Perl bytecode; that is, it is fast.
With Cei Abreu-Goodger I wrote a book chapter, Using MCL to extract clusters from networks. It is part of the book Bacterial Molecular Networks, edited by Jacques van Helden, Ariane Toussaint and Denis Thieffry, in the series Methods in Molecular Biology. I am very pleased with how the chapter and the book have turned out, especially with the case studies in our chapter. The first is an analysis of the proteomes of 28 fully sequenced genomes of the order Rickettsiales. These are intracellular alphaproteobacteria with complex life cycles and highly streamlined genomes. The second is an analysis of E. coli gene expression across 466 different experimental conditions. Both studies show that clusterings can be informative at different levels of granularity.
For a long time now MCL has shipped with many siblings in what is intended to be a coherent set of network analysis tools. What this collection has lacked is a way of referring to it without creating confusion. This void is now filled, and MCL-edge is the new name. The unadorned name MCL will not go away, and the source code will still be shipped under a name of the format mcl-YY-DDD. MCL-edge will hopefully be instrumental in giving this web site a better structure. It is also the central topic of an application note submitted today, with the title MCL-edge: Analysis of networks with millions of nodes.
Comments on MCL. Community detection in graphs (S. Fortunato, Physics Reports, 2010) is a very (jaw-droppingly) comprehensive and well-written survey of clustering methods for graphs (in the physics and social network communities, clustering is commonly referred to as community detection, hence the title). I can wholeheartedly recommend it as a beautifully written and thorough review of the field. That said, I feel compelled to annotate two comments from it that concern MCL. They are:
A problem of the method is the fact that the final partition is sensitive to the parameter α used in the inflation step. Therefore several different partitions can be obtained, and it is not clear which are the most meaningful or representative.
The method is really simple to implement, which is the main reason of its success: as of now, the MCL is one of the most used clustering algorithms in bioinformatics.
In response: I have always seen the inflation parameter (α in the first quotation) as one of the strengths of MCL. It is the only parameter, and it controls cluster granularity without requiring or even allowing the number of clusters to be specified. Cluster structure can be present and make sense at different scales, so I consider this a feature, not a problem. It is true that the inflation parameter requires a choice to be made, and hence effort. The same is true of many other methods; it is the cost attached to having a choice.
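To make the role of the inflation parameter concrete, here is a sketch of the inflation operation on a single stochastic column: entries are raised to the power r and renormalised, so a larger r sharpens the flow distribution, which ultimately yields more and smaller clusters.

```python
# Inflation on one stochastic column: entrywise power then renormalise.
# With r > 1 the largest entries gain mass at the expense of the rest;
# r = 1 leaves the column unchanged.

def inflate(column, r):
    powered = [p ** r for p in column]
    total = sum(powered)
    return [p / total for p in powered]

col = [0.5, 0.3, 0.2]
print(inflate(col, 2.0))  # mass concentrates on the largest entry
print(inflate(col, 1.0))  # identity: [0.5, 0.3, 0.2]
```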
The second quotation is a bit peculiar. I do not know the reasoning by which the conclusion (popular) is drawn from the premise (easy to implement). First, there are many algorithms that are simple to implement, yet not as popular. Second, I propose that simplicity is in itself a good thing. The particular simplicity of MCL is also its beauty, namely that cluster structure is found as the imprint of an algebraic process with demonstrable emergent properties. Hence, simplicity and popularity are conceivably both correlated with a third trait, fitness. Third, I believe that a factor firmly in favour of MCL's popularity was and is simply the presence of a finished software implementation, freely available for use. Hundreds of algorithms are published each year, but hardly any of them come with implementations. Noteworthy exceptions include Restricted Neighbourhood Search Clustering (RNSC), the Louvain method, Affinity Propagation Clustering, and indeed MCL.
Faster mcl. The current release 10-201 has changes that considerably speed up MCL. The first is that the memory management code has been modified. This has solved cases where mcl inexplicably failed to progress on graphs of very large size. A more extensive description can be found here. The second change is that mcl now utilises vanilla matrix/vector multiplication where this is faster than sparse matrix/vector multiplication. It chooses between an O(k²+N) algorithm with a very low hidden constant and an O(k²) algorithm with a significantly higher hidden constant, picking the algorithm with the least expected run-time. Here k is a constant that bounds the number of resources allocated to a node in the graph, and N is the number of nodes.
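As an illustration of this kind of selection (with invented cost constants; mcl's actual tuning differs), one can compare the two expected costs and pick the cheaper algorithm:

```python
# Toy cost model for choosing a multiplication strategy. The constants
# c_cheap and c_costly are made up for illustration; the point is only
# that the O(k^2 + N) method has a low hidden constant and the O(k^2)
# sparse method a higher one, so which wins depends on k and N.

def choose_multiply(k, N, c_cheap=1.0, c_costly=6.0):
    """Pick the strategy with the least expected run-time."""
    cost_vanilla = c_cheap * (k * k + N)   # O(k^2 + N), low hidden constant
    cost_sparse = c_costly * (k * k)       # O(k^2), higher hidden constant
    return "vanilla" if cost_vanilla < cost_sparse else "sparse"

print(choose_multiply(k=100, N=10_000))     # vanilla: 20_000 < 60_000
print(choose_multiply(k=100, N=1_000_000))  # sparse: 1_010_000 > 60_000
```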
The speed gains range from slightly below twofold to sixfold depending on graph size and edge density. Sixfold speed increases may be obtained for graphs of size up to several hundred thousands of nodes. At the upper range a graph was tested with 2,668,573 nodes and 633,075,684 arcs (316,537,842 edges). It took 3.5 hours wall clock time on a system with 4 CPUs, compared to 5.25 hours wall clock time for previous releases, amounting to a 1.5-fold speed increase.
mcl-10-148 released. Notable changes include a slightly faster MCL, a much improved mcxarray (for creating networks from tabular data), and improved parallelisation of mcxarray, mcx diameter, mcx ctty (centrality) and mcx clcf (clustering coefficient). A host of generalised graph transformation options has been added to many programs, all using the exact same specification language. mcx query has a new option --node-attr to compute simple summary statistics of network node properties.
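The kind of transformation mcxarray performs, from tabular data to a network, can be sketched as follows (my illustration of the general idea, not mcxarray's actual method): compute a pairwise similarity such as Pearson correlation between rows, and keep the pairs scoring above a cutoff as weighted edges.

```python
import math

# Build a similarity network from tabular data: rows become nodes, and an
# edge is kept where the Pearson correlation of two rows meets the cutoff.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_graph(rows, cutoff=0.8):
    """rows: dict id -> list of measurements; returns (u, v, weight) edges."""
    ids = sorted(rows)
    return [(u, v, pearson(rows[u], rows[v]))
            for i, u in enumerate(ids) for v in ids[i + 1:]
            if pearson(rows[u], rows[v]) >= cutoff]

rows = {"g1": [1.0, 2.0, 3.0], "g2": [2.0, 4.0, 6.1], "g3": [3.0, 1.0, 2.0]}
print(correlation_graph(rows))  # a single edge: g1 - g2
```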
mcl-09-261 released. The previous release introduced a bug in the clustering interpretation code. If a clustering contained overlap (a rare event) mcl would, while splitting this overlap, separate all clusters not involved in overlap into singletons. The bug, now fixed, was reported by Tao Yue. The default action for mcl is to remove overlap; use -overlap keep to keep it.
Affinity propagation clustering and the Markov Cluster algorithm have finally been compared, in Markov clustering versus affinity propagation for the partitioning of protein interaction graphs, an article by Jim Vlasblom and Shoshana Wodak in BMC Bioinformatics.
As this is a first comparison it should not be taken as a final verdict, and it must be considered that different algorithms may have different strengths and weaknesses and different areas of applicability. That said, the results from the above paper favour MCL.
Our analysis shows that the MCL procedure is significantly more tolerant to noise and behaves more robustly than the AP algorithm. The advantage of MCL over AP is dramatic for unweighted protein interaction graphs, as AP displays severe convergence problems on the majority of the unweighted graph versions that we tested, whereas MCL continues to identify meaningful clusters, albeit fewer of them, as the level of noise in the graph increases. MCL thus remains the method of choice for identifying protein complexes from binary interaction networks.
Incidentally, there is a long list of papers in which clustering methods are proposed and validated (and shown to be superior) by comparing with, among others, MCL. In many cases I have doubts about the comparison methodology that was used, but the list is simply too long to attempt analysis. I will, however, gladly list and analyse any comparison performed by independent researchers, whatever the conclusion may be.
mcxdeblast in the 08-157 release contains a bug, and a new 08-312 release has been made. It will probably be followed up by a more thoroughly vetted release in the coming weeks. You can also just download a patched version of mcxdeblast.
mcl-08-157 released. From the Freshmeat announcement:
The mcl suite is moving towards a wider focus on general purpose large scale graph analysis, with the emphasis, besides clustering, on basic graph and clustering measures and transformations. The program mcxarray can now transform tabular gene expression data into graph input. The utility clm computes clustering coefficients, diameter and eccentricity, and betweenness centrality. Many fixes and improvements were made throughout.
A longer list of changes is given in the ChangeLog.
MCL was used in the April 2008 Nature publication Broad phylogenomic sampling improves resolution of the animal tree of life, and this was subsequently reported by the AMS math in the media column.
Ravi Tharakan first ran into trouble, then solved the issues, when installing MCL on Cygwin under Windows Vista. One issue is to set the 'Default text type' option to 'DOS / text' instead of 'Unix / binary' when you install Cygwin; then everything works fine, and you don't need to use d2u (a utility that converts DOS line endings to Unix line endings). Another is to make sure that the Cygwin DLL can be found via the PATH variable. If necessary, augment PATH to include the directory where cygwin1.dll resides.
My article Graph Clustering Via a Discrete Uncoupling Process was published (SIAM Journal on Matrix Analysis and Applications, Volume 30, Issue 1, pp. 121-141). It largely corresponds to the second technical report referenced below, except that it contains less material and benefited from many useful reviewer comments. Why was it published sooo long after my PhD thesis? The answer is a story of persistence and abandonment, alternating, repeatedly.
This clustering method was published in Science: Clustering by Passing Messages Between Data Points, Brendan J. Frey and Delbert Dueck, Science 16 February 2007, Vol. 315. no. 5814, pp. 972 - 976. The authors make a number of strong claims, including
.. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one hundredth the amount of time.
However, in their paper they mainly compare with the k-centers algorithm, which seems scant evidence to support this claim. Interestingly, there are very clear conceptual similarities between APC (affinity propagation clustering) and MCL, as detailed below. It will be interesting to see how APC and MCL compare, preferably at the hands of unbiased people (unlike myself), much as in this paper comparing cluster methods.
Their work starts from a message passing paradigm (the sum product algorithm) that generalizes a number of other algorithms, such as the Viterbi algorithm, Pearl's belief propagation algorithm, the iterative turbo decoding algorithm, and certain fast Fourier transform algorithms. The APC algorithm is different in that it incorporates loops, which makes it somewhat less amenable to a formal and rigorous analysis. Interestingly, just from the formulation of the algorithm, it possesses strong similarities to MCL. Its responsibility and availability update rules seem to represent reinforcement and recombination updates similar to the operations of inflation and expansion in MCL. In this respect both algorithms have a similar feel and can be classified as emergent algorithms.
When looking at a visualization of APC, the resemblance to MCL is quite striking, as can be seen from this MCL animation. It would be nice to see these two methods compared.
Evaluation of clustering algorithms for protein-protein interaction networks, Sylvain Brohée and Jacques van Helden, BMC Bioinformatics 2006, 7:488. This paper contains a four-way comparison of Markov Clustering (MCL), Restricted Neighborhood Search Clustering (RNSC), Super Paramagnetic Clustering (SPC), and Molecular Complex Detection (MCODE). Among the conclusions are that MCL is robust to graph alterations and that it performs favourably at the task of extracting complexes from interaction networks.
In the last Ensembl protein families analysis mcl was run on a graph built on 1,008,387 nodes (proteins). The process took a combined CPU time of 23 hours on 8 processors and ran in 7 hours wall clock time. Reading the matrix from the 1.7G file probably took a substantial amount of time.
As of mcl-06-058 a very small Perl mcl implementation called minimcl is shipped with the mcl distribution. Analyse it and tinker with it to gain a better understanding of mcl. minimcl is fully functional but of course very slow.
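In the same spirit as minimcl, the core loop can be sketched in a few lines of Python (my own sketch, not a port of the Perl code): columns are stored as dicts, expansion is matrix squaring, and inflation is entrywise powering followed by renormalisation.

```python
# Minimal MCL-style loop: alternate expansion (matrix squaring) and
# inflation until the iterand settles, then read clusters off the columns.

def normalise(col):
    total = sum(col.values())
    return {i: v / total for i, v in col.items()}

def expand(M):
    # Matrix square: new column j mixes the columns in j's current support.
    out = {}
    for j, col in M.items():
        acc = {}
        for k, w in col.items():
            for i, v in M[k].items():
                acc[i] = acc.get(i, 0.0) + w * v
        out[j] = acc
    return out

def inflate(M, r, eps=1e-6):
    # Entrywise power, prune near-zero entries, renormalise each column.
    return {j: normalise({i: v ** r for i, v in col.items() if v ** r > eps})
            for j, col in M.items()}

def mcl(graph, r=2.0, iters=30):
    # graph: node -> {neighbour: weight}; self-loops are added, as mcl does.
    M = {j: normalise({**col, j: 1.0}) for j, col in graph.items()}
    for _ in range(iters):
        M = inflate(expand(M), r)
    # Interpretation (assumes convergence): the support of a column is its
    # cluster, so columns with identical supports collapse to one cluster.
    clusters = {frozenset(col) for col in M.values()}
    return sorted(sorted(c) for c in clusters)

g = {1: {2: 1.0, 3: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {1: 1.0, 2: 1.0},
     4: {5: 1.0}, 5: {4: 1.0}}
print(mcl(g))  # [[1, 2, 3], [4, 5]]
```

Like minimcl, this is for understanding only: real mcl uses pruning schemes and threading that this sketch does not attempt.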
As of mcl-05-314 MCL is able to read in label input directly and return the clustering in terms of those labels. A very small example is the file cathat below.
It can be clustered like this:
The file out.cathat should now look like this