| Software | Download | Utilities | Modules | Packaged MCL | C/Java/Perl/Matlab |
| Documentation | Manual | FAQ | Other manuals | BLAST | |
| Writings | MCL publications | Introduction I | Introduction II | ||
| Visualization | MCL animation | Pictorial | |||
| Contact | Contact Stijn | ||||
The MCL algorithm is short for the Markov Cluster Algorithm, a fast and scalable unsupervised cluster algorithm for graphs based on simulation of (stochastic) flow in graphs. The algorithm was invented/discovered by Stijn van Dongen (that is, me) at the Centre for Mathematics and Computer Science (also known as CWI) in the Netherlands. The PhD thesis Graph clustering by flow simulation is centered around this algorithm, the main topics being the mathematical theory behind it, its position in cluster analysis and graph clustering, issues concerning scalability, implementation, and benchmarking, and performance criteria for graph clustering in general. The work for this thesis was carried out under supervision of Jan van Eijck and Michiel Hazewinkel. The thesis, technical reports, and preprints can be found in the reading section.
Applications
MCL has been applied in a number of different domains.
Currently the number of papers citing core MCL publications is
well over two hundred. Get a quick impression from Google Scholar
for the Enright/van Dongen/Ouzounis paper
or my thesis.
Other potential applications that come to mind are (computational) chemistry
and multi-level approaches in graph partitioning (c.q. VLSI design).
My implementation of the MCL algorithm is available from this page for download. Several other utilities for manipulating graphs, matrices, and clusterings, come with it. Manual pages in various formats are available. For quickly getting an idea of how the MCL algorithm operates, take a look at this MCL flow pictorial in the reading section, or even better, have a look at an animation of the MCL process.
The MCL algorithm is very fast, very scalable, and has a number of attractive properties causing it to deliver high-quality clusterings.
| • | The MCL algorithm simulates flow using (alternating) two simple algebraic operations on matrices. Its formulation is simple and elegant. There are no high-level procedural instructions for assembling, joining, or splitting of groups - cluster structure is bootstrapped via a flow process that is inherently affected by any cluster structure present. The first operation used by MCL is expansion, which coincides with normal matrix multiplication. Expansion models the spreading out of flow, it becoming more homogeneous. The second is inflation, which is mathematically speaking a Hadamard power followed by a diagonal scaling. Inflation models the contraction of flow, it becoming thicker in regions of higher current and thinner in regions of lower current. The MCL process causes flow to spread out within natural clusters and evaporate inbetween different clusters. This animated example of an MCL process may give you an impression of its modus operandi. | |
| • | By varying parameters, clusterings on different scales of granularity can be found. The number of clusters can not and need not be specified in advance, but the algorithm can be adapted to different contexts. | |
| • | The issue 'how many clusters?' is not dealt with in an arbitrary manner, but rather by strong internal logic. Cluster structure leaves its marks on the flow process simulated by the algorithm, and the flow parameters control the granularity of the cluster imprint. | |
| • | The limit of the MCL process (the process simulated by the algorithm) is in general extremely sparse, and the iterands are sparse in a weighted sense. This gives the means to scale the algorithm drastically, leading to a worst-case complexity of order Nk^2, where N is the number of nodes of the input graph, and where k is a threshold for the number of resources allocated per node. | |
| • | The rate of convergence of the MCL process, and projection of the iterands afterwards onto the resulting clustering, give hooks for unsupervised parameter adjustment. | |
| • | The iterands of the MCL process have structural properties which allow a cluster interpretation, and which generalize the mapping of MCL limits onto clusterings. The mathematics associated with the MCL process shows that there is an intrinsic relationship between the MCL process and cluster structure in graphs. This is very valuable given the many heuristic approaches in cluster analysis. |
For a full description of the MCL algorithm and process you are advised to read one of the technical reports in the reading section. It is also possible to view a somewhat longer introduction or an introduction to some of the mathematics associated with MCL.
The basic interface to the MCL algorithm is very simple - you need only one option (the -I flag) to get to the heart of it, and for large graphs you should also be aware of the the -scheme flag for regulating resources. The default approach is to vary the argument to -I over some interval (doing an mcl run for each value), and analyze the clustering output with the other utilities that come with MCL (cf the mcl manuals).
| • | 05 Jun 08 The mcl suite is moving towards a wider focus on general purpose large scale graph analysis, with the emphasis, besides clustering, on basic graph and clustering measures and transformations. The program mcxarray can now transform tabular gene expression data into graph input. The utility clm computes clustering coefficients, diameter and eccentricity, and betweenness centrality. Many fixes and improvements were made throughout. A longer list of changes is given in the ChangeLog. | |
| • | 10 May 08 | |
| • | 19 March 08 | |
| • | 20 February 08 | |
| • | 2 March 07 .. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one hundredth the amount of time. However, in their paper they mainly compare with the k-centers algorithm, which seems scant evidence to support this claim. Interestingly there are very clear conceptual similarities between APC (affinity propagation clustering) and MCL as detailed below. It will be interesting to see how APC and MCL compare, preferably by unbiased people (unlike myself) similar as in this paper comparing cluster methods. Their work starts from a message passing paradigm (the sum product algorithm) that generalizes a number of other algorithms, such as the Viterbi algorithm, Pearl's belief propagation algorithm, the iterative turbo decoding algorithm, and certain fast Fourier transform algorithms. The APC algorithm is different in that it incorporates loops, which makes it somewhat less amenable to a formal and rigorous analysis. Interestingly, just from the formulation of the algorithm, it possesses strong similarities to MCL. Its responsibility and availability update rules seem to represent reinforcement and recombination updates similar to the operations of inflation and expansion in MCL. In this respect both algorithms have a similar feel and can be classified as emergent algorithms. When looking at a visualization of APC the resemblances with MCL are quite striking as seen from this MCL animation. It would be nice to see these two methods compared. | |
| • | 13 Nov 06 | |
| • | 12 Sep 06 | |
| • | 27 Feb 06 | |
| • | 11 Nov 05 ---8<------8<------8<------8<------8<--- cat hat 0.2 hat bat 0.16 bat cat 1.0 bat bit 0.125 bit fit 0.25 fit hit 0.5 hit bit 0.16 --->8------>8------>8------>8------>8--- It can be clustered like this: mcl cathat --abc -o out.cathat The file out.cathat should now like like this ---8<------8<------8<------8<------8<--- cat hat bat bit fit hit --->8------>8------>8------>8------>8--- A number of examples that involve clustering similarity graphs encoded in BLAST results are found further below. |
None of the publications referenced on this page uses the phrase emergent algorithm to describe MCL. The criteria listed in the slim Wikipedia entry do all apply to MCL however.
| • | it achieves predictable global effects | |
| • | it does not require global visibility | |
| • | it does not assume any kind of centralized control | |
| • | it is self-stabilizing |
This connection does not suddenly further our understanding of the MCL algorithm in new ways, but it is certainly more profound than a lazy buzzword flip. Emergent algorithms have a place in science, and the listed criteria are important characteristics of MCL. The algorithm can be implemented in a distributed network, if applied to sparse graphs. This is implied by the locality of the operations employed and embodies the second criterion.
An important characteristic of MCL that is perhaps not common in the class of emergent algorithms is that it is fully deterministic and to some extent amenable to mathematical analysis.
If you use this software in writing scientific papers, or you use this software in any other medium serving scientists or students (e.g. web-sites, CD-ROMs) include proper reference to mcl's home on http://micans.org/mcl/ and its author, Stijn van Dongen. Also include at least one of the following citations - note that copies of the corresponding publications can be retrieved via the links given.
| • |
Stijn van Dongen, Graph Clustering by Flow Simulation.
PhD thesis, University of Utrecht, May 2000. | |
| • |
Stijn van Dongen. A cluster algorithm for graphs.
Technical Report INS-R0010, National Research Institute for Mathematics and
Computer Science in the Netherlands, Amsterdam, May 2000. | |
| • |
Stijn van Dongen. A stochastic uncoupling process for graphs.
Technical Report INS-R0011, National Research Institute for Mathematics and
Computer Science in the Netherlands, Amsterdam, May 2000. | |
| • |
Stijn van Dongen. Performance criteria for graph clustering and Markov
cluster experiments. Technical Report INS-R0012, National Research
Institute for Mathematics and Computer Science in the Netherlands,
Amsterdam, May 2000. | |
| • | Stijn van Dongen. Graph clustering via a discrete uncoupling process. Siam Journal on Matrix Analysis and Applications 30-1, p121-141, 2008. |
For biological applications, it is appropriate to cite in addition the following paper, in which the application of MCL to biological graphs was first proposed.
| • | Enright A.J., Van Dongen S., Ouzounis C.A. An efficient algorithm for large-scale detection of protein families, Nucleic Acids Research 30(7):1575-1584 (2002). |
Download a a ready-to-install tarball containing my GPL-ed MCL implementation in C. In case you wonder, GNU GPL stands for GNU General Public License. It implies that the source code is available, and you have the freedom to modify the code to your needs. If you pass parts or all of the code, compiled parts of the code, or a derived product on to others, you have to make the code available to them on the same conditions. For more information about GNU software and the GNU philosophy go hither.
If you start actually using the software, I would very much appreciate feedback, particularly about how the MCL algorithm works for you.
Manual pages in html, troff, and ps are part of the distribution (and can be automatically installed), and they are also available from this site. Several other utilities are part of the mcl distribution, they are described below.
Many thanks go to Joost van Baal who packages this mcl implementation for Debian and autofooled MCL. This implies that the tarball pointed to above should install on virtually any UNIX flavour.
The software is provided on an 'as is' basis. The implementation is fast (one of the aims in its design), and is grounded on a lean sparse matrix/vector architecture. It is command-line based and the source code documentation is also lean and sparse.
There is a meta manual page listing the different applications and their use.
LIGHTWEIGHT
As of the 05-314 release clustering from BLAST results has been made very
easy by mcl's newly acquired ability to read in label-type input.
This page contains several examples and a few more are found
in the mcl manual.
CLASSIC
The BLAST module is the first of what should become a collection of
pluggable modules, for bioinformatics as well as for other application
fields. It is basically a very small wrapper (mclblastline) in which a
dedicated parser (mcxdeblast) is integrated with mcxassemble, mcl, and
clmformat. Of course you need not necessarily use the pipeline script;
simply using the constitutent applications should be straightforward. In
this respect, the mclblastline option --whatif is very useful. The
module is enabled by adding --enable-blast to the configure options,
and manual pages for mclblastline, mcxdeblast, mcxassemble and mclpipeline
(implementing the generic mcl pipeline) are shipped with the distribution.
Those manual pages can be viewed here as well; find them on the
index of manual pages.
You are advised to no longer use the TribeMCL module, as it has been superseded by the approaches given above. For biological applications it is of course appropriate to cite the Enright/van Dongen/Ouzounis paper as indicated in the download section.
Debian mcl package created by Joost van Baal. Kari Pahula backported mcl-06-021 to Debian stable.
OpenBSD mcl package created by Andreas Kähäri. (I am not sure if the given URL is the best available link, but openBSD people surely know their way around).
Dan Thomasset has created source and binary i386 RPMs for mcl-06-058. Retrieve the i386 RPM and the source RPM from the Stowers Institute. You can also get the spec file created by Dan.
The implementation available from this page is written in C and aims for maximum speed and scalability. The next release or the one after that will feature mcl as a C library call including an interface for building the input graph. This should make the implementation available for use within Java with JINI.
minimcl is written in perl and solely exists for educational purposes.
Gregor Heinrich has written a Java implementation of mcl. The zip file includes a 50-line mcl implementation in matlab, useful as a testbed and proof of concept. Find documentation here.
This section documents a lighweight close-to-the-metal approach to clustering from BLAST results. Several command-line tools are invoked in succession and this may look scary if you have not been exposed to this kind of interaction before. Remember however that these are just simple recipes that should be easy to master. Get in contact if you are stuck. Alternatively, one may use mclblastline which is again a command-line tool. It encapsulates a number of other command-line tools and may be slightly less intimidating.
In the examples below the input is either the file xyz.blast in default blast format or the file xyz.cblast in columnar blast format obtained with the blast -m8 or -m9 options.
1. mcxdeblast --line-mode=abc --out=- xyz.blast | mcl - --abc -o xyz.clus
2. mcxdeblast --m9 --line-mode=abc --out=- xyz.cblast | mcl - --abc -o xyz.clus
3. cut -f 1,2,11 xyz.cblast |\
mcl - --abc --abc-neg-log -abc-tf 'mul(0.4343), ceil(200)' -o xyz.clus
All the lonely hyphens livening up the examples serve to seal the communication between mcxdeblast (the parse program) and mcl (the clustering program). The abc options indicate that mcxdeblast sends and mcl both receives and sends data in simple label-type format. Read more in the manual.
The first two examples use the blast parser mcxdeblast with respectively default blast format and columnar blast format. In these examples mcxdeblast log-transforms the blast e-values and sets the largest allowed transformed value to 200 corresponding with a lowest allowed e-value of 1e-200.
In the third line the columnar format is parsed by the standard UNIX cut(1) utility. The columnar format must have been created with the BLAST -m8 option so that no comment lines are present (alternatively these can be filtered out using grep of course). Now mcl has to apply the log-transformation itself, as specified with --abc-neg-log. The log transform is done with the natural logarithm; the transformation to the customary 10-based logarithm is achieved by the multiplication step mul(0.4343). The value is finally trimmed to the largest allowed value with ceil(200). The entire string 'mul(0.4343), ceil(200)' is argument to the transformation option -abc-tf.
It is possible to do all the transformations just once and cache the results. In each of the following examples two files xyz.mci and xyz.tab are created from which mcl can directly proceed.
1. mcxdeblast --line-mode=abc --out=- xyz.blast | mcxload -abc - --stream-mirror -write-tab xyz.tab -o xyz.mci
2. mcxdeblast --m9 --line-mode=abc --out=- xyz.cblast | mxload -abc - --stream-mirror -write-tab xyz.tab -o xyz.mci
3. cut -f 1,2,11 xyz.cblast |\
mcxload -abc - --stream-neg-log -stream-tf 'mul(0.4343), ceil(200)' --stream-mirror\
-write-tab xyz.tab -o xyz.mci
This may look intimidating, but inspection reveals that these are recipes in which the file name stem is the only thing that needs changing between different experiments. The --stream-mirror option has to be supplied to tell mcxload that it needs to ensure that the output graph is undirected. Since mcl will do this automatically no similar option was present in the previous examples. To obtain clusterings from xyz.mci and xyz.tab one proceeds as follows.
mcl xyz.mci -use-tab xyz.tab -I 1.4 mcl xyz.mci -use-tab xyz.tab -I 2 mcl xyz.mci -use-tab xyz.tab -I 3 mcl xyz.mci -use-tab xyz.tab -I 4 mcl xyz.mci -use-tab xyz.tab -I 5
This example shows a few clustering runs with different inflation values. The -o option is not used, so mcl will create meaningful and unique output names by itself.