Thesis
Reports
Articles
Flow pictorial
Third party publications
Back to the main MCL section

Graph Clustering by Flow Simulation

PhD thesis by Stijn van Dongen


The work for this thesis was carried out at the CWI, the Dutch National Research Institute for Mathematics and Computer Science, under supervision of Jan van Eijck and Michiel Hazewinkel. The thesis is centered around a fast, scalable, deterministic cluster algorithm for graphs (called the Markov Cluster algorithm or MCL algorithm) which is motivated by a heuristic formulated in terms of stochastic flow. The algorithm itself is defined in terms of stochastic matrices (Markov matrices) and two operators defined on them.

See also this ERCIM article.

Smashing evidence regarding the wisdom of using the string "MCL" to refer to the Markov Cluster Process and Algorithm.

This mcl implementation is still being hacked and improved (so I hope) by me. The goal for this MCL implementation is to be as fast as possible, and to be able to handle as wide a range of graphs as possible. If the diameters of the natural clusters are not too large, then this implementation should be able to satisfactorily handle arbitrarily large graphs as long as you've got the memory - requirements are in the order of kN, where N is the number of nodes of your graph, and where k is the pruning constant used, typically in the range 300 to 800 on PC's or workstations.

Thesis

Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.

Download a gzipped PDF of the thesis (3.5Mb, unpacked 4.5Mb). The University of Utrecht publishes the thesis as well. They host a PDF of each separate chapter, plus the whole shebang in one piece as well. The PS file is unfortunately only useful if you have Lucida fonts installed on your system. Few people have them, so I withhold the PS source - mail me if you are interested. The thesis was written in LaTeX, and I did the uttermost to avoid that dreary default LaTeX look. This included using a different font set - it is one of the (few) shortcomings of (La)TeX that there aren't that many fonts around to choose from. But otherwise (La)TeX did a great job, and the PDF source seems to be pretty small for a 174 page thesis [is it? I am not so sure anymore]. A second reason for this modest size is that nearly all figures came from home-brewn PostScript. Some remarks on this are made on my PostScript fragment, and two examples are found below.

Reports

The meat of the thesis [constists of / was cut into] three parts which can be found at CWI's report repository. The references and CWI links are:

Stijn van Dongen. A cluster algorithm for graphs. Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. Download.

Stijn van Dongen. A stochastic uncoupling process for graphs. Technical Report INS-R0011, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. Download.

Stijn van Dongen. Performance criteria for graph clustering and Markov cluster experiments. Technical Report INS-R0012, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. Download.

Local copies are A cluster algorithm for graphs, A stochastic uncoupling process for graphs, and Performance criteria for graph clustering and Markov cluster experiments.

Articles

The MCL algorithm has been successfully applied in the field of protein family detection. I co-authored a paper with Anton Enright and Christos Ouzounis. It was published as

Enright A.J., Van Dongen S., Ouzounis C.A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research 30(7):1575-1584 (2002).

Flow pictorial

The graph in the upper left below is the two-step graph derived from a graph which was used as the running example in the article by J. Falkner, F. Rendl, and H. Wolkowicz: A computational study of graph partitioning, Mathematical Programming, 66, (1994), pp. 211-239.

The grey level of a bond between two nodes indicates the maximum amount of flow taken over the two directions. The grey level of a node indicates the total amount of incoming flow. A dark bond between a white node and a dark node thus indicates that all flow is going from the white node in the direction of the dark node. It is seen that flow is eventually separated into different regions, yielding a cluster interpretation of the initial graph.

Flow pictorial

Third party publications

Other people considered and/or used MCL/mcl (respectively the algorithm in abstracto and this implementation. I used to maintain a collection of references and links, but have given up. For a quick overview, try the Google Scholar search suggested on the main page.

Back to the main MCL page.