MCL speed and memory

As stated on the characteristics page, the MCL implementation available on these pages has a complexity of $O(N k$ ²), where $N$ is the number of nodes in the graph, and $k$ is the number of resources allocated per node. This number can be chosen surprisingly low without affecting clustering quality. The reason is that MCL computes a localized process, and consequently it is possible to implement a pruning regime that takes advantage of this. The complexity is achieved by using a sparse matrix implementation. The maximum number of resources allocated per node directly translates to the maximum number of nonzero entries kept per column. For more information, refer to the FAQ.

There are several ways to speed up MCL clustering. All of these, except the threads approach, act by reducing the size of the data structures employed, and hence also reduce memory demands.

threads

The program can utilise multiple CPUs or cores by creating so-called threads. Ideally you use as many threads as you have CPUs or cores. The number of threads is specified with the -t option. Make sure that you do in fact have access to as many cores. In the LSF job submission system this is achieved with the bsub -n option. With T threads the speedup achieved will be approximately f T where f is a factor in the range [0.5-1].

resource allocation

If speed is of primary concern, lowering the -scheme option is a good idea. By default it is set to a conservative value of 7, implying that the program will compute a very accurate process at the cost of speed. Lower values, starting with 6, will speed up the program considerably by employing more aggressive pruning strategies. The resulting clusterings may change slightly, but will be consistent with clusterings resulting from a more accurate process. In short, the perturbation effect of the pruning process applied by mcl is a source of noise. It is small compared to the effect of changing the inflation parametrization or perturbing the edge weights. If the change is larger, this is because the computed process tends to converge prematurely, leading to finer-grained clusterings. As a result the clustering will be close to a subclustering of the clustering resulting from more conservative resource settings, and in that respect be consistent. All this can be measured using the program clm dist. It is possible to offset such a change by slightly lowering the inflation parameter.

As an alternative to -scheme one can use -resource <num>. The number <num> determines the number of resources each network node is allowed during the process. Effectively this is the maximum number of neighbours maintained for each node during the computation of successive graphs in the MCL process. In the default settings this number is approximately 1200 — very high. A very low value would be 100.

compilation options

When configuring, add compiler optimisation flags. For example:

./configure --prefix=<YOUR-PREFIX> CC=gcc CFLAGS='-O3'

where <YOUR-PREFIX> could be $HOME/local or /usr/local for example — note that the latter will not work if you do not have write permission to that part of the file system.

preprocessing

Perhaps not all edges in the graph are fully informative. If you have many edges per node (say a few hundred) it may be worthwhile reducing them. This can be done with the mcl -tf option. Two examples are:

-tf 'gq(10), #knn(100)' -tf 'gq(10), #ceilnb(200)'

In both cases, the transformation gq(10) removes any edges of weight below 10. This transformation is independent and can be left out if desired. Of course, any number can be chosen instead of 10. After that, in the first case (k-Nearest-Neighbours) edges are only kept that occur in the list of top-100 edges of highest weight in both of its incident nodes. In the second case the maximum node degree is imposed to be 200. Edges from nodes of highest degree are pruned first, proceeding until all node degrees satisfy the given threshold.

preprocessing 2

Perhaps there is not sufficient contrast in the edge weights. If the edge weights vary in the range [0.9-1] (perhaps because they are correlations) the edge weights are hardly informative to mcl. In this case the following transformation is appropriate.

-tf 'add(-0.9)'

It is possible to combine transformations, such as

-tf 'add(-0.9), #knn(100)'

If you have unfiltered correlation values, filtering at a permissive threshold of 0.7 followed by shifting and edge reduction is achieved as follows:

preprocessing 3

Perhaps there is sufficient contrast in the edge weights, but the graph is very densely connected and shifting edge weights is not applicable. It may be beneficial to further increase the contrast. This is done using the -pi option (pre-inflation). Roughly speaking, the setting -pi 2 increases the contrast by a factor of two.