mermaid

Mermaid identifies sequence similarity among the sequences in a given FASTA file. Its approach is extremely simple: it takes all distinct k-mers from a given sequence and searches all other sequences for occurrences of them. Mismatches can be allowed by specifying a maximum edit distance. There are also options for weeding out overrepresented k-mers, arising e.g. from repeats or low-complexity sequence.
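
The search step described above can be illustrated with a minimal Python sketch. It is not mermaid's implementation: it handles exact matches only (no edit distance, no filtering of overrepresented k-mers), and the function names distinct_kmers and find_occurrences are hypothetical, chosen for this illustration.

```python
def distinct_kmers(seq, k):
    """Return the set of distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def find_occurrences(kmer, target):
    """Return every 0-based start offset at which kmer occurs exactly in target.

    Overlapping occurrences are reported as well.
    """
    hits = []
    start = target.find(kmer)
    while start != -1:
        hits.append(start)
        start = target.find(kmer, start + 1)
    return hits


# Toy sequences, made up for this example.
query = "ACGTAC"
target = "TTACGTACGT"
for kmer in sorted(distinct_kmers(query, 4)):
    print(kmer, find_occurrences(kmer, target))
```

Using a set for the k-mers means each distinct k-mer is searched for only once, however often it occurs in the query.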

Mermaid returns a profile of sequence similarity by assigning to each position in sequence A the number of times it was included in a match from a k-mer in sequence B.
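
As a rough illustration of such a profile, the following Python sketch counts, for every position in sequence A, how many exact matches of distinct k-mers from sequence B cover it. This is an assumption-laden simplification rather than mermaid's own code: matching is exact, and the name similarity_profile is hypothetical.

```python
def similarity_profile(seq_a, seq_b, k):
    """For each position of seq_a, count the exact k-mer matches from seq_b covering it."""
    profile = [0] * len(seq_a)
    # Distinct k-mers of B; each one is searched for once.
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    for kmer in kmers_b:
        start = seq_a.find(kmer)
        while start != -1:
            # Every position covered by this match gets one more count.
            for pos in range(start, start + k):
                profile[pos] += 1
            start = seq_a.find(kmer, start + 1)
    return profile


# Toy sequences, made up for this example.
print(similarity_profile("TTACGTACGT", "ACGTAC", 4))
```

Positions that lie in the middle of a long shared stretch are covered by several overlapping k-mers and therefore receive higher counts than positions at its edges.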

One use for this profile is to infer a measure of sequence similarity that is not hampered by sequence rearrangements.
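
One possible measure of this kind, shown here purely as an assumption and not as something mermaid itself computes, is the fraction of positions in sequence A that are covered by at least one match. Because it ignores where in sequence B a match came from, reordering blocks of B leaves it largely unchanged. The function name coverage_fraction is hypothetical.

```python
def coverage_fraction(profile):
    """Return the fraction of positions with a non-zero count in a similarity profile."""
    if not profile:
        return 0.0
    covered = sum(1 for count in profile if count > 0)
    return covered / len(profile)


# A made-up profile: 8 of the 10 positions are covered by at least one match.
print(coverage_fraction([0, 0, 1, 2, 3, 3, 3, 2, 1, 1]))
```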

Mermaid's running time is quadratic: comparing two sequences takes time proportional to the product of their lengths. Perhaps the future will bring a more sophisticated approach.

Mermaid depends on the TRE regex library written by Ville Laurikari and on the tingea utility library.

Install TRE and tingea by unpacking each of them and issuing:

./configure --prefix=<YOUR-PREFIX>  --enable-static
make
make install

--enable-static can be omitted. Configure and install mermaid as follows.

./configure  --prefix=<YOUR-PREFIX>  CFLAGS="-I<YOUR-PREFIX>/include"  LDFLAGS="-L<YOUR-PREFIX>/lib"

# EXAMPLE invocation:
# ./configure --prefix=$HOME/local CFLAGS="-static -I$HOME/local/include" LDFLAGS="-L$HOME/local/lib"

make
make install

Download

TODO

There is much todo about everything. A customizable output format and customizable rules for exception positions (such as those customarily denoted by X and N) come to mind.