Mermaid identifies sequence similarity among the sequences in a given FASTA file. It uses an extremely simple approach: it takes all distinct k-mers from a given sequence and searches all other sequences for occurrences of them. Mismatches can be allowed by specifying a maximum edit distance. There are also options for weeding out overrepresented k-mers, arising e.g. from repeats or low-complexity sequence.
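The k-mer collection step can be illustrated with a small Python sketch. This is not mermaid's code. The function name distinct_kmers and the max_count parameter are hypothetical, and dropping k-mers that occur more than max_count times in the source sequence is only one plausible reading of "overrepresented".

```python
from collections import Counter


def distinct_kmers(seq, k, max_count=None):
    """Return the distinct k-mers of seq, in order of first appearance.

    max_count is a hypothetical filter, not a mermaid option: k-mers that
    occur more than max_count times in seq are dropped.
    """
    # All overlapping windows of length k (empty if seq is shorter than k).
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)

    seen = set()
    result = []
    for kmer in kmers:
        if kmer in seen:
            continue
        seen.add(kmer)
        if max_count is None or counts[kmer] <= max_count:
            result.append(kmer)
    return result
```

For example, distinct_kmers("ACACACGT", 2) yields AC, CA, CG and GT, while passing max_count=2 drops AC, which occurs three times.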
Mermaid returns a profile of sequence similarity by assigning to each position in sequence A the number of times it was included in a match from a k-mer in sequence B.
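A minimal sketch of such a profile, assuming exact matches only (no edit distance) and no k-mer filtering. The function name similarity_profile is hypothetical and the code is an illustration, not mermaid's implementation.

```python
def similarity_profile(seq_a, seq_b, k):
    """Count, for each position of seq_a, the k-mer matches covering it.

    An occurrence in seq_a of any distinct k-mer of seq_b adds 1 to each
    of the k positions it spans. Exact matching only.
    """
    profile = [0] * len(seq_a)
    # Distinct k-mers of seq_b; a set makes each lookup cheap.
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    # Checking every window of seq_a against the set gives the same counts
    # as searching seq_a for each distinct k-mer of seq_b in turn.
    for start in range(len(seq_a) - k + 1):
        if seq_a[start:start + k] in kmers_b:
            for pos in range(start, start + k):
                profile[pos] += 1
    return profile
```

With seq_a = "ACGTTT", seq_b = "GGACGT" and k = 3, the windows ACG and CGT of seq_a match, giving the profile [1, 2, 2, 1, 0, 0].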
One use for this profile is to infer a measure of sequence similarity that is not hampered by sequence rearrangements.
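The text does not say which measure is derived from the profile. As one assumed possibility, the fraction of positions covered by at least one match does not depend on where in the other sequence the matching k-mers came from, so it is unaffected by rearrangements. The function name covered_fraction is hypothetical.

```python
def covered_fraction(profile):
    """Fraction of positions with a nonzero count in a similarity profile.

    profile is a list of per-position match counts. This is an assumed
    example measure, not one defined by mermaid.
    """
    if not profile:
        return 0.0
    covered = sum(1 for count in profile if count > 0)
    return covered / len(profile)
```

For the profile [1, 2, 2, 1, 0, 0], four of six positions are covered, so the measure is 4/6.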
Clearly, mermaid's time consumption is quadratic: it grows with the product of the lengths of the sequences being compared. Perhaps the future will bring a more sophisticated approach.
Mermaid depends on the TRE regex library written by Ville Laurikari and on the tingea utility library.
Install TRE and tingea by unpacking them and issuing
    ./configure --prefix=<YOUR-PREFIX> --enable-static
    make
    make install

--enable-static can be omitted. Configure and install mermaid as follows.
    ./configure --prefix=<YOUR-PREFIX> CFLAGS="-I<YOUR-PREFIX>/include" LDFLAGS="-L<YOUR-PREFIX>/lib"
    # EXAMPLE invocation:
    # ./configure --prefix=$HOME/local CFLAGS="-static -I$HOME/local/include" LDFLAGS="-L$HOME/local/lib"
    make
    make install

There is much todo about everything. A customizable output format and customizable rules for exception positions (such as those customarily denoted by X and N) come to mind.