diagonalsw


Table of Contents

1. Introduction
2. Download
3. Installation
3.1. Installation with prebuilt package
3.2. Building from source
3.2.1. Building from source on Unix
4. Usage
4.1. diagonalsw command line options
4.2. diagonalsw input file formats
4.3. Examples
5. File formats
5.1. Fasta format
5.2. Profile file format
5.3. Scoring matrix file format
6. Implementation details
6.1. Rearranging the column vectors into diagonal vectors
6.1.1. SSE version
6.1.2. Altivec version
6.2. Parallelization
7. Performance - benchmarked computation speed
8. Related projects
8.1. Striped Smith-Waterman by Farrar
8.2. swps3
8.3. ssearch in the fasta software package
References

List of Tables

1. diagonalsw input file formats

List of Examples

1. diagonalsw with query in fasta format
2. diagonalsw with specified gap penalties
3. diagonalsw with query in profile file format
4. diagonalsw with benchmarking
5. diagonalsw with verbose benchmarking
6. db.fasta, an example file in fasta format
7. 1R69.psi, an example file in the profile file format
8. blosum50.mat, an example file in the Scoring matrix file format

1. Introduction

diagonalsw is a software that implements the Smith-Waterman algorithm with the SIMD instruction sets found in modern CPUs ( SSE4.1 for the x86 platform and AltiVec for the PowerPC platform ). The software is open source software. It is written in the programming languages C and C++ and it is licensed under the MIT license.

The primary URL for this document is http://diagonalsw.sourceforge.net.

2. Download

Download the software from the sourceforge project page. The latest version of diagonalsw is 0.9.0.

3. Installation

3.1. Installation with prebuilt package

As Intel Intel® Threading Building Blocks (TBB) version 2.2 is rather new and has not been packaged yet for all platforms, building from source code is the recommended way right now ( 2009-11-26 ). Try "make package" if you want to start building your own rpm or deb package. Related to the packaging of diagonalsw is the FAQ ( Is there a version of TBB that provides statically linked libraries? ).

3.2. Building from source

3.2.1. Building from source on Unix

To build diagonalsw on Unix ( e.g. Linux, MacOSX, CygWin ) you need to have this installed

To install TBB run something like this:

$ tar xfz /scratch/tbb22_20091101oss_src.tgz 
$ cd tbb22_20091101oss/
$ less README 
$ make

If you have installed Intel Intel® Threading Building Blocks (TBB) into a non-standard location, please first source the tbbvars.sh, in other words something like this

$ . /home/myuser/tbb22_20091101oss/build/linux_intel64_gcc_cc4.3.3_libc2.9_kernel2.6.28_release/tbbvars.sh

If you have the diagonalsw source code in the directory /tmp/diagonalsw-0.9.0 and you want to install diagonalsw into the directory /tmp/install, and you have installed gengetopt in the non-default location ~/gengetopt/inst/bin/, then first run cmake then make and then make install

$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/tmp/install -DCMAKE_PREFIX_PATH=~/gengetopt/inst/bin/ /tmp/diagonalsw--0.9.0  && make && make install
-- The C compiler identification is GNU
-- Check for working C compiler: /usr/bin/gcc
-- Check for working C compiler: /usr/bin/gcc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/build
[  5%] Generating smith_waterman_sse_word.2loadEFH.c
[ 10%] Generating cmdline_diagonalsw.c, cmdline_diagonalsw.h
[ 15%] Generating smith_waterman_sse_byte.2load_HFsave8thnormal.c
[ 20%] Generating smith_waterman_sse_word.2loadnormal.c
[ 25%] Generating smith_waterman_sse_byte.2load_HFsave8thH.c
[ 30%] Generating smith_waterman_sse_word.2loadH.c
[ 35%] Generating smith_waterman_sse_byte.2load_HFsave8thEFH.c
Scanning dependencies of target diagonalsw
[ 40%] Building C object src/c/CMakeFiles/diagonalsw.dir/read_matrix.c.o
[ 45%] Building C object src/c/CMakeFiles/diagonalsw.dir/read_sequence.c.o
[ 50%] Building C object src/c/CMakeFiles/diagonalsw.dir/diagonalsw.c.o
[ 55%] Building C object src/c/CMakeFiles/diagonalsw.dir/cmdline_diagonalsw.c.o
[ 60%] Building C object src/c/CMakeFiles/diagonalsw.dir/create_profile.c.o
[ 65%] Building C object src/c/CMakeFiles/diagonalsw.dir/smith_waterman_reference_impl.c.o
[ 70%] Building C object src/c/CMakeFiles/diagonalsw.dir/smith_waterman_sse_byte.2load_HFsave8thnormal.c.o
[ 75%] Building C object src/c/CMakeFiles/diagonalsw.dir/smith_waterman_sse_word.2loadnormal.c.o
[ 80%] Building C object src/c/CMakeFiles/diagonalsw.dir/smith_waterman_sse_byte.2load_HFsave8thH.c.o
[ 85%] Building C object src/c/CMakeFiles/diagonalsw.dir/smith_waterman_sse_word.2loadH.c.o
[ 90%] Building C object src/c/CMakeFiles/diagonalsw.dir/smith_waterman_sse_byte.2load_HFsave8thEFH.c.o
[ 95%] Building C object src/c/CMakeFiles/diagonalsw.dir/smith_waterman_sse_word.2loadEFH.c.o
[100%] Building C object src/c/CMakeFiles/diagonalsw.dir/sse_funcs.c.o
Linking C executable diagonalsw
[100%] Built target diagonalsw

If you want to build the html documentation ( i.e. this page ) you need to pass the -DBUILD_DOCUMENTATION=ON option to cmake.

4. Usage

4.1. diagonalsw command line options

Type diagonalsw --help to see the command line options

[user@linux ~]$ diagonalsw --help
diagonalsw 0.9.0

A high-performance version of the Smith-Waterman alignment algorithm

Usage: diagonalsw [OPTIONS]...

  -h, --help                    Print help and exit
      --version                 Print version and exit
  -d, --databasefile=filename   A FASTA-formatted sequence database file
  -q, --queryfile=filename      A single sequence file in FASTA-format
  -p, --profilefile=filename    A profile file
  -s, --matrixfile=filename     Integer similarity matrix file ( scoring matrix 
                                  )
  -i, --gapopen=INT             Penalty for starting alignment gap 
                                  (non-positive integer)  (default=`-10')
  -e, --gapextend=INT           Penalty for extending alignment gap 
                                  (non-positive integer)  (default=`-2')
  -u, --print-sequence-id       In addition to the score, also print the 
                                  sequence-id  (default=off)
  -V, --verify-algorithm        verify vector implementation against scalar 
                                  implementation  (default=off)
  -D, --disable-threading       disable all threading  (default=off)
  -t, --nr-of-threads=INT       numbers of threads (normally autodetected)
  -T, --tokens-per-thread=INT   numbers of tokens per thread (normally 
                                  autodetected)  (default=`5')
  -W, --sequences-per-compute-chunk=INT
                                nr of sequences each thread are given per 
                                  computation iteration ( doesn't influence the 
                                  result but may influence the speed )  
                                  (default=`10')

benchmarking:
  To get a more fair benchmark you should run a computation that lasts at least 
  some minutes on a machine that is not under load by other processes.
  -b, --benchmark               print computation speed (unit is GigaCells/s)  
                                  (default=off)
  -B, --benchmark-verbose       print computation speed (unit is GigaCells/s) 
                                  and additional information such as user time, 
                                  number of db sequences that needed to use the 
                                  word sized function, length of querysequence, 
                                  sum of db sequence lengths, number of db 
                                  sequences  (default=off)

limits:
  the default limits should work fine in most cases. Warning: Increasing these 
  values will increase the memory consumption of the program.
  -K, --max-sequence-name-length=INT
                                Maximum allowed length of a sequence name  
                                  (default=`256')
  -L, --max-query-sequence-length=INT
                                Maximum allowed length of the query sequence  
                                  (default=`2000')
  -M, --max-db-sequence-length=INT
                                Maximum allowed length of any of the db 
                                  sequences  (default=`10000')

4.2. diagonalsw input file formats

Table 1. diagonalsw input file formats

file formatshort optiondescription
fasta format-qquery file in fasta format Section 5.1, “Fasta format”
fasta format-ddatabase file in fasta format Section 5.1, “Fasta format”
profile format-pquery file in profile format Section 5.2, “Profile file format”
scoring matrix file format-sSubstitution matrix ( Scoring matrix ) Section 5.3, “Scoring matrix file format”


4.3. Examples

Example 1. diagonalsw with query in fasta format

We use the file described in Example 6, “db.fasta, an example file in fasta format” as database file and we use the file described in Example 8, “blosum50.mat, an example file in the Scoring matrix file format” as scoring matrix.

[user@linux ~]$ cat query.fasta
>MyInterestingProtein
EHIATYYNDQMLKKPTWYVBZX

[user@linux ~]$ diagonalsw -q query.fasta -d db.fasta -s blosum50.mat
146
135
68


Example 2. diagonalsw with specified gap penalties

This example is the same as Example 1, “diagonalsw with query in fasta format” but with gap penalties specified. The penalty for starting an alignment gap was chosen to be -8 and the penalty for extending an alignment gap was chosen to be -1.

[user@linux ~]$ diagonalsw -q query.fasta -d db.fasta -s blosum50.mat -i -8 -e -1
146
135
74


Example 3. diagonalsw with query in profile file format

[Note]Note

Specifying query in profile format is not implemented yet. The function read_profile() needs to be written.

We use the file described in Example 7, “1R69.psi, an example file in the profile file format” as the query. We use the file described in Example 6, “db.fasta, an example file in fasta format” as database file and we use the file described in Example 8, “blosum50.mat, an example file in the Scoring matrix file format” as scoring matrix.

[user@linux ~]$ diagonalsw -p 1R69.psi -d db.fasta -s blosum50.mat
error: Sorry. The profile parsing function read_profile() is not yet implemented


Example 4. diagonalsw with benchmarking

This example is the same as Example 1, “diagonalsw with query in fasta format” but with benchmarking. The computation speed Gigacells/second is printed.

[user@linux ~]$ diagonalsw -q query.fasta -d db.fasta -s blosum50.mat -b
0.000783

Note, db.fasta is too small ( containing only 3 sequences ) to give a fair result for Gigacells/second. See also Section 7, “Performance - benchmarked computation speed”.


Example 5. diagonalsw with verbose benchmarking

This example is the same as Example 1, “diagonalsw with query in fasta format” but with verbose benchmarking. Benchmarking information is printed.

[user@linux ~]$ diagonalsw -q query.fasta -d db.fasta -s blosum50.mat -B
time = 0.00279107
nr of db sequences that needed to use the word sized function due to an overflow in the byte sized function = 0
nr of db sequences = 3
computation speed ( Giga cells/second ) = 0.000489

Note, db.fasta is too small ( containing only 3 sequences ) to give a fair result for Gigacells/second. See also Section 7, “Performance - benchmarked computation speed”.


5. File formats

This software package handles the following file formats

5.1. Fasta format

Read more about the Fasta format on Wikipedia.

Example 6. db.fasta, an example file in fasta format

The example file db.fasta contains

>Alpha
EHIATYYNDQMLKKPTWYVBZX
>Beta
AHIATYYNDQMLKKPTWYVB
>Gamma
AYTMLKKYYNDQPTWYVBYP


5.2. Profile file format

Example 7. 1R69.psi, an example file in the profile file format

The example file 1R69.psi contains

Last position-specific scoring matrix computed, weighted observed percentages rounded down, information per position, and relative weight of gapless real matches to pseudocounts
           A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    1 S    1 -4  3 -3 -4 -3 -3 -4 -4 -4 -5 -3 -4 -5 -4  3  6 -6 -5 -3   13   0  12   0   0   0   0   0   0   0   0   0   0   0   0  25  50   0   0   0  1.12 1.44
    2 I   -4 -5 -6 -7 -4 -4 -6 -7 -6  5  3 -6  1  4 -4 -5 -4 -4 -2  3    0   0   0   0   0   1   0   0   0  31  30   0   3  16   1   0   0   0   1  17  1.04 1.63
    3 S    2 -3 -2 -4 -4 -3 -4  6 -2 -6 -6 -1 -4 -5 -3  1 -3 -6 -4 -4   15   1   1   0   0   1   0  64   1   0   0   3   0   0   1   9   1   0   0   1  1.28 1.67
    4 S    0  0  1  2 -3  3  4 -3  1 -4 -3  1 -2 -5 -2  0 -1 -6 -3 -5    8   4   6  11   0  11  28   1   3   1   2  10   1   0   1   8   3   0   1   0  0.51 1.69
    5 R   -3  7  1 -3 -5  0 -2 -5 -2 -3 -4  3 -3 -4 -5 -3 -2 -3 -3 -3    1  62   6   1   0   3   1   0   1   2   1  15   0   1   0   1   2   0   1   1  1.33 1.68
    6 V   -4 -6 -6 -6 -4 -6 -6 -7 -6  6  4 -6  0 -1 -6 -5 -3 -5 -4  2    0   0   0   0   0   0   0   0   0  45  38   0   2   2   0   0   0   0   0  11  1.27 1.68
    7 K    0  5 -2 -3 -4  0 -1 -5 -2 -3 -4  5 -4 -4 -5 -1 -2 -4 -3 -3    7  35   2   1   0   3   4   0   1   1   2  36   0   0   0   3   2   0   1   2  0.98 1.71
    8 S    1  1 -1 -1 -5  3  3 -3  1 -2 -1  1  0 -1 -4  0  0 -3  1 -2   11   8   2   3   0  11  21   2   4   3   5   7   2   3   0   7   6   0   4   2  0.29 1.70
    9 K    3  2 -4 -5 -2 -1 -1 -4 -2  1  3 -1 -1 -2 -4 -3 -3  1 -1  0   22  13   0   0   1   3   4   1   1   8  31   3   1   2   0   1   0   2   3   5  0.45 1.69
   10 R   -3  8 -4 -5 -3 -2 -3 -6 -4 -3 -2 -1  3 -4 -5 -3 -3 -6 -4 -4    1  82   0   0   1   0   0   0   0   2   4   1   7   1   0   1   1   0   0   1  1.84 1.68
   11 I   -1  1  0 -1 -2  1  3 -3 -2 -1 -1  4  0 -5 -3 -2 -1  0 -5 -3    4   6   4   3   1   7  22   1   1   5   6  28   2   0   1   1   5   1   0   2  0.50 1.68
   12 Q    1  2 -1 -1 -3  2  3 -3  1 -2 -1  2  1 -4 -5 -1 -2 -2 -1 -3   14  10   2   3   0   9  19   1   4   3   8  13   3   1   0   4   2   1   2   1  0.32 1.68
   13 L    1  3  0 -4 -1 -1 -3 -4 -1 -2  2  3  0 -2 -5 -2 -2 -3 -2 -2   13  18   5   0   1   3   1   1   2   2  24  19   3   2   0   3   2   0   1   2  0.44 1.67
   14 G   -2 -1  3  0 -3 -1 -2  6  0 -5 -6  1 -4 -6 -5 -1 -4 -6 -4 -5    2   4  14   4   1   3   2  53   2   1   0  11   0   0   0   3   0   0   0   0  1.04 1.64
   15 L   -4 -3 -6 -6 -3 -4 -4 -6 -3  2  5 -4  5  0 -6 -5 -3  3  1 -1    1   1   0   0   0   0   1   0   1   9  51   0  18   4   0   0   1   5   4   3  1.05 1.63
   16 N   -3 -2  0 -3 -4 -3 -3 -3 -2 -4 -4  0 -4 -5 -3  4  6 -6 -5 -3    0   1   4   1   0   0   1   1   1   0   1   6   0   0   1  34  50   0   0   1  1.17 1.62
   17 Q   -1  0 -3 -3 -5  8 -1 -3 -2 -2  0 -1  0 -5 -3 -3 -3 -5 -4 -2    4   4   0   0   0  67   1   2   0   3   9   1   2   0   1   1   1   0   0   2  1.35 1.63
   18 A    1  2 -1  1 -5  2  3 -2 -1 -4 -4  2 -1 -5 -2  1  0 -6 -2 -2   14  10   2   6   0  10  19   3   1   1   2  11   1   0   2   9   6   0   1   3  0.36 1.64
   19 E    1 -2 -1  3 -6  4  5 -3 -1 -5 -5  0 -2 -5 -4 -1 -2 -6 -2 -3   12   2   2  13   0  16  38   1   1   0   0   5   1   0   0   3   2   0   1   1  0.80 1.64
   20 L   -3 -5 -6 -7 -4 -5 -6 -7 -6  1  6 -6  1  2 -6 -5 -3 -5 -3  1    2   0   0   0   0   0   0   0   0   4  73   0   3   9   0   0   1   0   0   8  1.30 1.63
   21 A    6 -3 -4 -5 -1 -4 -2  1 -5 -5 -5 -3 -4 -5 -4  0 -3 -6 -5 -2   79   1   0   0   1   0   2  10   0   0   0   1   0   0   0   4   0   0   0   2  1.39 1.63
   22 Q   -1  2  0  2 -6  1  4 -2  0 -3 -3  2 -3 -4 -4  0 -1 -6 -3 -3    5  10   3  13   0   5  28   3   2   1   2  14   1   1   0   5   3   0   1   2  0.51 1.61
   23 K    1  4 -3 -3 -3  1  0 -3 -1  0  1  3  1 -3 -4 -2 -3 -3 -2 -2   14  20   1   1   0   5   4   1   2   5  14  20   3   1   0   3   1   0   2   2  0.43 1.71
   24 V    2 -5 -5 -4  2 -5 -5 -3 -5  3  3 -5  2 -1 -5  0  1 -5 -4  3   17   0   0   1   4   0   0   2   0  15  24   0   4   2   0   7   7   0   0  17  0.53 1.67
   25 G   -3 -2  1  1 -3 -3 -2  6 -1 -6 -5 -2 -5 -4 -5 -1 -3 -6 -3 -5    1   2   6   6   1   1   2  71   1   0   1   2   0   1   0   3   1   0   1   1  1.39 1.66
   26 T   -3 -3 -6 -4  0 -3 -5 -6 -6  4  2 -3  2 -1 -6 -3 -1  1 -3  5    1   1   0   1   2   1   0   0   0  23  15   1   4   2   0   1   4   2   1  40  0.84 1.68
   27 T    0 -2  0  1 -2 -2 -2 -1  0 -3 -5  1 -4 -6 -1  5  3 -6 -5 -3    5   2   3   6   1   1   2   4   3   1   1   8   0   0   3  40  21   0   0   1  0.67 1.69
   28 Q   -1  4  0 -2 -3  4  0 -3 -1 -2 -2  2 -4  0  1 -2 -2 -2  0 -1    5  20   4   1   0  19   5   1   1   2   3  13   0   4   7   2   3   1   4   4  0.43 1.67
   29 Q    0  0  1 -2 -5  4 -1  1 -2 -4 -5 -1 -1 -6  1  3  0 -1 -3 -2    7   4   8   1   0  22   3  10   1   1   1   3   2   0   8  21   7   1   1   3  0.47 1.67
   30 S    2 -2 -2 -4 -4  2 -2 -3 -1 -2 -3 -2  1 -1 -4  2  4  0  2 -2   18   1   2   0   0   9   2   1   1   1   2   2   4   3   0  15  28   1   7   2  0.50 1.68
   31 I   -3 -6 -6 -6 -2 -2 -6 -6 -5  6  2 -5  1 -2 -6 -5 -3 -1  1  3    1   0   0   0   1   2   0   0   0  50  15   0   2   1   0   0   1   1   5  19  1.12 1.68
   32 E    0 -1  0 -3 -3  2 -2  0  1 -3 -3 -2 -2 -3 -4  5 -1  0 -1 -4    7   2   5   1   0  10   1   8   3   1   2   2   1   1   0  48   2   2   3   1  0.68 1.69
   33 Q   -2  3  3  0 -6  3 -1 -2  1 -4 -2  4  0 -3 -5 -1 -3 -6 -2 -4    3  15  15   5   0  13   2   3   3   1   5  24   2   1   0   4   1   0   2   1  0.54 1.69
   34 L   -3 -6 -6 -5 -5 -3 -6 -6 -2  5  1 -6  0  0 -6 -4 -3  8  5  0    2   0   0   1   0   1   0   0   1  30  14   0   2   2   0   1   1  22  19   6  1.26 1.69
   35 E   -3 -2 -3 -1 -2 -1  7 -5 -3 -4 -2 -1 -2 -3 -4 -3 -3 -6 -5 -4    1   1   0   1   1   1  85   0   0   1   5   2   1   1   0   0   1   0   0   1  1.70 1.68
   36 N   -1  4  4 -2 -3  0 -3 -3  1 -4 -3  2 -3 -6 -5  2  0 -6 -3 -4    6  23  22   1   1   3   1   1   3   1   2  12   1   0   0  17   6   0   1   1  0.66 1.68
   37 G   -2 -2  2  0 -5 -2 -2  6  0 -6 -6 -3 -3 -6 -5 -1 -3 -6 -4 -4    3   2   9   6   0   1   2  67   2   0   0   1   1   0   0   4   1   0   1   1  1.31 1.65
   38 K   -1  3  0  0 -5  1  2 -2 -1 -2 -1  4 -2 -4 -3 -2  0 -6 -3 -2    4  16   3   4   0   6  15   3   2   3   6  25   1   1   1   1   5   0   1   3  0.44 1.67
   39 T    0  3  2 -2 -5 -1 -2 -2 -2  0 -3 -1 -1 -3 -3  3  4 -5 -1 -1    7  14   8   2   0   2   3   2   1   5   2   2   2   1   1  20  23   0   2   4  0.42 1.69
   40 K   -2  2  1 -2 -6  0 -2 -2 -3 -2 -2  1 -2 -3  6 -1 -2 -6 -4 -2    2  12   7   2   0   5   2   3   0   2   5   9   1   1  42   3   2   0   1   2  0.87 1.72
   41 R   -1  2  2  1 -2 -2 -2  0 -2 -1 -3  0 -2 -4  2  3  0 -6 -3 -2    4  13  10   7   1   2   2   8   1   3   3   5   1   1  10  21   6   0   1   3  0.32 1.76
   42 P    1 -2 -3 -2 -3 -2 -2 -1 -3  1  0 -1  0 -2  3  3  0 -3 -2 -1   11   2   1   2   0   2   2   4   1   8   9   3   2   2  15  24   6   0   1   5  0.30 1.77
   43 R    0  1 -1  2 -5 -1  3 -1 -1 -3 -3  1 -2 -4 -2  2  0 -1 -3 -2    8   8   3  12   0   2  17   6   1   1   3   7   1   0   2  17   6   1   1   2  0.30 1.77
   44 F   -2  1  2  0 -4 -1 -1 -4  0 -2 -1  2  0  1 -3  0  3 -2  1 -1    4   8   8   5   0   3   3   1   2   2   7  13   3   6   1   7  19   1   5   4  0.24 1.78
   45 L   -2 -3 -4 -6 -4 -4 -5 -6 -6  3  5 -4  0 -1 -6 -4 -2 -1 -3  1    3   2   1   0   0   0   1   0   0  13  63   1   2   1   0   1   3   1   1   8  1.04 1.79
   46 P    1 -1 -1 -2 -2  1  0 -3  0  2  1  0  1  0  0 -1 -2  0  0  1   12   3   3   2   1   7   7   2   2  11  11   6   3   5   4   5   2   1   3  11  0.11 1.82
   47 E    1  2 -1  0 -4  2  2 -3 -3 -4 -3  4  0 -6 -3  0 -2 -6 -3 -3   14  11   3   5   0   9  15   2   0   1   3  22   2   0   1   6   2   0   1   2  0.45 1.82
   48 L   -3 -4 -6 -7 -1 -6 -5 -7 -6  5  4 -6  1  0 -6 -5 -3 -1 -2  1    2   1   0   0   1   0   0   0   0  39  43   0   3   3   0   0   1   1   1   6  1.16 1.80
   49 A    6 -3 -4 -4  2 -3 -3 -2 -3 -3 -1 -2 -3 -3 -4  2 -2 -6 -4 -3   65   1   0   1   4   0   1   2   0   1   5   2   0   1   0  15   1   0   0   1  1.09 1.80
   50 S    0  2  2  2 -4  2  3 -1  0 -3 -4  3 -2 -5 -4 -1 -1 -4 -5 -4    8  12   9  10   0   8  16   4   2   2   2  18   1   0   0   5   4   0   0   1  0.42 1.80
   51 A    4 -2 -4 -5  0 -2 -2 -2 -3  1  0 -1 -1  1 -5 -2 -1 -5  0  2   41   2   1   0   2   2   2   2   1   8   8   3   1   5   0   2   3   0   4  14  0.52 1.81
   52 L   -4 -4 -6 -6  1 -4 -5 -6 -3 -1  6 -5  0  4 -6 -5 -2 -4  0 -2    0   1   0   0   3   0   1   0   1   2  68   1   1  17   0   1   2   0   3   1  1.25 1.79
   53 G   -3 -2  3  2 -6  0  1  5  1 -6 -6  0 -4 -6 -4 -1 -4 -6 -6 -5    1   2  15  13   0   4   8  43   3   0   0   5   0   0   1   4   0   0   0   0  0.90 1.81
   54 V   -2 -3 -6 -6  3 -5 -5 -5 -4  3 -1 -3  0 -2 -6 -4  0 -2 -2  6    2   2   0   0   6   0   0   1   0  14   2   1   2   1   0   1   6   1   1  59  1.13 1.81
   55 S   -2 -1  2  3 -4 -2 -1 -3 -1 -4 -5 -2 -5 -6  2  4  2 -6 -5 -4    2   3   9  16   0   1   4   1   1   1   1   2   0   0  12  35  12   0   0   1  0.71 1.80
   56 V    1 -3 -4 -4 -4 -4 -2 -6 -3  3  1 -4  0 -3  4 -3  0 -6 -2  3   10   1   0   0   0   0   2   0   1  15  12   1   2   1  22   2   5   0   1  24  0.64 1.80
   57 D   -1 -3  1  5 -3 -1  2 -2  0 -3 -1 -2  0 -4 -3  1  0 -6 -3 -3    5   1   6  29   1   2  14   3   3   2   7   2   3   1   1  12   5   0   1   2  0.48 1.80
   58 W   -2 -3 -4 -1 -6  0  1 -4 -1 -3 -4 -3 -2  2 -6 -2 -2  9  6 -2    4   1   0   4   0   4   8   1   1   1   1   1   1   8   0   3   3  30  25   2  1.34 1.80
   59 L   -5 -3 -6 -7 -5 -5 -5 -7 -6  0  6 -6  0  4 -6 -5 -4 -4 -3 -1    0   1   0   0   0   0   1   0   0   4  72   0   1  16   0   0   1   0   0   3  1.40 1.77
   60 L    0 -3 -5 -4 -4 -1 -1 -5 -3  2  4 -1  1  4 -5 -3 -3 -4 -1  0    7   1   0   1   0   3   5   0   0  10  35   3   3  20   0   1   1   0   2   6  0.61 1.75
   61 N   -1  0  0 -1 -3 -2  2  2  1 -4 -4 -1 -1  0 -2  1  3 -5  1 -4    5   5   5   3   0   1  12  16   3   1   1   4   1   4   2   9  22   0   5   1  0.33 1.73
   62 G   -2 -1 -2 -1 -5 -3 -1  6 -5 -4 -3 -2 -4 -2 -4  0 -1 -3 -4 -5    2   4   2   3   0   1   5  66   0   1   3   2   0   2   1   5   3   0   1   1  1.17 1.73
   63 T   -1  0  1  1 -5  0  3 -4 -1 -1 -1  1 -1 -3 -1  1  2 -6 -2 -1    5   5   6   9   0   3  19   1   1   4   6   7   1   1   4   9  14   0   1   5  0.25 1.71
   64 S   -2 -1  1  3 -3 -1  1  4 -2 -6 -5 -2 -4 -4  0  1 -1 -6 -4 -4    2   3   5  16   0   2  10  36   1   0   1   2   0   1   5  10   4   0   1   1  0.64 1.69
   65 D   -1 -2  1  4 -6  0  4 -4 -2 -3 -4 -1 -3 -6  0 -1  2 -6 -4 -4    4   2   5  26   0   4  31   1   1   1   1   3   1   0   4   4  11   0   0   1  0.73 1.68
   66 S    1 -3 -3 -3 -5 -2 -2 -3 -3 -2 -3 -2  6 -3  5  2  1 -6 -5 -3   10   1   1   1   0   2   3   1   0   2   1   2  20   1  31  13  10   0   0   1  0.85 1.63
   67 N    2 -1  5  2 -5  1  0 -3  0 -4 -5 -1 -4 -3 -1  1 -1 -6 -1 -3   21   3  31  10   0   5   4   0   2   0   0   4   0   1   2   9   3   0   3   1  0.61 1.46
   68 V    3 -3 -2 -3 -4 -3 -2 -4 -5  3  0 -3 -2 -4 -1 -2 -1 -5  0  4   26   1   2   1   0   1   3   0   0  17   9   1   0   0   4   2   3   0   3  28  0.51 1.33
   69 R   -4  7 -3 -4 -6  2 -2 -4 -3 -5 -5  4 -4 -5 -4 -3 -3 -5 -4 -5    0  69   0   0   0   9   0   0   0   0   0  22   0   0   0   0   0   0   0   0  1.54 1.06

                      K         Lambda
Standard Ungapped    0.1251     0.3089
Standard Gapped      0.0410     0.2670
PSI Ungapped         0.1906     0.3174
PSI Gapped           0.0583     0.2670


5.3. Scoring matrix file format

Example 8. blosum50.mat, an example file in the Scoring matrix file format

The example file blosum50.mat contains

#  Matrix made by matblas from blosum50.iij
#  BLOSUM Clustered Scoring Matrix in 1/3 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 50
#  Entropy =   0.4808, Expected =  -0.3573
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X
A  5 -2 -1 -2 -1 -1 -1  0 -2 -1 -2 -1 -1 -3 -1  1  0 -3 -2  0 -2 -1 -1
R -2  7 -1 -2 -4  1  0 -3  0 -4 -3  3 -2 -3 -3 -1 -1 -3 -1 -3 -1  0 -1
N -1 -1  7  2 -2  0  0  0  1 -3 -4  0 -2 -4 -2  1  0 -4 -2 -3  4  0 -1
D -2 -2  2  8 -4  0  2 -1 -1 -4 -4 -1 -4 -5 -1  0 -1 -5 -3 -4  5  1 -1
C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1 -3 -3 -2
Q -1  1  0  0 -3  7  2 -2  1 -3 -2  2  0 -4 -1  0 -1 -1 -1 -3  0  4 -1
E -1  0  0  2 -3  2  6 -3  0 -4 -3  1 -2 -3 -1 -1 -1 -3 -2 -3  1  5 -1
G  0 -3  0 -1 -3 -2 -3  8 -2 -4 -4 -2 -3 -4 -2  0 -2 -3 -3 -4 -1 -2 -2
H -2  0  1 -1 -3  1  0 -2 10 -4 -3  0 -1 -1 -2 -1 -2 -3  2 -4  0  0 -1
I -1 -4 -3 -4 -2 -3 -4 -4 -4  5  2 -3  2  0 -3 -3 -1 -3 -1  4 -4 -3 -1
L -2 -3 -4 -4 -2 -2 -3 -4 -3  2  5 -3  3  1 -4 -3 -1 -2 -1  1 -4 -3 -1
K -1  3  0 -1 -3  2  1 -2  0 -3 -3  6 -2 -4 -1  0 -1 -3 -2 -3  0  1 -1
M -1 -2 -2 -4 -2  0 -2 -3 -1  2  3 -2  7  0 -3 -2 -1 -1  0  1 -3 -1 -1
F -3 -3 -4 -5 -2 -4 -3 -4 -1  0  1 -4  0  8 -4 -3 -2  1  4 -1 -4 -4 -2
P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3 -2 -1 -2
S  1 -1  1  0 -1  0 -1  0 -1 -3 -3  0 -2 -3 -1  5  2 -4 -2 -2  0  0 -1
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0  0 -1  0
W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1  1 -4 -4 -3 15  2 -3 -5 -2 -3
Y -2 -1 -2 -3 -3 -1 -2 -3  2 -1 -1 -2  0  4 -3 -2 -2  2  8 -1 -3 -2 -1
V  0 -3 -3 -4 -1 -3 -3 -4 -4  4  1 -3  1 -1 -3 -2  0 -3 -1  5 -4 -3 -1
B -2 -1  4  5 -3  0  1 -1  0 -4 -4  0 -3 -4 -2  0  0 -5 -3 -4  5  2 -1
Z -1  0  0  1 -3  4  5 -2  0 -3 -3  1 -1 -4 -1  0 -1 -2 -2 -3  2  5 -1
X -1 -1 -1 -1 -2 -1 -1 -2 -1 -1 -1 -1 -1 -2 -2 -1  0 -3 -1 -1 -1 -1 -1


6. Implementation details

The elements in the score matrix are calculated in another order than for instance in the software projects in Section 8.1, “Striped Smith-Waterman by Farrar” and Section 8.2, “swps3”. In the diagonalsw implementaion the elements will be calculatad as a diagonal vector ( of length 8 or 16 ) slides along the db sequence axis. Reaching the end it iterates the whole thing 8 or 16 positions further down the query sequence axis.

6.1. Rearranging the column vectors into diagonal vectors

The alignment cost is precalculated so that an alignment cost vector for a certain db sequence element can be loaded as a column. These values must later be transferend into a sliding diagonal vector. If k is the length of the vectors, one diagonal vector will contain values from k column vectors. Or a bit more formally:

Let k be the length of the vectors:

If the first load is the vector:

(0,0),(0,1),(0,2),...,(0,k)

and the n:th load 

(n,0),(n,1),(n,2),...,(n,k)

we need to create a vector with

(n,0),(n-1,1),...,(n-k,k)

6.1.1. SSE version

The problem can be solved by loading column vectors and moving the elements around with the _mm_blendv_epi8 instruction. First we initialize some variables ( byte-version, 16 elements per vector )

    const  __m128i mask1=_mm_set_epi8(0,0,0,0,0,0,0,0,255,255,255,255,255,255,255,255);
    const  __m128i mask2=_mm_set_epi8(0,0,0,0,255,255,255,255,0,0,0,0,255,255,255,255);
    const  __m128i mask3=_mm_set_epi8(0,0,255,255,0,0,255,255,0,0,255,255,0,0,255,255);
    const  __m128i mask4=_mm_set_epi8(0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255);
    __m128i v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15;

Then for each step the variable v_score_load is loaded with the new column elements.

v15 = v_score_load;
v7 = _mm_blendv_epi8(v7,v15,mask1);
v3 = _mm_blendv_epi8(v3,v7,mask2);
v1 = _mm_blendv_epi8(v1,v3,mask3);
v0 = _mm_blendv_epi8(v0,v1,mask4);
v_diagonal = v0;

In the next step the variable name numbers in v0,v1,v3,v7,v15 are incremented by 1 and adjusted to be in the range 0 to 15. In other words: newnumber = ( oldnumber + 1 ) modulo 16.

v0 = v_score_load;
v8 = _mm_blendv_epi8(v8,v0,mask1);
v4 = _mm_blendv_epi8(v4,v8,mask2);
v2 = _mm_blendv_epi8(v2,v4,mask3);
v1 = _mm_blendv_epi8(v1,v2,mask4);
v_diagonal = v1;

After 16 iterations the v_diagonal will start to contain the correct diagonal values. This algorithm has one column load per iteration. It can be sligthly modified to to use two column loads per iteration, leading to one _mm_blendv_epi8 instruction less per iteration.

6.1.2. Altivec version

A different method is used for the Altivec version. Here the instruction vec_perm is used. The values are moving through some queue vectors. First we initialize some variables ( this is the word-version, 8 elements per vector )

    vector unsigned short   v_score;
    vector unsigned short   v_score_q1;
    vector unsigned short   v_score_q2;
    vector unsigned short   v_score_q3;
    vector unsigned short   v_score_load; 
    vector unsigned char    queue1_to_score  = (vector unsigned char)(16,17,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    vector unsigned char    queue2_to_queue1 = (vector unsigned char)(0,1,18,19,4,5,6,7,8,9,10,11,12,13,14,15);
    vector unsigned char    queue3_to_queue2 = (vector unsigned char)(16,16,16,16,16,21,16,0,16,1,16,2,16,3,16,4);
    vector unsigned char    queue3_with_load = (vector unsigned char)(23,5,6,7,8,25,9,10,11,27,12,13,29,14,31,16);

Then for each step the variable v_score_load is loaded with the new column elements.

        v_score    = vec_perm(v_score_q1, v_score_load, queue1_to_score);
        v_score_q1 = vec_perm(v_score_q2, v_score_load, queue2_to_queue1);
        v_score_q2 = vec_perm(v_score_q3, v_score_load, queue3_to_queue2);
        v_score_q3 = vec_perm(v_score_q3, v_score_load, queue3_with_load);

v_score will contain the diagonal elements.

6.2. Parallelization

As the sequences in the db sequence database can be analyzed independently, diagonalsw could be parallized in a straightforward way. The initial attempt was done with a tbb::pipeline from the open source library Intel Intel® Threading Building Blocks (TBB). It doesn't scale linearly to the number of cores yet. It is a bit less. Right now there is roughly 3.4 times speedup on a Intel Core2 Quad. Maybe our threading implementation needs some more fine tuning or we need to implement it in another way.

You could test the parallelization speedup by adding a -D ( --disable-threading ). If it does not scale linearly to the number of cores you will get a better computation speeds if you run multiple processes instead.

7. Performance - benchmarked computation speed

This has not been investigated that much. Up to 9 GigaCells/second has been seen on a "Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz" with a binary built with the Intel Compiler. Note that the computation speed depends strongly how often the byte function overflows and also on the length of the query sequence. See also Example 4, “diagonalsw with benchmarking” and Example 5, “diagonalsw with verbose benchmarking”.

8. Related projects

8.1. Striped Smith-Waterman by Farrar

In the article published 2006, Striped Smith-Waterman speeds database searches six times over other SIMD implementations [farrar:2006], Michael Farrar describes a new clever way calculate the score matrix. Here SIMD vectors don't contain adjacent matrix elements, instead the matrix is striped. A SSE version was included in ssearch in the fasta software package ( see also Section 8.3, “ssearch in the fasta software package” ).

A SSE version was also released in the software package found at Smith-Waterman for the Cell Broadband Engine. That version also adds support for the Cell Broadband Engine. The software package is not open source but the source code can be downloaded. To get a permission to use the software you need to contact Michael Farrar ( read more in the file release/COPYRIGHT from the striped.tgz ).

8.2. swps3

swps3 is an open source reimplementation of Farrar's algorithm on IBM Cell/BE and 86/SSE by Adam Szalkowski, Christian Ledergerber, Philipp Krähenbühl and Christophe Dessimoz. The software is presented in the technical note SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and 86/SSE [szalkowski:2008]. It is licensed under the MIT license. Unfortunately swps3 ( as of 2009-07-26 ) contains a bug making it sometimes calculate the wrong result. The bug has been reported to Adam Szalkowski who acknowledges that the bug is reproducible.

8.3. ssearch in the fasta software package

ssearch ( as of 2009-08-01 named ssearch_35 ) in the fasta software package implements the Smith-Waterman algorithm. This software is not open source but the source code can be downloaded. To get a permission to use the software you need to contact David Hudson, Assistant Provost for Research, University of Virginia ( read more in the file fasta-35.4.9/COPYRIGHT ). The SSE version is written by Farrar ( see also Section 8.1, “Striped Smith-Waterman by Farrar” ) and the AltiVec version is written by Erik Lindahl. The Altivec version is the same as the one found in diagonalsw.

References

[farrar:2006] MichaelFarrar “Striped Smith-Waterman speeds database searches six times over other SIMD implementations”. Bioinformatics 23(2):156-161, 2006.

[szalkowski:2008] AdamSzalkowski, Christian, Ledergerber Philipp, Krähenbühl Christophe, Dessimoz “SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and 86/SSE”. BMC Research Notes, 1:107, 2008.


diagonalsw is hosted at

Get diagonalsw at SourceForge.net. Fast, secure and Free Open Source software downloads