microTaboo by MohammedAlJaff

Overview

A common challenge in bioinformatics is to identify unique sequences of a given size, where uniqueness is defined as the absence of other sequences that differ by fewer than a given number of mismatches. Applications include finding sequences unique to a pathogen for infection diagnostics, identifying unique CRISPR target candidate regions, and detecting phage or viral insertions. microTaboo is a tool that addresses these challenges and allows for efficient and extensive mining of entire genomes for nucleotide sequences up to 100 nucleotides long. microTaboo can also be used to find sequences in a FASTA file that occur only once. Moreover, if you want to find unique sequences that not only occur once in your FASTA file but also differ from every other sequence by a number of mismatches, microTaboo should be your tool of choice. All results are written in a two-column comma-separated values format (position, sequence).

Example:
>NC_000913.3 E.coli str K.12 ...
566070,GTACCCG...
566071,TACCCGG...
566072,ACCCGGA...
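
A result file in this format can be read back with a few lines of code. The sketch below is only an illustration and assumes the file looks like the example above: a FASTA-style header line starting with ">" followed by position,sequence rows. The function name parse_results and the shortened example text are hypothetical and not part of microTaboo.

```python
from typing import Dict, List, Tuple


def parse_results(text: str) -> Dict[str, List[Tuple[int, str]]]:
    """Group (position, sequence) rows under the header line that precedes them."""
    records: Dict[str, List[Tuple[int, str]]] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records.setdefault(header, [])
        elif header is not None:
            # Each data row is "position,sequence".
            position, sequence = line.split(",", 1)
            records[header].append((int(position), sequence))
    return records


# Made-up, shortened example in the same layout as the README example.
example = """>NC_000913.3 example header
566070,GTACCCG
566071,TACCCGG
"""
hits = parse_results(example)
```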

microTaboo runs on Windows, Mac OS and Linux with Java 7 or higher. For smaller tasks (say, FASTA files under 10 MB), microTaboo can be run on a laptop or desktop computer with 8 or more gigabytes of RAM.

The key concept

Definition: The k-disjoint problem

(set-theoretic formulation)
Given two sets A and B containing sequences of length W, find all sequences X in A such that X is more than k mismatches away from every sequence Y in B. The result is the so-called k-disjoint set of A and B. The complement of the result is the k-intersection of A and B, i.e. the set of sequences in A that are at most k mismatches away from some sequence in B.

Alternatively

("strings and index" formulation)
Given a positive integer W, a non-negative integer k, and two strings P and T, where P has length n and T has length m, the k-disjoint problem consists of finding all positions i (1 <= i <= n - W + 1) such that the W-long substring of P beginning at i is more than k mismatches away from every W-long substring of T. The k-intersection then consists of all other positions j in P: for each such j, the W-long substring of P beginning at j has some W-long substring in T that is at most k mismatches away.
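
To make the definition concrete, here is a brute-force sketch of the "strings and index" formulation. It is not how microTaboo works internally; it simply compares every window of P against every window of T, and it follows the set-theoretic wording ("more than k mismatches"). The function names are hypothetical, positions are 1-based, and the special handling of N bases is ignored.

```python
from typing import List


def hamming(x: str, y: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))


def windows(s: str, w: int) -> List[str]:
    """All substrings of length w, in order of their start position."""
    return [s[i:i + w] for i in range(len(s) - w + 1)]


def k_disjoint_positions(p: str, t: str, w: int, k: int) -> List[int]:
    """1-based start positions in p whose w-long window is more than k
    mismatches away from every w-long window of t (plain brute force)."""
    targets = set(windows(t, w))
    return [
        i + 1
        for i, x in enumerate(windows(p, w))
        if all(hamming(x, y) > k for y in targets)
    ]


print(k_disjoint_positions("AAAATTTT", "AAAACCCC", 4, 1))  # [3, 4, 5]
```

Every position not returned belongs to the k-intersection. The brute-force comparison is only practical for toy inputs.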

How to use microTaboo to solve your k-disjoint problem

Note: microTaboo requires Java SE Runtime Environment 7

In this section, we show you how to use microTaboo to solve your k-disjoint problem. Now you might be thinking: "I don't have one". That's okay, we've got you covered. We have prepared a folder containing three example runs, which you can download by clicking the right-most button at the top of the page or from the repository. Each example run contains microTaboo and all necessary folders with their FASTA files. All you need to do is download the folder and follow the readme, first in the main folder and then in the respective sub-folders. The example runs are meant to check that microTaboo runs properly on your system and to give you some practice using it.

Example runs aside, below we give a step-by-step guide for you to run microTaboo on your own FASTA files.
First, decide what microTaboo should do for you. If you want it to find a k-disjoint set, follow the instructions directly below. If instead your aim is to find sequences that occur only once in a FASTA file, see the next section.

1. Create a folder/directory. You can name it whatever you want, as long as you know where it is. For the sake of example, we call our folder/directory F.

2. Download the microTaboo.jar file from the website and place it in the folder F.

3. Create three new (sub) folders/directories inside our main folder F. For simplicity and consistency, we call these new folders A, B and R.

4. Inside A, put the FASTA files in which you want sequences to be found. Inside B, put all FASTA files containing the sequences that your results should not be close to. As an added benefit, microTaboo can deal with subfolders inside B, so you can group your FASTA files if you like.

5. To run microTaboo, open a command line tool, navigate to the folder F and type a command of the following form:

> java -Xmx[a number]m -jar microTaboo.jar [name of folder A] [name of folder B] [name of folder R] [sequence length W] [number of mismatches k] d/i/a [number of cores] s/m

For our example, we would write the following:
> java -Xmx6000m -jar microTaboo.jar A B R 20 3 d 3 s
That is, we're specifying that we would like microTaboo to give us the k-disjoint set, with k=3, of the contents of A and B and to put the result files into the subfolder R. In other words, we would like microTaboo to give us a file containing all sequences of length 20 in A that are more than 3 mismatches away from every sequence in B.
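
If you launch microTaboo from a script, it can help to assemble the arguments in one place. The sketch below only reproduces the argument order of the command template above; the helper name build_command and its parameter names are hypothetical, and the mode and final flag values are passed through unchanged without interpreting them.

```python
from typing import List


def build_command(xmx_mb: int, query_dir: str, taboo_dir: str, result_dir: str,
                  w: int, k: int, mode: str, cores: int, run_flag: str,
                  jar: str = "microTaboo.jar") -> List[str]:
    """Return the argument list in the order used by the README's template."""
    return [
        "java", f"-Xmx{xmx_mb}m", "-jar", jar,
        query_dir, taboo_dir, result_dir,
        str(w), str(k), mode, str(cores), run_flag,
    ]


# Same values as the example run: folders A, B, R; W=20; k=3; mode d; 3 cores; flag s.
command = build_command(6000, "A", "B", "R", 20, 3, "d", 3, "s")
print(" ".join(command))
```

The resulting list can be handed to Python's subprocess.run(command, check=True) from inside the folder F.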

There you go. You’ve hopefully now run microTaboo successfully on your own FASTA files to solve your k-disjoint or k-intersection problem.

How to use microTaboo to find sequences in a FASTA file that occur only once

In this section, we'll show you how microTaboo can be used to find all sequences of a given length in a FASTA file, e.g. a whole genome sequence, that occur only once. In other words, we'll find the sequences in a FASTA file that are unique relative to the file itself. Below are instructions on how to do this with microTaboo.

1. Create two separate folders, each containing the FASTA file(s) for the same organism of interest, and a third folder where the results should be saved. We call these folders A, B and R for simplicity.

2. Run microTaboo from the command line by typing:
java -Xmx[#]m -jar path/microTaboo.jar path/A path/B path/R [#w] [#k] [d] [#cores] [m]

-Xmx[#]m sets the maximum amount of memory (in MB) microTaboo can use.
The paths to microTaboo.jar, A, B and R can be either relative or absolute.
#w is the word length (in bases). It should be at least 6 and a multiple of 3, 4 or 5; a multiple of 5 is optimal.
#k is the mismatch threshold and can be any number greater than or equal to 0. Using #k greater than 5 is not recommended.
d means disjoint, which corresponds to unique sequences.
#cores sets how many cores microTaboo should use; the minimum is 1.
m is a flag that is required when running a FASTA file against itself.

Example:
java -Xmx6000m -jar path/microTaboo.jar path/A path/B path/R 20 2 d 5 m
This finds all sequences of length 20 bases that are unique within 2 mismatches, using 5 cores and allowing microTaboo to use a maximum of 6 GB of RAM.
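
Before starting a long run, the parameter rules listed above can be checked up front. The sketch below is a hypothetical helper, not part of microTaboo: it treats the stated requirements (word length at least 6 and a multiple of 3, 4 or 5; k at least 0; at least 1 core) as errors and the stated recommendations (multiple of 5, k no greater than 5) as warnings.

```python
from typing import List, Tuple


def check_parameters(w: int, k: int, cores: int) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for a planned run, based on the README's rules."""
    errors: List[str] = []
    warnings: List[str] = []
    if w < 6:
        errors.append("word length should be at least 6")
    if all(w % d != 0 for d in (3, 4, 5)):
        errors.append("word length should be a multiple of 3, 4 or 5")
    elif w % 5 != 0:
        warnings.append("a multiple of 5 is optimal for the word length")
    if k < 0:
        errors.append("mismatch threshold should be 0 or greater")
    elif k > 5:
        warnings.append("a mismatch threshold greater than 5 is not recommended")
    if cores < 1:
        errors.append("at least 1 core is required")
    return errors, warnings


print(check_parameters(20, 2, 5))  # ([], [])
```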

Note on the presence of uncertain bases in a genome/FASTA-file.

When dealing with unknown/uncertain bases (N), microTaboo follows the self-imposed convention that N compared to N counts as 1 mismatch. Because of this, sequences containing many unknown bases, i.e. many N's, will have a high mismatch count when compared to any other sequence. In the extreme case of a W nt sequence consisting only of N's, the number of mismatches when it is compared to itself is W. If a microTaboo run is initiated with a mismatch parameter k < W, such sequences will be included in the disjoint set or the unique set.

Additionally, should unknown bases other than N's be present in an input FASTA file, microTaboo will convert them to N's and treat them as such.
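
The convention can be illustrated with a small mismatch counter. This is a sketch of the rule as described in this note, not microTaboo's actual code; the function names are hypothetical, and upper-casing the input before comparison is an assumption made here for the example.

```python
def normalize(seq: str) -> str:
    """Replace every symbol other than A, C, G, T with N.
    Upper-casing first is an assumption of this sketch."""
    return "".join(c if c in "ACGT" else "N" for c in seq.upper())


def mismatches(x: str, y: str) -> int:
    """Count mismatches between equal-length sequences, where N never
    matches anything, not even another N."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    x, y = normalize(x), normalize(y)
    return sum(1 for a, b in zip(x, y) if a != b or a == "N")


print(mismatches("NNNN", "NNNN"))  # 4: an all-N window mismatches itself W times
print(mismatches("ACRT", "ACRT"))  # 1: R is converted to N, and N vs N is a mismatch
```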

Note on RAM limitations and FASTA file size limitations.

-FASTA files larger than 50 MB need to be chopped up and given FASTA headers if they are to be placed in B.
-RAM allocation should be based on your largest FASTA file: give microTaboo 1 GB per MB in your largest file.
-FASTA files placed in the "A" directory, i.e. your query files, should be smaller than 500 MB. Otherwise, it is recommended to chop them up with appropriate FASTA headers.
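
The RAM rule of thumb can be turned into a -Xmx value with a short calculation. The sketch below is a hypothetical helper that assumes 1 MB means 1,000,000 bytes and that 1 GB is written as 1000m, matching the -Xmx6000m example used for 6 GB in this README. File sizes are rounded up to the next whole MB.

```python
from pathlib import Path


def xmx_flag(largest_file_bytes: int) -> str:
    """Suggest a -Xmx flag using the rule '1 GB of RAM per MB of the largest file'."""
    megabytes = max(1, -(-largest_file_bytes // 1_000_000))  # ceiling division
    return f"-Xmx{megabytes * 1000}m"


def largest_file_bytes(directory: str) -> int:
    """Size in bytes of the largest file under a directory, subfolders included."""
    sizes = [p.stat().st_size for p in Path(directory).rglob("*") if p.is_file()]
    return max(sizes, default=0)


print(xmx_flag(6_000_000))  # -Xmx6000m
```

For a real run, xmx_flag(largest_file_bytes("B")) gives a starting value for the taboo directory.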

Note on accepted file types

Make sure all files in the query directory (A) and the "taboo" directory (B) are in FASTA format and have one of the following file endings: .txt, .fna or .fasta. Also, make sure the directories contain no files other than the desired FASTA files.
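
A quick check for stray files can catch problems before a run. The sketch below is a hypothetical helper that lists every file whose name does not end in one of the accepted endings. Comparing endings case-insensitively is an assumption of this example, and the check looks only at file names, not at whether the contents are valid FASTA.

```python
from pathlib import Path
from typing import Iterable, List

ACCEPTED_SUFFIXES = (".txt", ".fna", ".fasta")


def unexpected_names(names: Iterable[str]) -> List[str]:
    """Names that do not end in one of the accepted file endings."""
    return sorted(n for n in names if not n.lower().endswith(ACCEPTED_SUFFIXES))


def unexpected_files(directory: str) -> List[str]:
    """Same check for every file under a directory, subfolders included."""
    names = [str(p) for p in Path(directory).rglob("*") if p.is_file()]
    return unexpected_names(names)


print(unexpected_names(["genome.fasta", "phage.fna", "notes.md", "reads.txt"]))  # ['notes.md']
```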

microTaboo

A general k-disjoint solver for Bioinformatic pipelines.
