Bioinformatics: managing BLAST data sources

Continuing our series on bioinformatics tools, we’ll look at how to manage data sources for the BLAST tool. If you missed my last post, you should read it first to get an introduction to BLAST.

You will learn how to:

Extract raw sequence data from a preformatted BLAST database
Create a database with your own data and use it to perform BLAST queries
Create aliases for handling multiple BLAST databases at once

Prerequisites

A medium (4GB RAM or bigger) Exoscale instance running Ubuntu 16.04 LTS
A working copy of NCBI’s BLAST software suite. (If you haven’t done it yet, take a look at our previous guide on setting up BLAST on Exoscale).

Introduction

If you read my last tutorial, you should already know how to set-up the BLAST software suite for your own personal use. I will now show you how to perform many different tasks that will help you work with any kind of sequence data you may have at hand.

Aside from the programs that perform the actual BLAST search, the BLAST software suite contains many utilities that are helpful for extracting sequence data (blastdbcmd), and creating (makeblastdb) and organizing (blastdb_aliastool) BLAST databases. Let’s review each of these utilities along with examples of the common scenarios that make use of them.

Extract sequence data from a BLAST database with blastdbcmd

Let’s recall that for BLAST to be able to perform at the speed intended, it needs to look inside a preformatted database where all the sequence data is optimized for fast retrieval.

This kind of database can be generated from an existing corpus of sequence data (as I will show you later on this post) but NCBI conveniently offers a comprehensive list of them for download. These preformatted databases are available through their FTP site at the following path:

ftp://ftp.ncbi.nlm.nih.gov/blast/db/

You can read a description of the files found in that directory if you have a look at the README file that’s also stored in there. For the purpose of this tutorial, we are going to pick out a small dataset but all the examples shown here should work out the same way regardless of the size of the database you choose to use.

Download and extract the database for all the SwissProt sequence data by issuing the following commands:

# Make sure the $BLASTDB environmental variable is set and points to a valid directory
cd $BLASTDB

# Download the file
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz

# Extract its contents and remove the compressed source
tar -xzvf swissprot.tar.gz
rm swissprot.tar.gz

After this, your $BLASTDB directory should have the following files:

swissprot.00.phr  swissprot.00.pnd  swissprot.00.pog  swissprot.00.ppi  swissprot.00.psi  swissprot.pal
swissprot.00.pin  swissprot.00.pni  swissprot.00.ppd  swissprot.00.psd  swissprot.00.psq

And you should be ready to make use of the SwissProt database across the BLAST software suite.

What is SwissProt?

A quick explanation about the SwissProt initiative and its purpose. There are millions of protein sequences known so far and a very large portion of them came from automated tools and broad genomic/proteomic studies which generate a lot of data but very little information about the biological context involved.

Suppose you have a “new” protein on your hands from which you would like to infer more information. You would definitely try to BLAST it against the set of all known proteins in order to see the function of the other ones that are similar to yours and get a few leads about its purported biological function. If you actually were to perform this using the whole universe of known protein sequences, most of your results would look like:

Another copy of your same query protein that just happened to be reported on a different experiment
A very similar (1 - 2 single changes) protein pointing to a homologue of your protein that exists in a very close evolutionary lineage to the organism from where your protein came from
Proteins that are similar to your query but who don’t have a known biological function assigned to them. The only thing you know is that it was reported to be present within the results of another particular study.

None of these alternatives will give you valuable insights about the functional properties of your protein because it is being matched highly with other proteins that also lack the information you are looking for. It would be really cool if we could try our search against a carefully selected subset of proteins from which we have credible evidence of their biological role. This is what SwissProt was set up to create and it is now the largest source of annotated protein data in the world. Thankfully, it is also available for free.

Back to our example. We have all the SwissProt data ready but what if we need to take a look at the protein sequences that are stored inside this BLAST database?

That’s the functionality that the blastcmd command provides, and we will start with the most trivial example: dumping out all the sequences contained in the database into a FASTA formatted file.

Dumping all sequences inside of a BLAST database

Move to an empty directory in your instance and enter the following command:

blastdbcmd -db swissprot -entry all -out all_sp.fa

We have made use of the following parameters:

-db: The name of the BLAST database that we’re working with
-entry: The pattern that we are going to use to look for proteins with matching identifiers, or all for matching all of them
-out: The name of the file where the results are going to be written. If we don’t specify this parameter, the results will go to STDOUT (i.e. your console).

After a couple seconds you should find an all_sp.fa file with all the sequences from the specified database stored in FASTA format. You can check that out by running head all_sp.fa and getting back the first few lines, which should look like:

>gi|1043220439|sp|U3KPV4.1|A3LT2_HUMAN ...
MALKEGLRAWKRIFWR ...
>gi|1042851791|sp|Q0G819.2|PP2B_CAEEL ...
MASTSAGQNSAAKPDT ...

This will prove very useful for you, as there are many scenarios where we have access only to the preformatted BLAST files but need to have the raw sequence data as well.

A small tip: If you want to count how many sequences are stored inside all_sp.fa, you can do that with grep -c "^>" all_sp.fa. Recall that the FASTA format specifies that each sequence must be preceded by a header line that always starts with a “>” character; so we use grep to count out (-c) how many lines start with this character ("^>").

Dumping a select set of sequences from a BLAST database

You can make use of the -entry parameter in a different way: to filter out the entries you would like to extract from your database. Instead of writing all you could set up a pattern that will be matched against all the entries and only return the ones that are compliant with it. For instance, this is how you would query the database in order to get the sequence data for the protein named LEC_LUFAC:

blastdbcmd -db swissprot -entry "LEC_LUFAC"

Depending on the FASTA headers available in your database you may have a lot of metadata available for querying any given sequence. Among these, the most commonly used are NCBI sequence IDs (e.g. gi:1042851727), database specific accession numbers (e.g. SwissProt:Q6H647.2) and, as we saw previously, its gene/protein name. You can query as many sequences as you want (at once) by writing them as a list separated by commas and using it as the -entry parameter. Here’s an example:

blastdbcmd -db swissprot -entry "1042851727,Q6H647.2,LEC_LUFAC"

Try it out and you should get back the sequence data for those three proteins.

Let’s say you have a long list of queries that you’re interested in but you don’t want to bother putting all of them in the terminal one by one. Well, there’s a handy parameter called -entry_batch which tells the command to read your queries from a particular file you specify; its only requirement is to have every different query string in a different line on the file. This is really useful for plugging in lists of genes that were generated as part of your workflow or that you can export easily from a spreadsheet or another program.

Go ahead and create a file with the following 10 lines:

1042851767
1042851761
1042851759
A3DIE4.1
A3DH98.1
A3DEX5.1
A3DC74.1
PTC3_CAEEL
MALR1_MOUSE
RGTD_RHIL3

Save it with any name you want, we will use query.file for this example. Then let’s invoke blastdbcmd like:

blastdbcmd -db swissprot -entry_batch query.file

And you should get back all sequence data from these ten query strings.

Changing the output format of blastdbcmd

By default, the results that come out of ´blastdbcmd´ are formatted following the guidelines defined by the FASTA format. But you can change this as well, using the -outfmt parameter.

-outfmt is going to fill the results according to a string, which can have any format you want and which is going to be filled according to the format specifiers that you define into it. These specifiers start with an % character and can appear anywhere and as many times as you want inside your format string.

Here are the most commonly used format specifiers:

%f - Output all sequence data and metadata in FASTA format (this is the default behavior of the command)
%s - Sequence data
%a - Database specific accession ID
%g - NCBI sequence ID (i.e. gi)
%i - Sequence ID
%l - Sequence length

So if you are only interested in getting a list of matching sequence IDs and their length, you could do it like:

blastdbcmd -db swissprot -entry_batch query.file -outfmt "%g Length --> %l"

And you should get back something like:

1042851767 Length --> 347
1042851768 Length --> 347
1042851761 Length --> 534
1042851759 Length --> 413
...

This was just an example to show you the flexibility provided by format strings, and this is a common pattern found along many command line utilities, so I’d encourage you to take your time to and get familiar with it.

What if you wanted to include a newline character as part of the format string? You can’t just press Enter because that would get your command to start running immediately. Instead, here’s a quick tip, try pressing Ctrl+V then Ctrl+J. This is how you tell your terminal to input a literal newline character into your command line, without triggering the default behavior associated with an Enter.

Querying the available databases and their information

There’s another couple functions that the blastdbcmd command can perform for you.

Thanks to the many different files that the BLAST suite uses to store its data, it is easy for your $BLASTDB directory to quickly become a mess. blastdbcmd lets you get a list of your currently active databases, by using the -list parameter:

blastdbcmd -list $BLASTDB

This should output the databases that are installed in your $BLASTDB directory and that are available to any of the programs that are part of the BLAST suite.

Another interesting feature of this command allows us to get statistical information about a particular BLAST database. This is accomplished through the -info parameter, which is used like this:

blastdbcmd -db swissprot -info

Go ahead and try it, you should see some useful information about the swissprot database.

Create your own BLAST database using makeblastdb

We’ve seen a lot on how to work with previously existing BLAST databases but there are many scenarios where you would want to provide your own dataset to exploit the fast search functionality that the BLAST algorithm implements.

The BLAST suite comes with a command line utility called makeblastdb which is designed to help you achieve this goal. This is a fundamental step required for the BLAST workflow to perform as intended, but it is often overlooked by many researchers because most of them already work with a preformatted database. When you are working with recently released data or something that you’d like to keep private for the time being, you have no choice but to create your own local BLAST data warehouse.

First of all, we are going to download some raw sequence data that will serve as a starting point for us to create the database. You could use your own, of course, but for this tutorial we are going to be using the full set of nucleotide sequences from the Drosophila Melanogaster genome.

You can find this resource (as well as many others) at the following path on NCBI’s FTP service:

ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/

Let’s download and extract the corresponding file. In an empty directory run:

# Download the file
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/drosoph.nt.gz

# Decompress it (or "inflate" it)
gunzip drosoph.nt.gz

Using the grep -c tip showed previously, you should be able to find out how many sequence entries are contained within the file. (1,170 or something close).

We are now going to convert this set of sequences into a preformatted BLAST database and this is as simple as calling makeblastdb with the following parameters:

-dbtype: The type of sequence that we are reading from the file. Options are nucl and prot, for nucleotide and protein sequences, respectively.
-in: The file where all input sequences are stored.
-out: The path where the resulting BLAST database is going to be stored.
-parse_seqids: An optional argument that tells the command to try to infer sequence IDs and other metadata from the FASTA headers of each available sequence.
-title: An optional title commonly used to better describe the contents inside of the resulting database.

Go on and execute:

makeblastdb -dbtype nucl -in drosoph.nt -out $BLASTDB/drosophila.nt -parse_seqids -title "Drosophila melanogaster - Complete nucleotide sequence set"

After a few moments, you should have a database named drosophila.nt stored inside in your $BLASTDB directory. You can verify this with blastdbcmd -list $BLASTDB to get something like:

$BLASTDB/drosophila.nt Nucleotide
$BLASTDB/swissprot Protein
$BLASTDB/swissprot.00 Protein

From here on, you can work with the tools of the BLAST suite using this newly provided database by referencing it using the -db parameter as usual, like: blastp -db drosophila.nt ....

Creating database aliases with blatsdb_aliastool

Picture the following scenario. You and your team are continuously getting new data from an ongoing sequencing project. You receive the first batch and set up an instance to work with it using BLAST. You tell your team (or setup your scripts) to use the database called Cancer_NT_Jan_2016, which you just set up with all the data you have at this moment.

Fast forward a month and you receive more results. Now you tell everyone that there’s another database called Cancer_NT_Feb_2016 and that they can use both. Then a couple days later, you find out that there were some refinements made to the first dataset and many sequences were removed due to a better quality control. You pack up a new BLAST database and use Cancer_NT_Jan_2016_Rev_1 as its name, to avoid confusion, and then tell anyone what happened.

You probably see where I’m getting to. It is really easy for your BLAST database warehouse to become entangled among multiple files and revisions of the same data. This is a logistical problem that will not allow you to set up a foundation that your users and the tools that plug in to your pipeline can use as a reliable data source, due to the many adjustments that you need to make from time to time.

The BLAST suite helps you solve this kind of problem by allowing you to create aliases, that is, placeholder names that look like and work like BLAST databases but which point out to one (or many) different data sources. Using aliases, you could just tell your team to use Cancer_NT_Latest as their database name and they will always have the most relevant information available, regardless of whether you are adding or removing database sources from it.

To set up an alias we make use of the command blastdbalias_tool, which can take the following parameters:

-dblist: A list of all the databases (separated by spaces) that will become part of this particular alias definition.
-dbtype: The type of sequence data that you are dealing with (nucl or prot). You should not mix nucleotide and protein databases into a single alias. This is not enforced by the program, so it is up to you.
-title: A longer description for your generated alias.
-out: The name (and path) for the newly generated alias. Remember that it needs to be stored at $BLASTDB for it to work.

So, an example invocation that would help us solve our hypothetical scenario, would be:

blastdb_aliastool -dblist "Cancer_NT_Jan_2016_Rev_1 Cancer_NT_Feb_2016" -dbtype nucl -title "All current nucleotide data from our cancer research project." -out $BLASTDB/Cancer_NT_Latest

This was just an example use case but you could use aliases for other different things like:

Combine two or more databases into a single merged database that would make it easier to query all the data contained by them.
Create subsets and different combinations of databases which are relevant for your current research. For instance, suppose that your cancer project sampled different tissues many times over. You could create an alias for each type of tissue, each of them containing all the datasets pertaining to it.

Thank you for reading this post! I hope it’s been helpful to you. Stay tuned for the next part, where I’ll show you how to deploy a parallel implementation of the BLAST suite that will allow you to have multiple Exoscale nodes working together and greatly improve the number of queries you would be able to solve at once.