How to do it...

Now, let's look at how we can search and fetch data from NCBI databases:

  1. We will start by importing the relevant module and configuring the email address:
from Bio import Entrez, SeqIO
Entrez.email = 'put@your.email.here'

We will also import the module to process sequences. Do not forget to put in the correct email address.

  1. We will now try to find the chloroquine resistance transporter (CRT) gene in Plasmodium falciparum (the parasite that causes the deadliest form of malaria) on the nucleotide database:
handle = Entrez.esearch(db='nucleotide', term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]')
rec_list = Entrez.read(handle)
if rec_list['RetMax'] < rec_list['Count']:
handle = Entrez.esearch(db='nucleotide', term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]', retmax=rec_list['Count'])
rec_list = Entrez.read(handle)

We will search the nucleotide database for our gene and organism (for the syntax of the search string, check the NCBI website). Then, we will read the result that is returned. Note that the standard search will limit the number of record references to 20, so if you have more, you may want to repeat the query with an increased maximum limit. In our case, we will actually override the default limit with retmax. The Entrez system provides quite a few sophisticated ways to retrieve large number of results (for more information, check the Biopython or NCBI Entrez documentation). Although you now have the IDs of all of the records, you still need to retrieve the records properly.

  1. Now, let's try to retrieve all of these records. The following query will download all matching nucleotide sequences from GenBank, which is 481, at the time of writing this book. You probably won't want to do this all the time:
id_list = rec_list['IdList']
hdl = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb')

Well, in this case, go ahead and do it. However, be careful with this technique, because you will retrieve a large amount of complete records, and some of them will have fairly large sequences inside. You risk downloading a lot of data (which would be a strain both on your side and on NCBI servers).

There are several ways around this. One way is to make a more restrictive query and/or download just a few at a time and stop when you have found the one that you need. The precise strategy will depend on what you are trying to achieve. In
any case, we will retrieve a list of records in the GenBank format (which includes sequences, plus a lot of interesting metadata).

  1. Let's read and parse the result:
recs = list(SeqIO.parse(hdl, 'gb'))

Note that we have converted an iterator (the result of SeqIO.parse) to a list. The advantage of doing this is that we can use the result as many times as we want (for example, iterate many times over), without repeating the query on the server. This saves time, bandwidth, and server usage if you plan to iterate many times over. The disadvantage is that it will allocate memory for all records. This will not work for very large datasets; you might not want to do this conversion genome-wide as in the next chapter, Chapter 3, Working with Genomes. We will return to this topic in the last part of this book. If you are doing interactive computing, you will probably prefer to have a list (so that you can analyze and experiment with it multiple times), but if you are developing a library, an iterator will probably be the best approach.

  1. We will now just concentrate on a single record. This will only work if you used the exact same preceding query:
for rec in recs:
if rec.name == 'KM288867':
break
print(rec.name) print(rec.description)

The rec variable now has our record of interest. The rec.description file will contain its human-readable description.

  1. Now, let's extract some sequence features, which contain information such as gene products and exon positions on the sequence:
for feature in rec.features:
if feature.type == 'gene':
print(feature.qualifiers['gene'])
elif feature.type == 'exon':
loc = feature.location
print(loc.start, loc.end, loc.strand)
else:
print('not processed:\n%s' % feature)

If the feature.type is gene, we will print its name, which will be in the qualifiers dictionary. We will also print all the locations of exons. Exons, as with all features, have locations in this sequence: a start, an end, and the strand from where they are read. While all the start and end positions for our exons are ExactPosition, note that Biopython supports many other types of positions. One type of position is BeforePosition, which specifies that a location point is before a certain sequence position. Another type of position is BetweenPosition, which gives the interval for a certain location start/end. There are quite a few more position types; these are just some examples.

Coordinates will be specified in such a way that you will be able to retrieve the sequence from a Python array with ranges easily, so generally, the start will be one before the value on the record, and the end will be equal. The issue of coordinate systems will be revisited in future recipes.

For other feature types, we simply print them. Note that Biopython will provide a human-readable version of the feature when you print it.

  1. We will now look at the annotations on the record, which are mostly metadata that is not related to the sequence position:
for name, value in rec.annotations.items():
print('%s=%s' % (name, value))

Note that some values are not strings; they can be numbers or even lists (for example, the taxonomy annotation is a list).

  1. Last but not least, you can access the fundamental piece of information, the sequence:
print(len(rec.seq))

The sequence object will be the main topic of our next recipe.