书名：Bioinformatics with Python Cookbook
作者名：Tiago Antao
本章字数：1044字
更新时间：2025-02-27 03:42:09

How to do it...

Let's take a look at the following steps:

We will start by inspecting the description of all of the the sequences on the reference genome FASTA file:

from Bio import SeqIO
genome_name = 'PlasmoDB-9.3_Pfalciparum3D7_Genome.fasta'
recs = SeqIO.parse(genome_name, 'fasta')
for rec in recs:
    print(rec.description)

This code should look familiar from the previous chapter, Chapter 2, Next-Generation Sequencing. Let's take a look at part of the output:

Different genome references will have different description lines, but they will generally have important information. In this example, you can see that we have chromosomes, mitochondria, and apicoplast. We can also view chromosome sizes, but we will take the value from the sequence length instead.

Let's parse the description line to extract the chromosome number. We will retrieve the chromosome size from the sequence, and compute the GC content across chromosomes on a window basis:

from Bio import SeqUtils
recs = SeqIO.parse(genome_name, 'fasta')
chrom_sizes = {}
chrom_GC = {}
block_size = 50000
min_GC = 100.0
max_GC = 0.0
for rec in recs:
    if rec.description.find('SO=chromosome') == -1:
        continue
    chrom = int(rec.description.split('_')[1])
    chrom_GC[chrom] = []
    size = len(rec.seq)
    chrom_sizes[chrom] = size
    num_blocks = size // block_size + 1
    for block in range(num_blocks):
        start = block_size * block
        if block == num_blocks - 1:
            end = size
        else:
            end = block_size + start + 1
        block_seq = rec.seq[start:end]
        block_GC = SeqUtils.GC(block_seq)
        if block_GC < min_GC:
            min_GC = block_GC
        if block_GC > max_GC:
            max_GC = block_GC
        chrom_GC[chrom].append(block_GC)
print(min_GC, max_GC)

We have performed a windowed analysis of all chromosomes, similar to what you have seen in the previous chapter, Chapter 2, Next-Generation Sequencing. We started by defining a window size of 50 kbp. This is appropriate for P. falciparum (feel free to vary its size), but you will want to consider other values for genomes with chromosomes that are orders of magnitude different from this.

Note that we are re-reading the file. With such a small genome, it would have been feasible (in step one) to do an in-memory load of the whole genome. By all means, feel free to try this programming style for small genomes – it's faster! However, this code is more generalized for larger genomes.

Note that in the for loop, we ignore the mitochondrion and apicoplast by parsing the SO entry to the description. The chrom_sizes dictionary will maintain the size of chromosomes.

The chrom_GC dictionary is our most interesting data structure and will have a list of a fraction of the GC content for each 50 kbp window. So, for chromosome 1, which has a size of 640,851 bp, there will be 14 entries because this chromosome size has 14 blocks of 50 kbp.

Be aware of two unusual features of the P. falciparum genome: the genome is very AT-rich, that is, GC-poor. Therefore, the numbers that you will get will be very low. Also, chromosomes are ordered based on size (as is common), but starting with the smallest size. The usual convention is to start with the largest size (such as with genomes in humans).

Now, let's perform a genome plot of the GC distribution. We will use shades of blue for the GC content. However, for high outliers, we will use shades of red. For low outliers, we will use shades of yellow:

from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import BasicChromosome

chroms = list(chrom_sizes.keys())
chroms.sort()
biggest_chrom = max(chrom_sizes.values())
my_genome = BasicChromosome.Organism(output_format="png")
my_genome.page_size = (29.7*cm, 21*cm)
telomere_length = 10
bottom_GC = 17.5
top_GC = 22.0
for chrom in chroms:
    chrom_size = chrom_sizes[chrom]
    chrom_representation = BasicChromosome.Chromosome ('Cr %d' % chrom)
    chrom_representation.scale_num = biggest_chrom
    tel = BasicChromosome.TelomereSegment()
    tel.scale = telomere_length
    chrom_representation.add(tel)
    num_blocks = len(chrom_GC[chrom])
    for block, gc in enumerate(chrom_GC[chrom]):
        my_GC = chrom_GC[chrom][block]
        body = BasicChromosome.ChromosomeSegment()
        if my_GC > top_GC:
            body.fill_color = colors.Color(1, 0, 0)
        elif my_GC < bottom_GC:
            body.fill_color = colors.Color(1, 1, 0)
        else:
            my_color = (my_GC - bottom_GC) / (top_GC -bottom_GC)
            body.fill_color = colors.Color(my_color,my_color, 1)
        if block < num_blocks - 1:
            body.scale = block_size
        else:
            body.scale = chrom_size % block_size
        chrom_representation.add(body)
    tel = BasicChromosome.TelomereSegment(inverted=True)
    tel.scale = telomere_length
    chrom_representation.add(tel)
    my_genome.add(chrom_representation)
my_genome.draw('falciparum.png', 'Plasmodium falciparum')

The first line converts the return of the keys method to a list. This was redundant in Python 2, but not in Python 3, where the keys method has a specific dict_keys return type.

We will draw the chromosomes in order (hence the sort). We will need the size of the biggest chromosome (14, in P. falciparum) to make sure that the size of chromosomes is printed with the correct scale (the biggest_chrom variable).

We then create an A4-sized representation of an organism with a PNG output. Note that we will draw very small telomeres of 10 bp. This will produce a rectangular-like chromosome. You can make the telomeres bigger, giving it a roundish representation, or you may have the arguably better idea of using the correct telomere size for your species.

We declare that anything with a GC content below 17.5 percent or above 22.0 percent will be considered an outlier. Remember that for most other species, this will be much higher.

We then print these chromosomes: they are bounded by telomeres and composed of 50 kbp chromosome segments (the last segment is sized with the remainder). Each segment will be colored in blue, with a red-green component based on the linear normalization between two outlier values. Each chromosome segment will either be 50 kbp, or potentially smaller if it's the last one of the chromosome. The output is shown in the following diagram:

Figure 1: The 14 chromosomes of Plasmodium falciparum, color-coded with the GC content (red is more than 22 percent, yellow less than 17 percent, and the blue shades represent a linear gradient between both numbers)

Biopython code has evolved over time, from before Python was such a fashionable language. In the past, the availability of libraries was quite limited. The usage of reportlab can be seen mostly as a legacy issue. I suggest that you learn just enough from it to use it with Biopython. If you are planning on learning a modern plotting library in Python, you will probably want to consider matplotlib, Bokeh, HoloViews, or Python's version of ggplot (or other visualization alternatives, such as Mayavi, Visualization Toolkit (VTK), or even Blender API).

Finally, you can print the image inline in the Notebook:

from IPython.core.display import Image
Image("falciparum.png")