- Bioinformatics with Python Cookbook
- Tiago Antao
- 1044字
- 2025-02-27 03:42:09
How to do it...
Let's take a look at the following steps:
- We will start by inspecting the description of all of the the sequences on the reference genome FASTA file:
from Bio import SeqIO
genome_name = 'PlasmoDB-9.3_Pfalciparum3D7_Genome.fasta'
recs = SeqIO.parse(genome_name, 'fasta')
for rec in recs:
print(rec.description)
This code should look familiar from the previous chapter, Chapter 2, Next-Generation Sequencing. Let's take a look at part of the output:

Different genome references will have different description lines, but they will generally have important information. In this example, you can see that we have chromosomes, mitochondria, and apicoplast. We can also view chromosome sizes, but we will take the value from the sequence length instead.
- Let's parse the description line to extract the chromosome number. We will retrieve the chromosome size from the sequence, and compute the GC content across chromosomes on a window basis:
from Bio import SeqUtils
recs = SeqIO.parse(genome_name, 'fasta')
chrom_sizes = {}
chrom_GC = {}
block_size = 50000
min_GC = 100.0
max_GC = 0.0
for rec in recs:
if rec.description.find('SO=chromosome') == -1:
continue
chrom = int(rec.description.split('_')[1])
chrom_GC[chrom] = []
size = len(rec.seq)
chrom_sizes[chrom] = size
num_blocks = size // block_size + 1
for block in range(num_blocks):
start = block_size * block
if block == num_blocks - 1:
end = size
else:
end = block_size + start + 1
block_seq = rec.seq[start:end]
block_GC = SeqUtils.GC(block_seq)
if block_GC < min_GC:
min_GC = block_GC
if block_GC > max_GC:
max_GC = block_GC
chrom_GC[chrom].append(block_GC)
print(min_GC, max_GC)
We have performed a windowed analysis of all chromosomes, similar to what you have seen in the previous chapter, Chapter 2, Next-Generation Sequencing. We started by defining a window size of 50 kbp. This is appropriate for P. falciparum (feel free to vary its size), but you will want to consider other values for genomes with chromosomes that are orders of magnitude different from this.
Note that we are re-reading the file. With such a small genome, it would have been feasible (in step one) to do an in-memory load of the whole genome. By all means, feel free to try this programming style for small genomes – it's faster! However, this code is more generalized for larger genomes.
- Note that in the for loop, we ignore the mitochondrion and apicoplast by parsing the SO entry to the description. The chrom_sizes dictionary will maintain the size of chromosomes.
The chrom_GC dictionary is our most interesting data structure and will have a list of a fraction of the GC content for each 50 kbp window. So, for chromosome 1, which has a size of 640,851 bp, there will be 14 entries because this chromosome size has 14 blocks of 50 kbp.
- Now, let's perform a genome plot of the GC distribution. We will use shades of blue for the GC content. However, for high outliers, we will use shades of red. For low outliers, we will use shades of yellow:
from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import BasicChromosome
chroms = list(chrom_sizes.keys())
chroms.sort()
biggest_chrom = max(chrom_sizes.values())
my_genome = BasicChromosome.Organism(output_format="png")
my_genome.page_size = (29.7*cm, 21*cm)
telomere_length = 10
bottom_GC = 17.5
top_GC = 22.0
for chrom in chroms:
chrom_size = chrom_sizes[chrom]
chrom_representation = BasicChromosome.Chromosome ('Cr %d' % chrom)
chrom_representation.scale_num = biggest_chrom
tel = BasicChromosome.TelomereSegment()
tel.scale = telomere_length
chrom_representation.add(tel)
num_blocks = len(chrom_GC[chrom])
for block, gc in enumerate(chrom_GC[chrom]):
my_GC = chrom_GC[chrom][block]
body = BasicChromosome.ChromosomeSegment()
if my_GC > top_GC:
body.fill_color = colors.Color(1, 0, 0)
elif my_GC < bottom_GC:
body.fill_color = colors.Color(1, 1, 0)
else:
my_color = (my_GC - bottom_GC) / (top_GC -bottom_GC)
body.fill_color = colors.Color(my_color,my_color, 1)
if block < num_blocks - 1:
body.scale = block_size
else:
body.scale = chrom_size % block_size
chrom_representation.add(body)
tel = BasicChromosome.TelomereSegment(inverted=True)
tel.scale = telomere_length
chrom_representation.add(tel)
my_genome.add(chrom_representation)
my_genome.draw('falciparum.png', 'Plasmodium falciparum')
The first line converts the return of the keys method to a list. This was redundant in Python 2, but not in Python 3, where the keys method has a specific dict_keys return type.
We will draw the chromosomes in order (hence the sort). We will need the size of the biggest chromosome (14, in P. falciparum) to make sure that the size of chromosomes is printed with the correct scale (the biggest_chrom variable).
We then create an A4-sized representation of an organism with a PNG output. Note that we will draw very small telomeres of 10 bp. This will produce a rectangular-like chromosome. You can make the telomeres bigger, giving it a roundish representation, or you may have the arguably better idea of using the correct telomere size for your species.
We declare that anything with a GC content below 17.5 percent or above 22.0 percent will be considered an outlier. Remember that for most other species, this will be much higher.
We then print these chromosomes: they are bounded by telomeres and composed of 50 kbp chromosome segments (the last segment is sized with the remainder). Each segment will be colored in blue, with a red-green component based on the linear normalization between two outlier values. Each chromosome segment will either be 50 kbp, or potentially smaller if it's the last one of the chromosome. The output is shown in the following diagram:

Figure 1: The 14 chromosomes of Plasmodium falciparum, color-coded with the GC content (red is more than 22 percent, yellow less than 17 percent, and the blue shades represent a linear gradient between both numbers)
- Finally, you can print the image inline in the Notebook:
from IPython.core.display import Image
Image("falciparum.png")