Book title: Bioinformatics with Python Cookbook
Author: Tiago Antao

How to do it...

Before we start coding, let's take a look at the FASTQ file, which contains many records like the following one:

@SRR003258.1 30443AAXX:1:1:1053:1999 length=51
ACCCCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCACACACACCAACAC
+
=IIIIIIIII5IIIIIII>IIII+GIIIIIIIIIIIIII(IIIII01&III

Line 1 starts with @, followed by a sequence identifier and a description string. The description string varies with the sequencer or database source, but is normally amenable to automated parsing.

The second line holds the DNA sequence, just as in a FASTA file. The third line is a +, sometimes followed by a repeat of the description from the first line.
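
As a minimal sketch of this layout, the four lines of one record can be split by hand. The function name parse_fastq_record and the toy record are hypothetical, introduced only for illustration, and the sketch assumes plain four-line records; for real files, a tested parser such as Biopython's is the safer choice.

```python
def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into its parts.

    Hypothetical helper for illustration only; it assumes the simple
    four-line layout (no wrapped sequence or quality lines).
    """
    header, seq, plus, qual = (line.strip() for line in lines)
    if not header.startswith('@'):
        raise ValueError('line 1 must start with @')
    if not plus.startswith('+'):
        raise ValueError('line 3 must start with +')
    if len(seq) != len(qual):
        raise ValueError('sequence and quality lengths differ')
    # The identifier runs up to the first space; the rest is the description
    identifier, _, description = header[1:].partition(' ')
    return identifier, description, seq, qual


# Toy record (made up for this example, not taken from the data file)
record = ['@read1 toy description', 'ACGTN', '+', 'II5+!']
print(parse_fastq_record(record))
```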

The fourth line contains a quality value for each base on line two. Each character encodes a Phred quality score (http://en.wikipedia.org/wiki/Phred_quality_score), which assigns a probability of error to each base call. This encoding can vary a bit among platforms, so be sure to check which one your specific platform uses.
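
To make the encoding concrete, here is a small sketch that decodes quality characters by hand. It assumes the Sanger convention, in which the score is the character's ASCII code minus 33; some older Illumina pipelines used an offset of 64 instead. The function name decode_phred and the toy quality string are hypothetical.

```python
def decode_phred(quality_string, offset=33):
    """Convert quality characters to integer Phred scores.

    offset=33 is the Sanger convention (an assumption here); some older
    Illumina pipelines used an offset of 64 instead.
    """
    return [ord(char) - offset for char in quality_string]


# Toy quality string (made up for this example)
scores = decode_phred('I5+(!')
print(scores)
```

With the wrong offset, every score is shifted by a constant, which is why checking the platform's convention matters.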

Let's take a look at the following steps:

Let's open the file:

import gzip
from Bio import SeqIO
# Open the compressed file in text mode ('rt') and parse it as FASTQ
recs = SeqIO.parse(gzip.open('SRR003265.filt.fastq.gz', 'rt', encoding='utf-8'), 'fastq')
rec = next(recs)  # fetch only the first record
print(rec.id, rec.description, rec.seq)
print(rec.letter_annotations)

Because the file is GZIP-compressed, we open it with the Python gzip module. We also specify the fastq format. Note that some variations of this format affect the interpretation of the Phred quality scores, so you may need to specify a slightly different format. Refer to http://biopython.org/wiki/SeqIO for all formats.

You should usually store your FASTQ files in a compressed format. Not only do you save a lot of disk space, as these are text files, but you probably also gain some processing time. Although decompressing takes time, it can still be faster than reading a much bigger (uncompressed) file from disk.
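
The following sketch shows the round trip through the gzip module using only the standard library. gzip.open defaults to binary mode, so the 'rt' mode is what makes it return text lines. The file name toy.fastq.gz and its content are made up for illustration.

```python
import gzip
import os
import tempfile

# Toy FASTQ content (made up for this example)
fastq_text = '@read1\nACGT\n+\nIIII\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.fastq.gz')  # hypothetical file name
    # 'wt' writes text; gzip compresses it on the fly
    with gzip.open(path, 'wt', encoding='utf-8') as handle:
        handle.write(fastq_text)
    # 'rt' decompresses and decodes to str, line by line
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        lines = [line.rstrip('\n') for line in handle]

print(lines)
```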

We print the standard fields from the previous recipe, along with the quality scores, which are stored in rec.letter_annotations. As long as we choose the correct parser, Biopython will convert all the Phred encoding letters to logarithmic scores, which we will use soon.

For now, don't do this:

recs = list(recs) # do not do it!

Although this might work with some FASTA files (and with this very small FASTQ file), converting the iterator to a list loads the complete file into memory. With an average FASTQ file, this is the best way to crash your computer. As a rule, always iterate over your file. If you have to perform several operations over it, you have two main options. The first is to perform a single iteration with all operations at once. The second is to open the file several times and repeat the iteration.
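
The first option can be sketched with a generator that yields one record at a time, so that several statistics are collected in a single pass. This is only an illustration under stated assumptions: iter_fastq is a hypothetical helper that expects plain four-line records, and an in-memory io.StringIO with toy data stands in for a real file handle.

```python
import io
from collections import defaultdict
from itertools import islice


def iter_fastq(handle):
    """Yield (header, sequence, quality) one record at a time.

    Hypothetical helper: assumes plain four-line records.
    """
    while True:
        chunk = [line.rstrip('\n') for line in islice(handle, 4)]
        if len(chunk) < 4:
            return
        yield chunk[0], chunk[1], chunk[3]


# Toy data standing in for a real file handle
handle = io.StringIO('@r1\nACGN\n+\nIII!\n@r2\nNCGT\n+\n!III\n')

base_cnt = defaultdict(int)   # statistic 1: nucleotide counts
n_pos_cnt = defaultdict(int)  # statistic 2: N calls per read position
for header, seq, qual in iter_fastq(handle):
    for i, letter in enumerate(seq):
        base_cnt[letter] += 1
        if letter == 'N':
            n_pos_cnt[i + 1] += 1

print(dict(base_cnt), dict(n_pos_cnt))
```

Only one record is held in memory at any time, regardless of the file size.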

Now, let's take a look at the distribution of nucleotide reads:

from collections import defaultdict
recs = SeqIO.parse(gzip.open('SRR003265.filt.fastq.gz', 'rt', encoding='utf-8'), 'fastq')
cnt = defaultdict(int)
for rec in recs:
    for letter in rec.seq:
        cnt[letter] += 1
tot = sum(cnt.values())
for letter, count in cnt.items():  # 'count' avoids shadowing the cnt dictionary
    print('%s: %.2f %d' % (letter, 100. * count / tot, count))

We reopen the file and use defaultdict to maintain a count of each nucleotide in the FASTQ file. If you have never used this Python standard dictionary type, you may want to consider it because it removes the need to initialize dictionary entries: missing keys get a default value for the given type (0 for int).
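
For comparison, the sketch below counts the letters of a toy sequence three ways: with a plain dictionary, with defaultdict, and with collections.Counter, which is an alternative from the standard library that the recipe itself does not use.

```python
from collections import Counter, defaultdict

seq = 'ACGTNACC'  # toy sequence (made up for this example)

# Plain dict: every key must be initialized before it is incremented
plain = {}
for letter in seq:
    if letter not in plain:
        plain[letter] = 0
    plain[letter] += 1

# defaultdict(int): missing keys start at int(), that is, 0
auto = defaultdict(int)
for letter in seq:
    auto[letter] += 1

# Counter: counts the items of any iterable directly
counted = Counter(seq)

print(plain == auto == counted)
```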

Note that there is a residual number of N calls. These are calls in which the sequencer reports an unknown base. In our FASTQ file example, we have cheated a bit because we used a filtered file (the fraction of N calls will be quite low). Expect a much bigger number of N calls in a file that came out of the sequencer unfiltered. In fact, you can even expect the N calls to follow a distinct spatial distribution along the read.

Let's plot the distribution of Ns according to their read position:

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
recs = SeqIO.parse(gzip.open('SRR003265.filt.fastq.gz', 'rt', encoding='utf-8'), 'fastq')
n_cnt = defaultdict(int)
for rec in recs:
    for i, letter in enumerate(rec.seq):
        pos = i + 1
        if letter == 'N':
            n_cnt[pos] += 1
seq_len = max(n_cnt.keys())
positions = range(1, seq_len + 1)
fig, ax = plt.subplots(figsize=(16,9))
ax.plot(positions, [n_cnt[x] for x in positions])
ax.set_xlim(1, seq_len)

The first line only works on IPython and Jupyter Notebook (you should remove it when running a standard Python interpreter) and it will inline any plots. We then import the seaborn library. Although we do not use it explicitly at this point, this library can make matplotlib plots look better. In older seaborn releases, importing it was enough to change the default matplotlib style; in recent releases, you typically need to call sns.set_theme() to apply the seaborn style.

We then open the file to parse it again (remember that we do not store the records in a list, but iterate again). We iterate through the file and record the position of every N. Then, we plot the distribution of Ns as a function of the distance from the start of the sequence:

Figure 1: The number of N calls as a function of the distance from the start of the sequencer read

You will see that there are no N calls until position 25. This is not what you will get from a typical sequencer output. Our example file is already filtered, and the 1000 Genomes filtering rules enforce that no N calls can occur before position 25.

While we cannot study the behavior of Ns in this dataset before position 25 (feel free to use one of your own unfiltered FASTQ files with this code to see how Ns are distributed across read positions), we can see that after position 25, the distribution is far from uniform. There is an important lesson here, which is that the quantity of uncalled bases is position-dependent. So, what about the quality of reads?

Let's study the distribution of Phred scores (that is, the quality of our reads):

recs = SeqIO.parse(gzip.open('SRR003265.filt.fastq.gz', 'rt', encoding='utf-8'), 'fastq')
cnt_qual = defaultdict(int)
for rec in recs:
    for i, qual in enumerate(rec.letter_annotations['phred_quality']):
        if i < 25:
            continue
        cnt_qual[qual] += 1
tot = sum(cnt_qual.values())
for qual, cnt in cnt_qual.items():
    print('%d: %.2f %d' % (qual, 100. * cnt / tot, cnt))

We start by reopening the file (again) and initializing a default dictionary. We then get the phred_quality letter annotation, but we skip the first 25 sequencing positions because of the filtering applied to our FASTQ file (if you have an unfiltered file, you may want to drop this rule). We add each quality score to our default dictionary, and finally print the distribution.

As a short reminder, the Phred quality score, Q, is a logarithmic representation of the probability of an accurate call. The probability that a call is wrong is $P = 10^{-Q/10}$, so the probability of an accurate call is $1 - 10^{-Q/10}$. So, a Q of 10 represents a 90 percent call accuracy, 20 represents 99 percent call accuracy, and 30 will be 99.9 percent. For our file, the maximum accuracy will be 99.99 percent (40). In some cases, values of 60 are possible (99.9999 percent accuracy).
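
These figures can be checked numerically. The sketch below assumes the standard Phred definition, in which the error probability is 10 raised to the power of -Q/10; the function name phred_to_accuracy is hypothetical.

```python
def phred_to_accuracy(q):
    """Return the probability that a base call with Phred score q is correct.

    Assumes the standard Phred definition: P(error) = 10 ** (-q / 10).
    """
    error_probability = 10 ** (-q / 10)
    return 1 - error_probability


for q in (10, 20, 30, 40, 60):
    # Printed as a percentage; floating-point rounding may show in the last digits
    print(q, 100 * phred_to_accuracy(q))
```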

More interestingly, we can plot the distribution of qualities according to their read position:

recs = SeqIO.parse(gzip.open('SRR003265.filt.fastq.gz', 'rt', encoding='utf-8'), 'fastq')
qual_pos = defaultdict(list)
for rec in recs:
    for i, qual in enumerate(rec.letter_annotations['phred_quality']):
        # Skip the first 25 positions and the maximum score (40)
        if i < 25 or qual == 40:
            continue
        pos = i + 1
        qual_pos[pos].append(qual)
poses = sorted(qual_pos)  # read positions in ascending order
vps = [qual_pos[pos] for pos in poses]  # one list of scores per position
fig, ax = plt.subplots(figsize=(16,9))
sns.boxplot(data=vps, ax=ax)
# Label each box with the read position it actually represents
ax.set_xticklabels([str(x) for x in poses])

In this case, we ignore both the first 25 positions of each read (again, remove this rule if you have unfiltered sequencer data) and the maximum quality score for this file (40). In your case, however, you may want to start your plotting analysis with the maximum included, and you may want to check the maximum possible value for your sequencer hardware. Generally, as most calls are made with maximum quality, you may want to remove them if you are trying to understand where quality problems lie.
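
If you only want a quick numeric summary rather than a plot, the same filtering can feed a per-position median. This is a sketch on made-up toy scores, not on the recipe's file: the reads list, the skip value of 1 (standing in for 25), and the max_qual value are all assumptions chosen to keep the example small.

```python
from collections import defaultdict
from statistics import median

# Toy per-read Phred scores (made up); each inner list is one read
reads = [[40, 40, 30, 12], [40, 35, 28, 10], [40, 40, 22, 8]]
max_qual = 40  # assumed maximum score for this toy data
skip = 1       # number of leading positions to ignore (25 in the recipe)

qual_pos = defaultdict(list)
for quals in reads:
    for i, qual in enumerate(quals):
        # Same rule as the recipe: drop leading positions and maximum scores
        if i < skip or qual == max_qual:
            continue
        qual_pos[i + 1].append(qual)

# Median of the remaining scores at each read position
medians = {pos: median(qual_pos[pos]) for pos in sorted(qual_pos)}
print(medians)
```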

Note that we are using seaborn's boxplot function; we are only using this because the output looks slightly better than the standard Matplotlib boxplot. If you prefer not to depend on seaborn, just use the stock matplotlib function. In this case, you will call ax.boxplot(vps) instead of sns.boxplot(data=vps, ax=ax).

As expected, the distribution is not uniform, as shown in the following screenshot:

Figure 2: The distribution of Phred scores as a function of the distance from the start of the sequencer read