- Bioinformatics with Python Cookbook
- Tiago Antao
How to do it...
Take a look at the following steps:
- Let's start by plotting the distribution of variants across the genome in both files:
%matplotlib inline
from collections import defaultdict
import functools
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import vcf
def do_window(recs, size, fun):
    start = None
    win_res = []
    for rec in recs:
        if not rec.is_snp or len(rec.ALT) > 1:
            continue
        if start is None:
            start = rec.POS
        my_win = 1 + (rec.POS - start) // size
        while len(win_res) < my_win:
            win_res.append([])
        win_res[my_win - 1].extend(fun(rec))
    return win_res
wins = {}
size = 2000
vcf_names = ['centro.vcf.gz', 'standard.vcf.gz']
for vcf_name in vcf_names:
    recs = vcf.Reader(filename=vcf_name)
    wins[vcf_name] = do_window(recs, size, lambda x: [1])
We start by performing the required imports (as usual, remember to remove the first line if you are not using the IPython Notebook). Before explaining the function, note what we are doing.
For both files, we will compute windowed statistics: we will divide our file, which includes 200,000 bp of data, into windows of size 2,000 (100 windows). Every time we find a biallelic SNP, we will add a one to the list related to that window in the window function.
The do_window function takes VCF records, keeps only biallelic SNPs (that is, records where rec.is_snp is true and len(rec.ALT) == 1), determines the window to which each record belongs (by an integer division of the offset of rec.POS by size), and extends that window's list of results with the output of the function passed in the fun parameter (which, in our case, just returns a one).
So, we now have a list of 100 elements (each representing 2,000 base pairs). Each element is another list that contains a one for each biallelic SNP found.
So, if you have 200 SNPs in the first 2,000 base pairs, the first element of the list will have 200 ones.
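To make the windowing logic concrete, here is a minimal, self-contained sketch run on made-up records. MockRec is a hypothetical stand-in for a PyVCF record (only the three attributes do_window touches), and do_window is re-stated so the snippet runs on its own:

```python
from collections import namedtuple

# Hypothetical stand-in for a PyVCF record: just POS, is_snp, and ALT
MockRec = namedtuple('MockRec', ['POS', 'is_snp', 'ALT'])

def do_window(recs, size, fun):
    start = None
    win_res = []
    for rec in recs:
        if not rec.is_snp or len(rec.ALT) > 1:
            continue  # keep only biallelic SNPs
        if start is None:
            start = rec.POS  # windows are anchored at the first SNP seen
        my_win = 1 + (rec.POS - start) // size
        while len(win_res) < my_win:
            win_res.append([])
        win_res[my_win - 1].extend(fun(rec))
    return win_res

# Three SNPs fall in the first 2,000 bp window, one in the second;
# the last record is not a SNP and is skipped
toy_recs = [MockRec(100, True, ['T']), MockRec(500, True, ['G']),
            MockRec(1500, True, ['A']), MockRec(2500, True, ['C']),
            MockRec(2600, False, ['C'])]
toy_wins = do_window(toy_recs, 2000, lambda rec: [1])
print([sum(w) for w in toy_wins])  # [3, 1]
```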
- Let's continue as follows:
def apply_win_funs(wins, funs):
    fun_results = []
    for win in wins:
        my_funs = {}
        for name, fun in funs.items():
            try:
                my_funs[name] = fun(win)
            except Exception:
                my_funs[name] = None
        fun_results.append(my_funs)
    return fun_results
stats = {}
fig, ax = plt.subplots(figsize=(16, 9))
for name, nwins in wins.items():
    stats[name] = apply_win_funs(nwins, {'sum': sum})
    x_lim = [i * size for i in range(len(stats[name]))]
    ax.plot(x_lim, [x['sum'] for x in stats[name]], label=name)
ax.legend()
ax.set_xlabel('Genomic location in the downloaded segment')
ax.set_ylabel('Number of variant sites (bi-allelic SNPs)')
fig.suptitle('Number of bi-allelic SNPs along the genome', fontsize='xx-large')
Here, we perform a plot that contains statistical information for each of our 100 windows. apply_win_funs will calculate a set of statistics for every window. In this case, it will sum all the numbers in the window. Remember that every time we find an SNP, we add 1 to the window list. This means that if we have 200 SNPs, we will have 200 1s; hence, summing them will return 200.
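The reduction step can be checked in isolation with a couple of hypothetical windows; apply_win_funs is re-stated here so the snippet is self-contained:

```python
def apply_win_funs(wins, funs):
    # Apply every named function to every window's list of values
    fun_results = []
    for win in wins:
        my_funs = {}
        for name, fun in funs.items():
            try:
                my_funs[name] = fun(win)
            except Exception:
                my_funs[name] = None  # e.g. a statistic undefined on a window
        fun_results.append(my_funs)
    return fun_results

# Two hypothetical windows: 3 SNPs in the first, 1 in the second
toy_stats = apply_win_funs([[1, 1, 1], [1]], {'sum': sum})
print([s['sum'] for s in toy_stats])  # [3, 1]
```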
So, we are able to compute the number of SNPs per window in an apparently convoluted way. Why we perform things with this strategy will become apparent soon. However, for now, let's check the result of this computation for both files, as shown in the following graph:
- Let's take a look at the sample-level annotation. We will inspect mapping quality zero (refer to https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityZeroBySample.php for details), which is a measure of how well sequences involved in calling this variant map clearly to this position. Note that there is also a MQ0 annotation at the variant level:
mq0_wins = {}
size = 5000
def get_sample(rec, annot, my_type):
    res = []
    samples = rec.samples
    for sample in samples:
        if sample[annot] is None:  # ignoring Nones
            continue
        res.append(my_type(sample[annot]))
    return res
for vcf_name in vcf_names:
    recs = vcf.Reader(filename=vcf_name)
    mq0_wins[vcf_name] = do_window(recs, size, functools.partial(get_sample, annot='MQ0', my_type=int))
Start by inspecting the last for loop; we perform a windowed analysis by reading the MQ0 annotation from each record. We do this by calling the get_sample function, which returns our preferred annotation (in this case, MQ0) cast to a certain type (my_type=int). We use partial function application here: Python allows you to fix some parameters of a function and leave the remaining ones to be supplied later. The most complicated thing here is the functional programming style, but note that it makes it very easy to compute other sample-level annotations; just replace MQ0 with AB, AD, GQ, and so on, and you immediately have a computation for that annotation. If the annotation is not of the integer type, no problem; just adapt my_type. This is a difficult programming style if you are not used to it, but you will soon reap its benefits.
- Now, let's print the median and the 75th percentile for each window (in this case, with a size of 5,000):
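The partial application idea can be seen in isolation with a small sketch. get_annot is a hypothetical helper (not part of PyVCF), and a plain dict stands in for a sample record:

```python
import functools

def get_annot(sample, annot, my_type):
    # sample is any mapping-like object; here a plain dict stands in
    # for a PyVCF sample record (an assumption for illustration)
    value = sample.get(annot)
    return None if value is None else my_type(value)

# Freeze the annotation name and type; only the sample remains to be given
get_mq0 = functools.partial(get_annot, annot='MQ0', my_type=int)
get_gq = functools.partial(get_annot, annot='GQ', my_type=int)

sample = {'MQ0': '3', 'GQ': '99'}
print(get_mq0(sample), get_gq(sample))  # 3 99
```

Swapping the analyzed annotation is then a one-line change, which is exactly why the recipe adopts this style.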
stats = {}
colors = ['b', 'g']
i = 0
fig, ax = plt.subplots(figsize=(16, 9))
for name, nwins in mq0_wins.items():
    stats[name] = apply_win_funs(nwins, {'median': np.median, '75': functools.partial(np.percentile, q=75)})
    x_lim = [j * size for j in range(len(stats[name]))]
    ax.plot(x_lim, [x['median'] for x in stats[name]], label=name, color=colors[i])
    ax.plot(x_lim, [x['75'] for x in stats[name]], '--', color=colors[i])
    i += 1
ax.legend()
ax.set_xlabel('Genomic location in the downloaded segment')
ax.set_ylabel('MQ0')
fig.suptitle('Distribution of MQ0 along the genome', fontsize='xx-large')
Note that we now compute two different statistics in apply_win_funs (the median and the 75th percentile). Again, we pass functions as parameters (np.median and np.percentile), with partial function application done on np.percentile. The result is as follows:
For the standard.vcf.gz file, the median MQ0 is 0 (it is plotted at the very bottom and is almost invisible). This is good, as it suggests that most sequences involved in the calling of variants map clearly to this area of the genome. For the centro.vcf.gz file, MQ0 indicates poor mapping quality. Furthermore, there are areas where the genotype caller cannot find any variants at all (hence the incomplete chart).
- Let's compare heterozygosity with the DP sample-level annotation. Here, we will plot the fraction of heterozygous calls as a function of the sample read depth (DP) for every SNP. First, we will explain the result and then the code that generates it.
The following graph shows the fraction of calls that are heterozygous at a certain depth:
In the preceding graph, there are two considerations to take into account. At a very low depth, the fraction of heterozygous calls is biased low. This makes sense, as the number of reads per position does not allow you to make a correct estimate of the presence of both alleles in a sample. Therefore, you should not trust calls made at very low depth.
As expected, the number of calls in the centromere is way lower than outside it. The distribution of SNPs outside the centromere follows a common pattern that you can expect in many datasets.
The code for this is as follows:
def get_sample_relation(recs, f1, f2):
    rel = defaultdict(int)
    for rec in recs:
        if not rec.is_snp:
            continue
        for sample in rec.samples:
            try:
                v1 = f1(sample)
                v2 = f2(sample)
                if v1 is None or v2 is None:
                    continue  # We ignore Nones
                rel[(v1, v2)] += 1
            except Exception:
                pass  # This is outside the domain (typically None)
    return rel
rels = {}
for vcf_name in vcf_names:
    recs = vcf.Reader(filename=vcf_name)
    rels[vcf_name] = get_sample_relation(recs, lambda s: 1 if s.is_het else 0, lambda s: int(s['DP']))
Start by looking at the for loop. Again, we use functional programming; the get_sample_relation function traverses all SNP records and applies two functional parameters. The first parameter determines heterozygosity, whereas the second acquires the sample DP (remember that there is also a variant-level DP).
Now, as the code is complex enough as it is, I opted for a naive data structure to be returned by get_sample_relation: a dictionary where the key is a pair of results (in this case, heterozygosity and DP) and the value is the count of SNPs that share both values. There are more elegant data structures with different trade-offs: SciPy sparse matrices, pandas DataFrames, or you may want to consider PyTables. The fundamental point here is to have a framework that is general enough to compute relationships between a couple of sample annotations.
Also, be careful with the dimension space of several annotations. For example, if your annotation is of the float type, you might have to round it (if you don't, the size of your data structure might become too big).
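The rounding caveat can be sketched as follows; count_pairs is a hypothetical helper, and the pair values are made up for illustration:

```python
from collections import defaultdict

def count_pairs(values, ndigits=1):
    # Bin float-valued annotation pairs before counting them, so that
    # nearly identical floats collapse into one dictionary key
    rel = defaultdict(int)
    for v1, v2 in values:
        rel[(round(v1, ndigits), round(v2, ndigits))] += 1
    return rel

# Without rounding, these three pairs would produce three distinct keys
pairs = [(0.123, 7.81), (0.121, 7.84), (0.9, 2.0)]
binned = count_pairs(pairs)
print(dict(binned))  # {(0.1, 7.8): 2, (0.9, 2.0): 1}
```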
- Now, let's take a look at the plotting code. Let's do this in two parts. Here is part one:
def plot_hz_rel(dps, ax, ax2, name, rel):
    frac_hz = []
    cnt_dp = []
    for dp in dps:
        hz = 0.0
        cnt = 0
        for khz, kdp in rel.keys():
            if kdp != dp:
                continue
            cnt += rel[(khz, dp)]
            if khz == 1:
                hz += rel[(khz, dp)]
        frac_hz.append(hz / cnt)
        cnt_dp.append(cnt)
    ax.plot(dps, frac_hz, label=name)
    ax2.plot(dps, cnt_dp, '--', label=name)
This function takes a data structure, as generated by get_sample_relation, expecting that the first element of the key tuple is the heterozygosity state (0=homozygote, 1=heterozygote) and the second element is the DP. With this, it generates two lines: one with the fraction of calls that are heterozygous at a certain depth, and the other with the SNP count per depth.
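The per-depth arithmetic can be verified on a toy rel dictionary (the counts are invented for illustration):

```python
# Keys are (hz_state, DP) pairs, values are SNP call counts
toy_rel = {(0, 10): 30, (1, 10): 10,  # at DP=10: 10 of 40 calls are het
           (0, 20): 5, (1, 20): 5}    # at DP=20: 5 of 10 calls are het

def hz_fraction(rel, dp):
    # Total calls at this depth, and the heterozygous subset
    cnt = sum(v for (khz, kdp), v in rel.items() if kdp == dp)
    hz = rel.get((1, dp), 0)
    return hz / cnt

print(hz_fraction(toy_rel, 10), hz_fraction(toy_rel, 20))  # 0.25 0.5
```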
- Now, let's call this function:
fig, ax = plt.subplots(figsize=(16, 9))
ax2 = ax.twinx()
for name, rel in rels.items():
    dps = list(set([x[1] for x in rel.keys()]))
    dps.sort()
    plot_hz_rel(dps, ax, ax2, name, rel)
ax.set_xlim(0, 75)
ax.set_ylim(0, 0.2)
ax2.set_ylabel('Quantity of calls')
ax.set_ylabel('Fraction of Heterozygote calls')
ax.set_xlabel('Sample Read Depth (DP)')
ax.legend()
fig.suptitle('Number of calls per depth and fraction of calls which are Hz', fontsize='xx-large')
Here, we will use two axes. On the left-hand side, we will have the fraction of heterozygous SNPs. On the right-hand side, we will have the number of SNPs. We then call plot_hz_rel for both data files. The rest is standard Matplotlib code.
- Finally, let's compare the variant-level DP with a categorical variant-level annotation (EFF). EFF is provided by SnpEff and tells us (among many other things) the type of SNP (for example, intergenic, intronic, coding synonymous, and coding nonsynonymous). The Anopheles dataset provides this useful annotation. Let's start by extracting variant-level annotations, using the functional programming style:
def get_variant_relation(recs, f1, f2):
    rel = defaultdict(int)
    for rec in recs:
        if not rec.is_snp:
            continue
        try:
            v1 = f1(rec)
            v2 = f2(rec)
            if v1 is None or v2 is None:
                continue  # We ignore Nones
            rel[(v1, v2)] += 1
        except Exception:
            pass
    return rel
The programming style here is similar to get_sample_relation, but we do not delve into the samples. Now, we define the types of effects that we will work with and convert each effect into an integer (as this will allow us to use it as an index, for example, in matrices). Now, think about coding a categorical variable:
accepted_eff = ['INTERGENIC', 'INTRON', 'NON_SYNONYMOUS_CODING', 'SYNONYMOUS_CODING']
def eff_to_int(rec):
    try:
        for annot in rec.INFO['EFF']:
            # We use the first annotation
            master_type = annot.split('(')[0]
            return accepted_eff.index(master_type)
    except ValueError:
        return len(accepted_eff)
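The encoding can be checked on raw EFF strings. SnpEff entries have the shape 'TYPE(details...)'; the strings below are made up for illustration, and eff_str_to_int is a hypothetical string-level variant of eff_to_int:

```python
accepted_eff = ['INTERGENIC', 'INTRON', 'NON_SYNONYMOUS_CODING',
                'SYNONYMOUS_CODING']

def eff_str_to_int(eff_annots):
    try:
        for annot in eff_annots:
            # The effect type is everything before the first '('
            master_type = annot.split('(')[0]
            return accepted_eff.index(master_type)  # use the first annotation
    except ValueError:
        return len(accepted_eff)  # anything unexpected maps to 'OTHER'

print(eff_str_to_int(['INTRON(MODIFIER|||)']))    # 1
print(eff_str_to_int(['UPSTREAM(MODIFIER|||)']))  # 4, i.e. 'OTHER'
```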
- We will now traverse the file; the style should be clear to you now:
eff_mq0s = {}
for vcf_name in vcf_names:
    recs = vcf.Reader(filename=vcf_name)
    eff_mq0s[vcf_name] = get_variant_relation(recs, lambda r: eff_to_int(r), lambda r: int(r.INFO['DP']))
- Finally, we plot the distribution of DP using the SNP effect:
fig, ax = plt.subplots(figsize=(16, 9))
vcf_name = 'standard.vcf.gz'
bp_vals = [[] for x in range(len(accepted_eff) + 1)]
for k, cnt in eff_mq0s[vcf_name].items():
    my_eff, mq0 = k
    bp_vals[my_eff].extend([mq0] * cnt)
sns.boxplot(data=bp_vals, sym='', ax=ax)
ax.set_xticklabels(accepted_eff + ['OTHER'])
ax.set_ylabel('DP (variant)')
fig.suptitle('Distribution of variant DP per SNP type', fontsize='xx-large')
Here, we just plot a boxplot for the noncentromeric file, as shown in the following diagram. The results are as expected: SNPs in coding areas will probably have more depth, as they are in more complex regions that are easier to call than intergenic SNPs: