A Study of Location-Based Mutation Patterns in the SARS-CoV-2 Genome
This past semester, I worked with a partner on a research project for a computational molecular biology and genomics course. We wanted to investigate the role location may play in the evolutionary landscape of the SARS-CoV-2 genome, especially in light of new waves of viral spread and the new sequence data made available in recent months. At a high level, our work involved cleaning and processing thousands of SARS-CoV-2 sequences. We wrote Python scripts to eliminate corrupted sequences, identify segregating sites and single nucleotide polymorphisms (SNPs), plot allele frequencies at sites of interest over time, and create other visualizations for understanding potential location-mediated mutation patterns. From this study, we identified a triple mutant on the coronavirus nucleocapsid protein that may have implications for understanding virus adaptation to human hosts.
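As a rough illustration of the filtering and segregating-site steps described above, here is a minimal sketch. It is not the project's actual script: the function names, the ambiguity threshold, and the toy alignment are all assumptions made for this example, and it assumes the sequences are already aligned to a common coordinate system.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def is_clean(seq, max_ambiguous_fraction=0.01):
    """Return True if the share of non-ACGT characters is within the threshold.

    The threshold is a hypothetical choice, not the value used in the project.
    """
    if not seq:
        return False
    ambiguous = sum(1 for base in seq.upper() if base not in VALID_BASES)
    return ambiguous / len(seq) <= max_ambiguous_fraction


def segregating_sites(aligned_seqs):
    """Map each 1-based position with more than one unambiguous base to its base counts."""
    sites = {}
    for index, column in enumerate(zip(*aligned_seqs)):
        counts = Counter(b.upper() for b in column if b.upper() in VALID_BASES)
        if len(counts) > 1:
            sites[index + 1] = counts
    return sites


# Toy alignment (made up); the last sequence is mostly ambiguous and is dropped.
toy_alignment = ["ACGTGGG", "ACGTAAC", "ACTTAAC", "ACGTNNN"]
clean = [s for s in toy_alignment if is_clean(s, max_ambiguous_fraction=0.1)]
snps = segregating_sites(clean)
print(sorted(snps))
```

On real data, the same idea would be applied to a multiple sequence alignment against a reference genome, so that positions such as 28881 refer to reference coordinates.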
The GitHub repository for this project can be found here.
Read the paper abstract below:
Abstract
Severe acute respiratory syndrome coronavirus type 2 (SARS-CoV-2) has devastated populations globally, and continues to pose extreme challenges not only to public health, but also to the scientific methods used to understand and analyze epidemiology. Because the spread of SARS-CoV-2 is rapid and ever-changing in its dynamics, many recent research efforts in computational biology have focused on painting a more comprehensive picture of virus-host interactions and viral evolution. While existing work has, with varying degrees of success, identified several mutations in the coronavirus genome that may be linked to transmission susceptibility and other factors important for understanding infection, little work has successfully characterized mutation evolution in the context of small-scale geographic location. Thus, this study builds on previous work by analyzing single nucleotide polymorphisms in location-tagged SARS-CoV-2 sequences via evaluations of allele frequency fluctuations over time and genotype-annotated phylogenetic trees. Through the present investigation and analysis, a triple mutant from GGG to AAC at loci 28881, 28882, and 28883 was identified on the nucleocapsid protein of SARS-CoV-2. Further phylogeographic analysis suggests that this triplet of mutations may confer a selective advantage, paving the way for further analysis of viral adaptation to human hosts.

A table of information about SNPs identified in our dataset of 1241 SARS-CoV-2 sequences. Loci with matching distributions and identical time plots by month are highlighted in the same color. For instance, we observed the same distribution of ancestral and derived alleles at loci 18998 and 29540, and also observed identical time plots by month for those loci.
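One way to flag loci like these is to check whether the derived allele at two loci is carried by exactly the same set of samples, since that implies matching allele counts and identical monthly plots. The sketch below uses this stricter criterion as an assumption; the function name and the sample IDs (`s1` to `s4`) are hypothetical, and the carrier sets are toy data rather than our results.

```python
from collections import defaultdict


def group_matching_loci(derived_carriers):
    """Group loci whose derived allele is carried by exactly the same samples.

    derived_carriers maps locus -> set of sample IDs carrying the derived allele.
    Returns a list of locus groups with at least two members.
    """
    groups = defaultdict(list)
    for locus, carriers in derived_carriers.items():
        groups[frozenset(carriers)].append(locus)
    return [sorted(loci) for loci in groups.values() if len(loci) > 1]


# Toy carrier sets (made up) keyed by loci mentioned in the text.
toy_carriers = {
    18998: {"s1", "s4"},
    29540: {"s1", "s4"},
    23403: {"s2", "s3", "s4"},
    28881: {"s2"},
    28882: {"s2"},
    28883: {"s2"},
}
matching = group_matching_loci(toy_carriers)
print(matching)
```

Each returned group could then be assigned one highlight color in the table.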

Sample of allele frequencies by month plotted from March to October 2020 for pairs of variants at a given SNP. Each data point represents the total frequency of ancestral or derived alleles for the corresponding month. While the majority of the plots (loci 619, 1917, 3037, 4113, 8078, 8782, 10851, 11083, 14408, 14912, 17247, 20005, 20755, 23403, 26144, and 28144) appear to show some level of fixation, we identified an interesting pattern at three contiguous loci: 28881, 28882, and 28883.
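The monthly frequencies behind plots like these can be computed along the following lines. This is a minimal sketch under stated assumptions: each record is a hypothetical `(month, aligned_sequence)` pair, frequencies are expressed as proportions within each month (raw counts would work equally well), and the toy records are invented for illustration.

```python
from collections import Counter, defaultdict

VALID_BASES = set("ACGT")


def monthly_allele_frequencies(records, locus, ancestral):
    """Return {month: (ancestral_fraction, derived_fraction)} at a 1-based locus.

    records is an iterable of (month, aligned_sequence) pairs, with month
    given as a sortable string such as "2020-03". Ambiguous bases are skipped.
    """
    counts = defaultdict(Counter)
    for month, seq in records:
        base = seq[locus - 1].upper()
        if base not in VALID_BASES:
            continue
        label = "ancestral" if base == ancestral else "derived"
        counts[month][label] += 1

    freqs = {}
    for month in sorted(counts):
        total = sum(counts[month].values())
        freqs[month] = (
            counts[month]["ancestral"] / total,
            counts[month]["derived"] / total,
        )
    return freqs


# Toy three-base "genomes" (made up) standing in for full aligned sequences.
toy_records = [
    ("2020-03", "GGG"), ("2020-03", "GGG"),
    ("2020-04", "GGG"), ("2020-04", "AAC"),
    ("2020-05", "AAC"), ("2020-05", "ANC"),
]
toy_freqs = monthly_allele_frequencies(toy_records, locus=2, ancestral="G")
print(toy_freqs)
```

Plotting the two fractions against month for each locus gives one ancestral and one derived curve per SNP.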

Heatmap of derived allele frequencies by county. All SNP loci are represented on the x-axis, all counties are represented on the y-axis, and each county-locus pair is colored by the frequency of the derived allele at that locus and in that county. This visualization allows us to compare SNP distributions across counties by identifying similar bands in the heatmap. For instance, we observe a pattern of banding that occurs in the data from Manhattan, the Bronx, Brooklyn, Queens, and New York. Likewise, similar banding can be observed for data from Albany and Montgomery.
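The matrix underlying such a heatmap can be built roughly as follows. This is a sketch rather than the project's code: the function name, the `(county, aligned_sequence)` input format, and the toy samples with three-position "genomes" are assumptions for illustration.

```python
from collections import defaultdict

VALID_BASES = set("ACGT")


def derived_frequency_matrix(samples, ancestral_alleles):
    """Build a county-by-locus matrix of derived allele frequencies.

    samples is an iterable of (county, aligned_sequence) pairs and
    ancestral_alleles maps 1-based locus -> ancestral base.
    Returns (counties, loci, matrix); cells with no usable calls are NaN.
    """
    loci = sorted(ancestral_alleles)
    called = defaultdict(lambda: defaultdict(int))
    derived = defaultdict(lambda: defaultdict(int))
    seen_counties = set()
    for county, seq in samples:
        seen_counties.add(county)
        for locus in loci:
            base = seq[locus - 1].upper()
            if base not in VALID_BASES:
                continue  # skip ambiguous calls
            called[county][locus] += 1
            if base != ancestral_alleles[locus]:
                derived[county][locus] += 1

    counties = sorted(seen_counties)
    matrix = [
        [
            derived[c][l] / called[c][l] if called[c][l] else float("nan")
            for l in loci
        ]
        for c in counties
    ]
    return counties, loci, matrix


# Toy samples (made up); county names are taken from the caption above.
toy_samples = [
    ("Albany", "GAT"), ("Albany", "GCT"),
    ("Queens", "ACT"), ("Queens", "ACC"),
]
toy_ancestral = {1: "G", 2: "C", 3: "T"}
counties, loci, matrix = derived_frequency_matrix(toy_samples, toy_ancestral)
print(counties, loci, matrix)
```

The resulting rows can be passed to a plotting routine such as matplotlib's `imshow`, with counties on the y-axis and loci on the x-axis, so that counties with similar rows show up as similar bands.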