12.7 Data

12.7.0.1 Metadata

The accessions data frame contains the GenBank IDs and full names of the coronavirus sequences we’re using:

accessions <- read.table("accessions.txt", header = TRUE, sep = "\t")
head(accessions)

##         id                                      name
## 1 DQ022305    DQ022305.2 Bat SARS coronavirus HKU3-1
## 2 DQ071615       DQ071615.1 Bat SARS coronavirus Rp3
## 3 DQ412043       DQ412043.1 Bat SARS coronavirus Rm1
## 4 JX993988  JX993988.1 Bat coronavirus Cp/Yunnan2011
## 5 FJ588686        FJ588686.1 Bat SARS CoV Rs672/2006
## 6 JX993987 JX993987.1 Bat coronavirus Rp/Shaanxi2011

# make vectors of the GenBank IDs and full names
# these will be used as input to functions later
ids <- accessions$id
names <- accessions$name
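
As a quick sanity check on the metadata, the sketch below confirms that each id is the accession at the start of the full name with its version suffix (such as .2) removed, and that no id is repeated. So that it runs on its own, it rebuilds the first three rows shown above instead of reading accessions.txt; acc_demo and accession_from_name are illustrative names, not part of the lab code.

```r
# Hypothetical stand-in for the first rows of accessions.txt shown above
acc_demo <- data.frame(
  id = c("DQ022305", "DQ071615", "DQ412043"),
  name = c(
    "DQ022305.2 Bat SARS coronavirus HKU3-1",
    "DQ071615.1 Bat SARS coronavirus Rp3",
    "DQ412043.1 Bat SARS coronavirus Rm1"
  ),
  stringsAsFactors = FALSE
)

# Keep the first space-delimited token of each name, then drop the ".N" version suffix
first_token <- sub(" .*$", "", acc_demo$name)
accession_from_name <- sub("\\.[0-9]+$", "", first_token)

# Every id should match its name, and no id should appear twice
ids_match <- all(accession_from_name == acc_demo$id)
ids_unique <- !any(duplicated(acc_demo$id))
print(ids_match)
print(ids_unique)
```

The same two checks could be run on the real accessions data frame by replacing acc_demo with accessions, assuming every name starts with a versioned accession as in the rows printed above.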

The SARS-CoV-2 sequence we’re using is MT093631 (MT093631.2 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/CHN/WH-09/2020).

12.7.0.2 DNA sequences

We’ve downloaded the genome sequences of these coronaviruses, aligned them, and saved the result in the aligned.fa FASTA file. Open the file to preview the start of the first coronavirus sequence:

>DQ022305
----------------------------------------GTTAGGTTTTTACCTACCCAGGAAA--AGCCAACCAACC-
TTGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAA------TCTGTGTGGCTGTCGCTCGGCTGCATGCCTAGCG
CACCTACGCAGTATAAATATTAAT-AACTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCT

FASTA format

The .fa extension indicates a FASTA file, a text-based format for representing DNA (or protein) sequences.

In a file that contains multiple sequences (like ours), the > character indicates the start of a new sequence and is usually followed by the sequence name.
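
To make the record structure concrete, here is a minimal base-R sketch that splits a multi-sequence FASTA file into named sequences using the > lines. It writes a tiny toy file to a temporary path instead of reading aligned.fa, and it assumes every header is followed by at least one sequence line; fasta_path, fa_lines and seqs are illustrative names. In practice, a dedicated package would usually do this reading for us.

```r
# Write a tiny two-record FASTA file to a temporary location (toy data, not aligned.fa)
fasta_path <- tempfile(fileext = ".fa")
writeLines(c(">seqA", "ACGT--AC", "GGTA", ">seqB", "ACGTTTAC", "GG-A"), fasta_path)

fa_lines <- readLines(fasta_path)
is_header <- startsWith(fa_lines, ">")

# Each header starts a new record; cumsum() labels every line with its record number
record <- cumsum(is_header)

# Join the sequence lines of each record into one string
seqs <- tapply(fa_lines[!is_header], record[!is_header], paste, collapse = "")

# Name each sequence by its header line, without the leading ">"
seqs <- setNames(as.vector(seqs), sub("^>", "", fa_lines[is_header]))

print(seqs)
print(nchar(seqs))
```

Because the toy sequences are aligned, both have the same length once their lines are joined, which is also what we would expect of every sequence in an aligned file.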

Why do the sequences have to be aligned?

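To construct a phylogeny, we compare how each site in the genome differs among coronavirus strains. Sequences need to be aligned so that we know we’re comparing the same site across sequences. The - characters in aligned.fa are gaps that the alignment inserts so that corresponding sites line up at the same position in every sequence.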
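
The idea of comparing the same site across sequences can be sketched with two toy aligned sequences. The code below puts each site in its own column, skips sites where either sequence has a gap ("-"), and counts how many of the remaining sites differ. The sequences and the names strainA, strainB and site_matrix are made up for illustration, and this simple count is not necessarily the distance measure used later to build the tree.

```r
# Two toy aligned sequences of equal length; "-" marks an alignment gap
aligned <- c(
  strainA = "ACGT--ACGGTA",
  strainB = "ACTTTTACGG-A"
)

# Split each sequence into single characters: one row per sequence, one column per site
site_matrix <- do.call(rbind, strsplit(aligned, ""))

# Keep only the sites where neither sequence has a gap
no_gap <- colSums(site_matrix == "-") == 0

# Count the gap-free sites, and how many of them differ between the two sequences
n_compared <- sum(no_gap)
n_different <- sum(site_matrix["strainA", no_gap] != site_matrix["strainB", no_gap])
print(c(compared = n_compared, different = n_different))
```

Without alignment, the two strings would have different lengths once gaps are removed, and position-by-position comparison would pair up unrelated sites.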