12.7 Data
12.7.0.1 Metadata
The accessions dataframe contains the GenBank IDs and full names of the coronavirus sequences we’re using:
## id name
## 1 DQ022305 DQ022305.2 Bat SARS coronavirus HKU3-1
## 2 DQ071615 DQ071615.1 Bat SARS coronavirus Rp3
## 3 DQ412043 DQ412043.1 Bat SARS coronavirus Rm1
## 4 JX993988 JX993988.1 Bat coronavirus Cp/Yunnan2011
## 5 FJ588686 FJ588686.1 Bat SARS CoV Rs672/2006
## 6 JX993987 JX993987.1 Bat coronavirus Rp/Shaanxi2011
# make vectors of the GenBank IDs and full names
# these will be used as input to functions later
ids <- accessions$id
names <- accessions$name
The SARS-CoV-2 sequence we’re using is MT093631 (MT093631.2 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/CHN/WH-09/2020).
12.7.0.2 DNA sequences
We’ve downloaded the genome sequences of these coronaviruses, aligned them, and saved the alignment in the FASTA file aligned.fa. Click on the file to preview the sequence of the first coronavirus:
>DQ022305
----------------------------------------GTTAGGTTTTTACCTACCCAGGAAA--AGCCAACCAACC-
TTGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAA------TCTGTGTGGCTGTCGCTCGGCTGCATGCCTAGCG
CACCTACGCAGTATAAATATTAAT-AACTTTACTGTCGTTGACAAGAAACGAGTAACTCGTCCCTCTTCTGCAGACTGCT
FASTA format
The .fa extension indicates a FASTA file, which is a text-based format for representing DNA (or protein) sequences. In a file that contains multiple sequences (like ours), the > character marks the start of a new sequence and is usually followed by the sequence name.
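To make the format concrete, here is a minimal base-R sketch of how a FASTA file could be parsed by hand. The function read_fasta_simple and the toy sequences are hypothetical and only for illustration; they are not part of this tutorial's code, and in practice a dedicated package would normally do the reading.

```r
# Illustrative only: a hypothetical helper, not the tutorial's own code.
# Takes the lines of a FASTA file and returns a named character vector
# with one element per sequence.
read_fasta_simple <- function(lines) {
  lines <- lines[nzchar(lines)]           # drop empty lines
  is_header <- startsWith(lines, ">")     # header lines start with ">"
  group <- cumsum(is_header)              # index of the sequence each line belongs to
  headers <- sub("^>", "", lines[is_header])

  # Concatenate the sequence lines that follow each header
  seqs <- vapply(
    split(lines[!is_header],
          factor(group[!is_header], levels = seq_along(headers))),
    paste, character(1), collapse = ""
  )
  setNames(seqs, headers)
}

# Toy input with two short made-up sequences, each wrapped over two lines
toy <- c(">seqA", "ACGT--AC", "GGTT",
         ">seqB", "ACGTTTAC", "GGTA")
toy_seqs <- read_fasta_simple(toy)
toy_seqs
```

The same function could be applied to the real alignment with read_fasta_simple(readLines("aligned.fa")), assuming the file is in the working directory. The - characters are kept as they are, since in an aligned file they mark gaps.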
Why do the sequences have to be aligned?
To construct a phylogeny, we compare how a site in the genome has changed in different coronavirus strains. Sequences need to be aligned so that we know we’re comparing the same site across sequences.
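As a rough illustration of what comparing the same site means, the sketch below takes three made-up aligned sequences and finds the sites where the strains differ. The strain names and sequences are invented for the example, and this is not the method used to build the phylogeny later on.

```r
# Illustrative only: three made-up aligned sequences ("-" marks a gap)
aligned <- c(
  strainA = "ACGT--ACGG",
  strainB = "ACGTTTACGA",
  strainC = "ACCTTTACGA"
)

# Aligned sequences all have the same length once gaps are included
stopifnot(length(unique(nchar(aligned))) == 1)

# One row per sequence, one column per alignment site
site_matrix <- do.call(rbind, strsplit(aligned, ""))

# A site is variable if more than one distinct nucleotide appears,
# ignoring gaps
is_variable <- apply(site_matrix, 2, function(site) {
  length(unique(site[site != "-"])) > 1
})

variable_sites <- unname(which(is_variable))
variable_sites  # sites 3 and 10
```

Because each column of site_matrix holds the same position from every sequence, a column-wise comparison like this is only meaningful after alignment. Without the gaps in strainA, every site after position 4 would be shifted and compared against the wrong position in the other strains.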