Be able to interpret SNP distance histograms.
Be able to build phylogenetic trees from snippy output.
Be able to interpret phylogenetic trees for epidemiological purposes.
Be familiar with various tree visualisation options.
Have a grasp on the relationship between trees and SNPs.
Inject 1 - In the aftermath of a massive earthquake in Haiti, the poorest country in the Americas, there has been a dramatic and worrying increase in the number of cases of cholera, the disease caused by the bacterium Vibrio cholera. This is the first time in more than 100 years that there has been a cholera outbreak in Haiti. Urgent efforts are underway to identify the source of the outbreak.
Fortunately, in addition to field epidemiologists carrying out shoe-leather investigations in the affected areas, we have WGS of some of the isolates from the outbreak. You are the lead bioinformatician in charge of analysing this data. You need to process and interpret the data, and feedback to the public health doctors in charge of outbreak response to provide them with the fullest picture possible.
From using Google, or your microbiology knowledge about cholera:
Question 1: Why might there be an increase in cholera cases following an event like an earthquake?
Question 2: What is unusual about the genome of V. cholera?
On your CLIMB instance:
Navigate to /home/ubuntu/data/day-4/outbreak_practical/dataset1
.
Run in this directory. This directory contains the alignment of all the high quality variable sites found in the core genome for 19 V. cholera genomes from Haiti and 21 other V. cholera genomes from around the world by snippy-core
.
ls -lh
disty dataset1.aln | python ~/scripts/process_disty_output.py | histogram.py -b 30 --percentage
The process_disty_output.py
script is available here. The histogram.py
script is from this package.
Question 3: How many processes are happening in this command?
The |
character is the unix command for ‘piping’ data between different processes.
Question 4: What does each stage of the process do?
Run the different processes of the command, building up to the full command.
Bonus question: If you know any python, can you figure out what each line of process_disty_output.py
is doing?
Question 5: What is the median pairwise SNP distance? How well do you think this describes the dataset?
Question 6: What does the histogram tell you about the relatedness of the samples in your dataset?
Question 7: Why are there 780 ‘samples’ in the output?
At this step, you will run IQ-TREE to generate a Maximum Likelihood tree. By default IQ-TREE will also test a wide range of nucleotide substitution models to see which best fits the data. For larger alignments this can be a very time consuming step, so if speed is important, it might make sense to omit this step and use a general model (e.g. GTR). You can find more information about how to run a specific model by looking at iqtree -h
.
Now run the below
iqtree -s dataset1.aln -nt AUTO
Question 8: Read the IQTREE output, which model was selected as the best? What is the full name of the model? I.e. what do the initials stand for?
The output of IQTREE is available in the file which ends .log
.
Question 9: How many threads were selected as best? Why might the maximum number of threads not be selected?
Download the tree (the file you want is the one which ends in .treefile
) to your local computer with Filezilla
. You should also download the geog_loc_microreact.csv
file. Change the file extension of the tree file to .nwk
.
Navigate to https://microreact.org/upload and upload the dataset1.aln.nwk
file and the geog_loc_microreact.csv
file. The microreact interface is hopefully quite intuitive, spend a few minutes becoming familiar with the various options and buttons. Microreact is an ideal way to investigate relationships between phylogeny and geography (aka phylogeographical relationships).
Question/task 10: Identify the Haiti outbreak genomes in the tree.
They are most of the non-black tips.
Question 11: Are the isolates from Haiti a monophyletic group?
Remember - a monophyletic group is a group which consists of all the descendents of a common ancestor.
Question 12: Is there phylogeographical signal in the V. cholerae genomes from within Haiti? You will have to zoom in on Haiti to see this. I.e. are the isolates from different regions of Haiti genetically distinct from each other?
You can think of this another way, looking at the data, if you were given another Haitian isolate with no geographical information, how confident would you be in your prediction of it’s geographic origin?
Question 13: Where is the closest neighbour of the Haiti genomes from?
Question 14: Write a short paragraph, summarising your results for the public health officials.
Inject 2 - Using the feedback we gave them on the likely origin of the V. cholera clade causing the outbreak, public health officials have traced the likely source of the cholera outbreak to the UN Stabilisation Mission in Haiti Camp. A contingent of Nepalese soldiers had recently arrived in camp, and cholera is endemic in Nepal. Sewage from the camp is discharged into the Artibonite River, where many people draw their drinking water from. This is an incredibly sensitive subject because, while identifying the true source is very important, if the source is mis-identified, it could lead to a major diplomatic incident.
Therefore, you have decided to search the public databases for all sequences from Nepal and Bangladesh and re-run the analysis, including these genomes. This will provide extra context for the Haitian outbreak and increase our certainty of what is happening.
On your CLIMB instance:
Navigate to /home/ubuntu/data/day-4/outbreak_practical/dataset2
.
Run ls -lh
in this directory. This directory contains the alignment of all the high quality variable sites found in the core genome for 19 V. cholerae genomes from Haiti and 101 other V. cholerae genomes from around the world by snippy-core
. When deciding which genomes to include, we have focussed on isolates from the Indian sub-continent.
Run IQ-TREE to generate a Maximum Likelihood tree
iqtree -s dataset2.aln -nt AUTO -bb 1000
Question 1: What is different between this command and the last time we ran iqtree
? What does this option mean?
cat dataset2.aln.treefile | python ~/scripts/midpoint_root_tree.py | nw_order -c n - > dataset2.aln.pretty.tree
The midpoint_root_tree.py
script is available here.
On your local machine, download FigTree from here, and install it.
Then download the dataset2.aln.pretty.tree
onto your local computer, along with annotation_figtree.txt
and annotation_phandango.csv
and open it with FigTree
.
In FigTree
, go to File -> Import annotations
in the menu bar, and navigate to the annotation_figtree.txt
file. Select this file and press open. Then in the main FigTree
window, go to Tip Labels
, and in the drop down Display
menu, select ANNOT
. You should now be able to see tip labels containing information about continent, country and year on the tree.
Question 2: SRR308716
was the most closely related non-Haitian genome to the Haiti outbreak strains in dataset1
. Is it still the most closely related genome?
Question 3: Do all the genomes from Nepal form a monophyletic group? What does this tell you about cholera in Nepal?
Question 4: What does the root of the tree represent?
Figtree
interface, tick the box which says ‘Node labels’, click the arrow to enable you to select ‘label’ from the dropdown Display
menu. This will place the bootstraps on the nodes of the tree.Question 5: Does adding in the bootstraps alter your interpretation at all?
Question 6: What is the impact of this extra context on your report to the public health doctors? Write a short paragraph reporting your new findings.
Extra task: Use the phandango
tool to visualise the meta-data in a more interesting way. Navigate to here. And drag and drop your tree and the annotation_phandango.csv
file onto the web page. Pressing k
will bring up the key. Further instructions are here.
Advanced question 1: Using the phylogenetic tree, estimate the SNP distance between SRR308691
and SRR308715
. A piece of paper might help you estimate the distance between the two tips. Remember you have to go from one tip, down to the MRCA, back to the other tip. Only ‘horizontal’ distance (if you are viewing the tree in rectangular format) counts. The other information you need to calculate this is the tree length, which is 0.029 (from IQTREE log) and the alignment length, which is 987 bp (calculated by bioawk -c fastx '{ print $name, length($seq) }' < dataset2.aln
).
Comes from this formula: (distance between nodes
/ total tree length
) * alignment length
Advanced question 2: How does this compare with disty
?
grep -A 1 SRR308691 dataset2.aln > tmp.fa; grep -A 1 SRR308715 dataset2.aln >> tmp.fa; disty tmp.fa