In this practical we will perform de novo assembly of a genome from Illumina reads. You will be assembling a strain of Staphyloccocus aureus.
We will use SPAdes for this activity. SPAdes is one of a number of de novo assemblers that use short read sets as an input and the method for assembly is based on de Bruijn graphs. For information about SPAdes see this link and there is a manual.
In this activity, we will perform a de novo assembly of a short read set (from an Illumina sequencer) using the SPAdes assembler. The output from SPAdes that we are interested in is a multi-fasta file that contains the draft genome sequence.
Navigate to the directory for the assembly activity:
/home/ubuntu/data/day-2/assembly-practical
You will see a pair of FASTQ files for a S. aureus strain. Notice that these FASTQ files have a slightly different filename which ends in .gz
. This means that the files are gzipped, which is a way to compress the files and make them smaller. To look inside the FASTQ files which are gzipped, you will need to use zless
to look inside a file, instead of using less
.
Question 1: Look at the fastq files. What is the sequence ID of the first FASTQ entry in the read 1 dataset?
The command zless S_aureus_1.fastq.gz
will open the file.
Now we will run spades
to assemble the genome:
spades.py -1 S_aureus_1.fastq.gz -2 S_aureus_2.fastq.gz -o S_aureus --careful -t 4
We will run spades with the --careful
setting which aims to minimise the number of errors.
Run the command spades.py
to get the Spades help message which will explain which option controls the number of threads.
Spades will take around 6 or 7 minutes to run so this is a good time to get up and get a cup of tea. When spades has finished, look at the output files which are produced.
Question 3: What version of spades
did we use for the assembly?
Look inside spades.log
: this is the log file with all the information about how spades ran and at the top it tells you the version. Or you can run spades.py -v
.
Now look at the assembled contigs file, contigs.fasta
. This is a fasta file containing multiple fasta entries, one for each assembled contig. As we know, the first line of a fasta entry starts with “>”. The rest of the line tells us information about the contig - the name of the contig, the length of the assembled sequence, and the read coverage of the contig.
Spades produces a file with contigs and a file with scaffolds. If Spades thinks that two contigs should be joined together, but does not know what the sequence between them should be or how long it is, it will join the two contigs together with a string of “NNNNNNN” for unknown sequence.
Now we will assess the quality of the contigs and scaffolds produced from Spades using a tool for assessing the quality of genome assemblies called Quast.
Run the following command:
quast -o quast -l spades-contigs,spades-scaffolds S_aureus/contigs.fasta S_aureus/scaffolds.fasta
Quast gives output both as text files, and as a report.html file which you can open in a web browser. Use FileZilla to move the quast report to your computer. This is the report.html file in the directory ~/data/day-2/assembly-practical/quast
.
Double-click on report.html
to open it in your browser. The report has an interactive table and a graph. Hover over each line in the table for an explanation.
Question 4: How many contigs (over 500bp) are there in the two assemblies? How long is the longest contig? How many Ns were added in the scaffolded genome? Which assembly do you think is better?
We can also look directly at the assembly graph which was used to produce the consensus assembly. The assembly graph represents the assembly of the genome as a series of nodes and edges, showing the possible overlaps and paths through the assembly graphs which have been resolved by the assembler. We will use a tool called Bandage for this.
First we need to download the assembly graph file. Right-click on the following link: assembly_graph.fastg and save the file on to your computer.
Open the program Bandage from the Windows menu. Go to File...
menu > Load Graph
and open the assembly_graph.fastg
file. Once it is loaded push the Draw graph
button in the Graph Drawing
menu.
Question 5: How many completely separate assembly graphs have been drawn? Why could you have multiple disconnected graphs?
Think about the DNA in the microbial cell and why it might be in multiple separate pieces.
We have two assembly graphs, which allows us to see that while we have more than 2 contigs, they came from 2 different DNA molecules which are not connected. To further investigate why these 2 different molecules are assembled as many more than 2 contigs, colour in your graph according to the depth of coverage of each node. To do this, change Random colours
to Colour by depth
in the menu under Graph Display
. The nodes are coloured from red to black, with higher coverage in red.
Question 6: What do you notice about the colour of the short nodes and the long nodes in the chromosome graph? Why might this be?
Question 7: Look at the plasmid graph (the smaller graph at the bottom of the screen). What colour are the nodes in this graph? Why might this be?
In the optional
directory, there are read sets and an assembly for two other organisms: Klebsiella pneumoniae and Mycobacterium tuberculosis.
Run quast
on the two assemblies. Remember to change the quast
output directory so you do not overwrite the results from Exercise 2.
Here is an example command line: quast -o exercise4 optional/K_pneumoniae.fasta optional/M_tuberculosis.fasta
Question 8: How many contigs are in the assemblies? How does this compare to our S. aureus assembly?