In this practical we will learn some basic techniques to check the quality of your sequencing dataset before we perform any further analysis, and fix some common problems.
The first program we will run is FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
This takes a BAM or FASTQ file as input. Here we will use the two FASTQ files from an isolate of Burkholderia pseudomallei.
In the terminal, log on to your CLIMB server and navigate to the directory /home/ubuntu/data/day-2/sequenceqc-practical
Use the cd
command to change directory.
Fastqc likes the output directory to exist before you run it, so make a directory called fastqc: mkdir fastqc
Now you can run fastqc:
fastqc --outdir fastqc reads_1.fastq reads_2.fastq
This will take a few minutes to run and give you a progress message while it runs.
Navigate to your output directory fastqc
and look at the files inside. You should see a .html and a .zip file named after each of your FASTQ files.
The .html file is designed to be viewed in a web browser. Move it to your desktop using Filezilla and then open them in your browser.
Question 1: What tests do your files pass? What do they fail? Think about what might cause them to fail. Which file has the best quality reads?
In exercise 1, you identified that the reads in this dataset are contaminated with Illumina sequencing adapters. We will use cutadapt
to remove the adapters from this dataset.
Run cutadapt with the following command line:
cutadapt -O 3 -G AGATCGGAAGAGC -g AGATCGGAAGAGC -o out_1.fastq -p out_2.fastq reads_1.fastq reads_2.fastq
This tells cutadapt to trim both reads in our paired-end read dataset together. The -G and -g tell it which adapter should be removed from the 3’ end of the forward and reverse reads respectively. In this case it is the sequence of the Illumina sequencing adapter which should be removed.
Question 2: How many reads had adapters trimmed from the ends?
Question 3: What command would you run to run fastqc on the trimmed output?
Take your fastqc command line above and change the read files to out_1.fastq and out_2.fastq. Remember you need to make the output directory before you run fastqc.
To investigate GC bias in our sequencing we can use a tool called Picard. Picard contains many different useful tools, but the one we are using today is the CollectGcMetrics tool. This takes a BAM files and the reference genome for your organism and gives you a summary of the GC bias in your sequencing, and whether regions of high and low GC are over- or underrepresented. We will look at GC bias in the reads we used for the first part of the practical, which are from Burkholderia pseudomallei.
Run picard CollectGcBiasMetrics on the file reads.bam
using the reference genome E264.fasta
:
picard CollectGcBiasMetrics REFERENCE_SEQUENCE=E264.fasta INPUT=reads.bam OUTPUT=gcbias.out CHART=gcbias.pdf SUMMARY_OUTPUT=gcbias.summary
gcbias.summary
.
Use the less
command to open a text file.
In the last two lines of the file, you will see the column headings and the results for this analysis. Look at the columns for GC bias in different GC windows (GC_NC_0_19 GC_NC_20_39 GC_NC_40_59 GC_NC_60_79 GC_NC_80_100), which show GC bias scores for windows with GC content of 0-19%, 20-39%, 40-59%, 60-79%, and 80-100%.
Question 4: What are the numbers in these columns? Which GC windows are overrepresented (a score above 1)? Which are underrepresented (a score below 1)?
There is also a graphical output from the CollectGCBiasMetrics tool. Move it to your desktop using Filezilla and open it. This shows the coverage in each GC window of the genome as open blue circles. Above 1 the GC window is overrepresented, below 1 they are underrepresented.
If you have finished the exercise in plenty of time, here is an optional exercise for you. Look at the cutadapt command line above:
cutadapt -O 3 -G AGATCGGAAGAGC -g AGATCGGAAGAGC -o out_1.fastq -p out_2.fastq reads_1.fastq reads_2.fastq
Run cutadapt --help
to see the help file.
Question 6: Try running cutadapt
with a different -O
value. What happens if you vary the -O parameter? Why?
Question 7: How would you modify the command to remove low-quality bases from the end of your reads?