Learning objectives

In this practical we will learn some basic techniques to check the quality of your sequencing dataset before we perform any further analysis, and fix some common problems.

  1. Use FastQC to assess the quality of your data and understand how problems arise
  2. Use trimming tools to remove problem bases
  3. Use Picard to analyse GC bias in your data

Exercise 1

The first program we will run is FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

This takes a BAM or FASTQ file as input. Here we will use the two FASTQ files from an isolate of Burkholderia pseudomallei.

In the terminal, log on to your CLIMB server and navigate to the directory /home/ubuntu/data/day-2/sequenceqc-practical

Use the cd command to change directory.


Fastqc likes the output directory to exist before you run it, so make a directory called fastqc: mkdir fastqc

Now you can run fastqc:

fastqc --outdir fastqc reads_1.fastq reads_2.fastq

This will take a few minutes to run and give you a progress message while it runs.

Navigate to your output directory fastqc and look at the files inside. You should see a .html and a .zip file named after each of your FASTQ files.

The .html file is designed to be viewed in a web browser. Move it to your desktop using Filezilla and then open them in your browser.

Question 1: What tests do your files pass? What do they fail? Think about what might cause them to fail. Which file has the best quality reads?

Exercise 2

In exercise 1, you identified that the reads in this dataset are contaminated with Illumina sequencing adapters. We will use cutadapt to remove the adapters from this dataset.

Run cutadapt with the following command line:

cutadapt -O 3 -G AGATCGGAAGAGC -g AGATCGGAAGAGC -o out_1.fastq -p out_2.fastq reads_1.fastq reads_2.fastq

This tells cutadapt to trim both reads in our paired-end read dataset together. The -G and -g tell it which adapter should be removed from the 3’ end of the forward and reverse reads respectively. In this case it is the sequence of the Illumina sequencing adapter which should be removed.

Question 2: How many reads had adapters trimmed from the ends?

Question 3: What command would you run to run fastqc on the trimmed output?

Take your fastqc command line above and change the read files to out_1.fastq and out_2.fastq. Remember you need to make the output directory before you run fastqc.


Exercise 3

To investigate GC bias in our sequencing we can use a tool called Picard. Picard contains many different useful tools, but the one we are using today is the CollectGcMetrics tool. This takes a BAM files and the reference genome for your organism and gives you a summary of the GC bias in your sequencing, and whether regions of high and low GC are over- or underrepresented. We will look at GC bias in the reads we used for the first part of the practical, which are from Burkholderia pseudomallei.

Run picard CollectGcBiasMetrics on the file reads.bam using the reference genome E264.fasta:

picard CollectGcBiasMetrics REFERENCE_SEQUENCE=E264.fasta INPUT=reads.bam OUTPUT=gcbias.out CHART=gcbias.pdf SUMMARY_OUTPUT=gcbias.summary
Look at the summary file gcbias.summary.

Use the less command to open a text file.


In the last two lines of the file, you will see the column headings and the results for this analysis. Look at the columns for GC bias in different GC windows (GC_NC_0_19 GC_NC_20_39 GC_NC_40_59 GC_NC_60_79 GC_NC_80_100), which show GC bias scores for windows with GC content of 0-19%, 20-39%, 40-59%, 60-79%, and 80-100%.

Question 4: What are the numbers in these columns? Which GC windows are overrepresented (a score above 1)? Which are underrepresented (a score below 1)?

There is also a graphical output from the CollectGCBiasMetrics tool. Move it to your desktop using Filezilla and open it. This shows the coverage in each GC window of the genome as open blue circles. Above 1 the GC window is overrepresented, below 1 they are underrepresented.


Optional Exercise 4

If you have finished the exercise in plenty of time, here is an optional exercise for you. Look at the cutadapt command line above:

cutadapt  -O 3 -G AGATCGGAAGAGC -g AGATCGGAAGAGC -o out_1.fastq -p out_2.fastq reads_1.fastq reads_2.fastq
Question 5: what is the -O option doing in the command line above?

Run cutadapt --help to see the help file.


Question 6: Try running cutadapt with a different -O value. What happens if you vary the -O parameter? Why?

Question 7: How would you modify the command to remove low-quality bases from the end of your reads?