Learning objectives

To become familiar with:

The mechanics of running BLAST searches
How to interpret BLAST results
Ways in which homology can complicate the interpretation of BLAST results
How BLAST, combined with the use of the extra information in graph representation of genome assemblies can aid in the localisation of AMR determinants to genomic elements

Exercise 1

In this exercise you will be given 6 assembled Salmonella genomes. You need to identify which serotype they are based on the results of BLAST analysis against a custom database of Salmonella antigen genes, and cross checking the results with (a shortened version of) the Kauffman-White scheme.

In the terminal, navigate to the /home/ubuntu/data/day-3/blast-practical directory, you should see 6 Salmonella reference genomes (fasta format) labelled Salm_1 to Salm_6 and a salmonella_antigens.fasta file. In order to run BLAST, we need a query and a database. In this case, the database is going to be the salmonella_antigens.fasta file, and the queries are going to be the 6 Salmonella assemblies. For simplicity, we will process them one at a time.
First we need to process the FASTA which will be our database so that it is compatible with BLAST. For this we use the makeblastdb program. -

makeblastdb -dbtype nucl -in salmonella_antigens.fasta

Question 1: How many sequences are there in the salmonella_antigens.fasta file? What are their names?

Since we know that each fasta record has a header, and each header starts with a ‘>’. We can use grep ‘>’ salmonella_antigens.fasta to show all the sequence headers in the file. CAREFUL - make sure to include the quotation marks around the > in that command, or you will overwrite the contents of salmonella_antigens.fasta.

3.1. Now, before we BLAST our query against the database we just made, we should decide which output format to use. Information on the output formats is available via the -help option. Note that we use blastn here as we are comparing nucleotides. Run:

blastn -help

Question 2: Look at the *** Formatting options section of the blastn -help output. We are going to load our data into Excel; which do you think is the most suitable -outfmt option to use?

If you aren’t sure which is the best format, move onto section 3.2 and experiment with different -outfmt options.

3.2. Now we are going to run our first BLAST. Enter the below command, where XXX is the outfmt option you chose in section 3.1

blastn -query Salm_1.fasta -db salmonella_antigens.fasta -outfmt XXX -out Salm_1_vs_salmonella_antigens.tsv

3.3. Then, examine the output file using head.

head Salm_1_vs_salmonella_antigens.tsv

Question 3: How many columns does the output have?

Question 4: How many lines of the file has head shown you? How do you only show the top 4 lines of the file?

Run head --help for more info.

Question 5: What is separating the columns?

What does the T in tsv stand for?

Question 6: What happens if you run the command without the -outfmt option, or with different -outfmt options?

Run the command with different outfmt options and look at the output file using head or cat.

3.4. Run blastn for each genome against the salmonella_antigens.fasta database. Remember to change the name of the output file to contain the name of the input genome assembly. I.e. the results for Salm_1 should contain the name salm1.

3.5. Make sure that each of your output files (e.g. Salm_1_vs_salmonella_antigens.tsv) is in the best format for viewing in Excel before proceeding.

Question 7: What kind of blast would we do if we were comparing proteins against Salm_1.fasta? What kind of BLAST would we do if we were comparing salmonella_antigens.fasta against Salm_1.fasta?

Now, we are going to look at the BLAST output using Excel. For this, the output files (e.g. Salm_1_vs_salmonella_antigens.tsv) need to be on your local computer rather than your CLIMB instance.

4.1. Use Filezilla to move the output files onto your local computer and open them one at a time using Excel.

You also need to download the Salmonella serotypes.xlsx file to your local computer.

4.2. You might notice that blastn does not include column names in the output, these are the column headings, paste them into row 1 in Excel.

Columns
query_id
subject_id
pct_identity
aln_length
n_of_mismatches
gap_openings
q_start
q_end
s_start
s_end
e_value
bit_score

If you are having trouble with changing the orientation of the headers, use Paste Special -> transpose in Excel.

Question 8: What do the different column headings mean?

Use Google to find out. For example, here or here

Question 9: How many BLAST hits are there for Salm_1 vs `salmonella_antigens?

Question 10: Look at the help for blastn, can you figure out some additional command line parameters you could enter which would reduce the number of results you have to look at? What are the potential downsides of applying these additional parameters?

Question 11: Which is the best column to sort by so that you see the best hits first? Sort the data by this column.

Question 12: There are multiple different alleles of fljB and fliC in the BLAST database. Look at your BLAST results to identify the best fljB and fliC hits for each genome, and then look at the different serotypes in the Salmonella Serotypes excel file to identify which serotype has that combination of antigenic genes. Do this for all 6 genomes.

Reminder: fliC is H1 and fljB is H2.

One of the genomes might not have both H1 and H2.

Take the BLAST results for Salm_6. Using Excel, filter the BLAST results so that only hits where the subject was either the ‘best’ fljB or ‘best’ fliC. You should have 6 blast hits.

Question 13: Why are there 6 results?

Look at the query start and query end values for the fljB and fliC hits.

There is an area of sequence similarity between fljB and fliC.

Question 14: What is another way you could tell which serotype the genomes are? The above method, looking at the genes which determine the serotype could be described as the ‘biologist’ way of serotyping from the genome, what might the ‘computer scientist/bioinformatician’ way be?

If a biologist uses the genes underlying a ‘traditional’ microbiological property like antigenic structure to identify the serotype, what might someone who knows nothing about the genes underlying the phenotype do?

Exercise 2

In exercise 2, you will use BLAST within the Bandage genome graph visualiser to determine whether antibiotic resistance determinants are carried on plasmids or chromosomes.

You will start by carrying out some basic data exploration on the command line of your remote instance. Locate the files in /home/ubuntu/data/day-3/blast-practical. Below is the output of the following command bioawk -c fastx '{ print $name, length($seq) }' < genes_and_plasmids.reduced.fasta

bioawk output

Question 1: What are the longest and shortest sequences in genes_and_plasmids.fasta?

Use google to tell you how to do this. this answer works well

Question 2: What are the differences between the sequences which have a name beginning with ‘p’ (i.e. lowercase p) and those which don’t? This is a question about your knowledge of microbiology naming conventions.

Question 3: Either using Google or your microbiology knowledge, which AMR genes in the file are typically plasmid-borne and which are typically chromosomal?

Use grep ‘>’ genes_and_plasmids.fasta to find the names of the sequences in the file.

Using filezilla, download the 3 FASTG files and genes_and_plasmids.fasta from /home/ubuntu/data/day-3/blast-practical to your local computer
Load the first sample into Bandage following the instructions from the Bandage github page

Question 4: What are the characteristics of this assembly in terms of the length, the N50, etc?

Question 5: Do you think the assembly used an appropriate range of k-mer sizes? What other datasets do we need to get a better answer to this question?

Read this

Question 6: Can you see any sequences which look like they could be plasmids?

Prepare the FASTG file for BLAST analysis by following the instructions from here

You need to install BLAST so that it can be used by Bandage. For this, download BLAST from here, and place it in the same directory as the Bandage executable. This will probably be in the download directory.

BLAST the genes_and_plasmids.fasta against the genome loaded into Bandage. You can reduce the view to include only blast hits by changing the ‘scope’ drop down in the ‘graph drawing’ section, select ‘around blast hits’, change the distance to e.g. 2 or 3, and click draw graph. The distance controls how many contigs will be included in the view. Setting distance to 3 means that all contigs which are within 2 or 3 ‘jumps’ of a contig with a blast hit will be included. Experiment with this parameter to see what effect having it at e.g. 0 and 10 has.
Inspect the genome graph and the BLAST hits, explore the different options Bandage has for ‘scope’, zoom in and out, Node labels, Font etc.

Question 7: For each genome, which plasmids and AMR genes are present in the sample?

It might be easier to view the BLAST results via the ‘Create/view BLAST search results’

Question 8: If a plasmid is present, what does the ‘depth’ node label (combined with the graph structure) tell you about repeat sections in the plasmid?

Question 9: In the assembly SRR5451282, what is interesting about the CTX-M15 encoding gene?

blast every contig linked to the contig which encodes ctx-M15, you can do this by copying the node sequence or by using BLAST directly through Bandage via the Output tab.

Bonus Question 10: why do you think OmpD is included in the genes_and_plasmids.fasta?

Bonus Exercise

If you have finsihed all the above exercises in good time, then feel free to have a go at this exercise, carrying out a Needleman-Wunsch alignment by hand. You can use paper or Excel, whichever you prefer.

Establish a scoring system - e.g. match = 1, mismatch = -1, gap = -1
You will be aligning two sequences GAATTAA and GATTACA. Create a scoring matrix like the below:
Fill in the table according to the instructions here
Trace back the arrows from the bottom right cell to the top left following the rules here
Check your answer by putting the sequences into this website

BLAST practical

Philip Ashton

28 May, 2019

Learning objectives

Exercise 1

Exercise 2

Bonus Exercise