Bioinformatics Tutorials - West African Centre for Cell Biology of Infectious Pathogens

Genome Assembly

What is Genome Assembly?

Genome assembly reconstructs longer contiguous sequences (known as contigs) from the short DNA or RNA fragments (reads) generated by sequencing platforms.
The process includes selecting seed sequences, extending and merging contigs, and closing gaps to achieve comprehensive genome coverage.
Parameters such as k-mer size and sequencing coverage, together with read quality control, play crucial roles. Several tools have been developed for genome assembly, and many of them, particularly those designed for short reads, are based on de Bruijn graphs.
Some tools work on long-read sequencing data (Flye, Canu, etc.), while others are suitable for short reads (SPAdes, Velvet, etc.).
Some tools can also perform hybrid assembly, which uses both long and short reads. One such tool is Unicycler.
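
To make the role of the k-mer size more concrete, here is a minimal sketch that lists every k-mer of a single sequence. The sequence and the value of k are made up for illustration; real assemblers do this across millions of reads and then build a graph from the overlaps between k-mers.

```shell
# Hypothetical toy read and k-mer size (not taken from the MRSA dataset).
seq="ATGGCGTGCA"
k=4

# Slide a window of width k along the sequence, one base at a time.
# A sequence of length L yields L - k + 1 k-mers.
kmers=$(awk -v s="$seq" -v k="$k" 'BEGIN {
  for (i = 1; i <= length(s) - k + 1; i++) print substr(s, i, k)
}')

echo "$kmers"
```

A larger k gives fewer k-mers per read, each of which is more specific; a smaller k gives more k-mers per read, each of which is more likely to also occur elsewhere in the genome.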

De novo Assembly of Short Reads

In this module we are going to assemble the genome of a methicillin-resistant Staphylococcus aureus (MRSA) strain using SPAdes. The input data are found in the directory DATASETS/MRSA_ILLUMINA_TRIMMED. The commands below download the raw reads from the ENA and trim them with fastp.

mkdir GenomeAssembly

cd GenomeAssembly

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR187/DRR187559/DRR187559_1.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/DRR187/DRR187559/DRR187559_2.fastq.gz

fastp -i DRR187559_1.fastq.gz -I DRR187559_2.fastq.gz -l 30 --cut_front --cut_tail -W 4 -M 20 -o DRR187559_1.trimmed.fastq.gz -O DRR187559_2.trimmed.fastq.gz --html DRR187559.html --json DRR187559.json

Now, let's run SPAdes.

spades.py --careful -t 10 -1 DRR187559_1.trimmed.fastq.gz -2 DRR187559_2.trimmed.fastq.gz -o DRR187559

-1: forward reads (read 1).
-2: reverse reads (read 2).
-o: output directory.
-t: number of threads to use.
--careful: tries to reduce the number of mismatches and short indels.

Inspecting the Output

The outputs will be found in a directory called DRR187559. The assembled sequences, called contigs, are saved in the 'contigs.fasta' file, which is what we will focus on in this course.

Questions

1. What is the format for the contigs.fasta file?

2. How many contigs have been assembled? (hint: search for lines beginning with '>')
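
The hint in question 2 can be sketched as follows. So that the example runs without a finished assembly, it builds a tiny stand-in FASTA file with made-up content; for the real data you would point grep at DRR187559/contigs.fasta instead.

```shell
# Build a tiny stand-in FASTA file (hypothetical content, not real SPAdes output).
toy=$(mktemp)
printf '>contig_1\nATGCGT\nACGT\n>contig_2\nGGCC\n>contig_3\nTTAA\n' > "$toy"

# Every FASTA record starts with a header line beginning with '>',
# so counting those lines counts the sequences in the file.
n_contigs=$(grep -c '^>' "$toy")
echo "contigs: $n_contigs"

rm -f "$toy"
```

Anchoring the pattern with '^' ensures that only lines starting with '>' are counted.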

Evaluating Genome Assemblies using QUAST

quast.py -o QUAST_DRR187559 DRR187559/contigs.fasta

Inspecting the QUAST report

The QUAST output will be found in the directory 'QUAST_DRR187559'. An HTML page called 'report.html' contains all the computed assembly statistics.
Let's open it in a web browser.

The QUAST report includes the following statistics:

Contigs: The total number of contigs in the assembly (by default, QUAST counts only contigs of at least 500 bp).
Largest contig: The length of the largest contig in the assembly.
Total length: The total number of bases in the assembly.
GC content: Percentage of bases which are either guanine or cytosine.
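
To see how these statistics relate to the FASTA file itself, the following sketch computes them with awk on a tiny made-up FASTA file. It is an illustration only, not QUAST's implementation: it applies no minimum contig length and counts every character on a sequence line as a base, so its numbers can differ from the QUAST report on real data.

```shell
# Tiny stand-in FASTA file (hypothetical content, not real SPAdes output).
toy=$(mktemp)
printf '>contig_1\nATGCGT\nACGT\n>contig_2\nGGCC\n>contig_3\nTTAA\n' > "$toy"

stats=$(LC_ALL=C awk '
  # Close the current contig: update the count, total length and maximum.
  function flush_contig() {
    if (len > 0) { n++; total += len; if (len > max) max = len }
    len = 0
  }
  /^>/ { flush_contig(); next }      # a header line starts a new contig
  {
    seq = toupper($0)
    len += length(seq)               # sequences may span several lines
    gc  += gsub(/[GC]/, "", seq)     # gsub returns the number of G/C bases
  }
  END {
    flush_contig()
    printf "contigs=%d total=%d largest=%d gc=%.2f\n", n, total, max, 100 * gc / total
  }' "$toy")

echo "$stats"
rm -f "$toy"
```

For the toy file this prints contigs=3 total=18 largest=10 gc=50.00.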

Questions

1. How many contigs have a length >=1000bp?

2. What is the length of the largest contig?

3. What is the total length of the assembled genome? Is it similar to the Staphylococcus aureus reference genome? (See here)

4. What is the N50 of our assembled genome?

5. What is the GC content for our assembled genome?
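
Question 4 refers to N50, which is not defined in the list above. N50 is the length of the shortest contig among the longest contigs that together cover at least half of the total assembly length. The sketch below works through that definition on a made-up list of contig lengths; QUAST reports the value directly, so this is only meant to show the arithmetic.

```shell
# Hypothetical contig lengths in bp (not from the MRSA assembly).
lengths="40 80 20 70 30 50"

# Sort from longest to shortest, then accumulate lengths until the
# running sum reaches at least half of the total (290 / 2 = 145 here).
n50=$(printf '%s\n' $lengths | sort -rn | awk '
  { len[NR] = $1; total += $1 }
  END {
    for (i = 1; i <= NR; i++) {
      run += len[i]
      if (run * 2 >= total) { print len[i]; exit }
    }
  }')

echo "N50: $n50"
```

Here the two longest contigs (80 + 70 = 150) already cover half of the 290 bp total, so the N50 is 70.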

Evaluating Genome Assemblies using BUSCO

BUSCO (Benchmarking Universal Single-Copy Orthologs) provides a quantitative assessment of genome assembly completeness based on the presence or absence of highly conserved genes in the assembly. During the analysis, BUSCO generates a score that can be used as a measure of assembly quality.

Let's run BUSCO on our assembly.

busco -m genome -i DRR187559/contigs.fasta -l bacillales_odb10 -o BUSCO_DRR187559

-m: analysis mode. Here we use genome.
-i: input sequence file.
-l: lineage.
-o: output directory.

Examine the BUSCO Output

When BUSCO completes the analysis, a summary report is printed on the screen.
A detailed report is placed in the specified output directory. Feel free to examine the files.
