Classify reads

Reads can be classified taxonomically using k-mer-based methods. Two popular programs for this purpose are Centrifuge and Kraken2. The workflow can run both tools on all samples specified in your sample annotation file and produce interactive Krona plots as well as Kraken-style reports.

Database construction for Kraken2 and Centrifuge is not part of this workflow; please refer to the respective documentation for these tools (Kraken manual, Centrifuge manual). However, both tools offer pre-built indices, which saves you the trouble of downloading the source sequences and building the indices on your own.

Individual Krona plots are created in results/centrifuge/ and results/kraken/. Combined Krona plots are created in results/report/centrifuge and results/report/kraken. The Kraken-style *.kreport files created within these directories can be loaded into the tool Pavian. Either install the pavian R package or run it using Docker (see instructions at the GitHub repo).
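For example, a minimal sketch of running Pavian via Docker, assuming the florianbw/pavian image published on Docker Hub by the Pavian author:

docker run --rm -p 5000:80 florianbw/pavian

You can then point your browser to http://127.0.0.1:5000 and upload the *.kreport files.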

Centrifuge

Relevant settings for Centrifuge are:

centrifuge_prebuilt: By default, the workflow uses the prebuilt p+h+v index containing prokaryotic, viral and human sequences, which it will download from the Centrifuge FTP server. Other valid choices are: p_compressed+h+v, p_compressed_2018_4_15, nt_2018_2_12 and nt_2018_3_3. The index will be downloaded to resources/classify_db/centrifuge/.

centrifuge_custom: If you have created your own custom Centrifuge index, specify its path here, excluding the .*.cf suffix.

centrifuge_summarize_rank: The centrifuge index can be summarized to show the number of sequences and the total size in bp for each taxon at a given rank, specified by this config parameter (superkingdom by default). To generate the summary file, run:

snakemake --config centrifuge=True centrifuge_summary

The summary output will be placed in the same directory as the centrifuge index (resources/classify_db/centrifuge if using the prebuilt index) in a file with the suffix ‘.summary.tab’.
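Assuming the prebuilt p+h+v index, and assuming the summary file is named after the index basename (check the actual file name in the directory), you could then inspect it with e.g.:

head resources/classify_db/centrifuge/p+h+v.summary.tab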

centrifuge_max_assignments: This determines how many assignments are made per read. In this workflow the parameter is set to 1 by default, which means Centrifuge runs in ‘LCA’ mode, similar to Kraken, where a read is assigned only to the lowest common ancestor of all its hits. Centrifuge’s own default is 5.

centrifuge_min_score: This score threshold is used to filter out classifications when creating the Kraken-style report. Note that this also influences the numbers shown in the Krona plots.
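These settings can also be overridden on the command line with --config, mirroring the example runs below; the parameter values here are purely illustrative:

snakemake --configfile examples/centrifuge/config.yaml --config centrifuge_max_assignments=1 centrifuge_min_score=75 -np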

You can try out the Centrifuge read classifier by using one of the supplied examples. To see the rules that would be run, enter:

snakemake --configfile examples/centrifuge/config.yaml -np

Now perform the actual run with 4 cores:

snakemake --configfile examples/centrifuge/config.yaml -p -j 4

Kraken2

Relevant settings for Kraken2 are:

kraken_prebuilt: By default, the workflow uses the prebuilt MiniKraken2_v2_8GB database, built from the RefSeq bacteria, archaea and viral libraries plus the GRCh38 human genome. Alternatively, you can use MiniKraken2_v1_8GB, which excludes the human genome. These MiniKraken databases are downsampled versions of the standard database, which could influence sensitivity; however, evaluations suggest that with Kraken2 the drop in sensitivity is minimal.

kraken_custom: If you have created your own custom Kraken2 database, specify the path to the directory containing the hash.k2d, opts.k2d and taxo.k2d files.

kraken_reduce_memory: Set to True to run Kraken2 with --memory-mapping, which reduces RAM usage by memory-mapping the database instead of loading it.
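As with Centrifuge, these settings can be overridden on the command line; for example, to turn on memory mapping for the example run below (illustrative):

snakemake --configfile examples/kraken/config.yaml --config kraken_reduce_memory=True -np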

Try out the Kraken2 read classifier using the supplied example. To see the rules that would be run, enter:

snakemake --configfile examples/kraken/config.yaml -np
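Then, as with the Centrifuge example, perform the actual run with 4 cores:

snakemake --configfile examples/kraken/config.yaml -p -j 4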

Filtering results

Sometimes you may see taxa pop up in your classification output that you are not sure are really present. K-mer-based methods such as Centrifuge and Kraken do not provide information, e.g. on genome coverage, that tools like bowtie2 or BLAST can give. However, for very large datasets and very large genome databases, using read aligners is not tractable due to extremely long running times.

This workflow can utilize the speed and low disk-space requirements of Centrifuge to obtain a set of ‘trusted’ genomes that are used for more detailed mapping analyses.

To try this out on the example above enter:

snakemake --configfile examples/centrifuge/config.yaml --config centrifuge_map_filtered=True -np -j 4

Because we set centrifuge_map_filtered=True, the output from Centrifuge is now filtered in a two-step procedure.

To perform the actual runs (again with 4 cores) do:

snakemake --configfile examples/centrifuge/config.yaml --config centrifuge_map_filtered=True -p -j 4

Below are explanations of the different filtering steps and the output produced.

Centrifuge filtering

First, taxids with at least centrifuge_min_read_count assigned reads are identified, and the corresponding genome sequences are extracted from the Centrifuge database. By default, centrifuge_min_read_count is set to 5000.
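This threshold can be changed at runtime; for example, to use a more permissive cutoff (the value here is illustrative):

snakemake --configfile examples/centrifuge/config.yaml --config centrifuge_map_filtered=True centrifuge_min_read_count=1000 -p -j 4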

Sourmash filtering

Next, genomes passing the first filter are passed to [sourmash](https://github.com/dib-lab/sourmash), which builds ‘MinHash’ signatures of the genomes. These signatures are essentially highly compressed representations of the DNA sequences. Signatures are also computed for each (preprocessed) sample, and these are then queried against the filtered genome signatures. Comparing the MinHash signatures gives an estimate of how much of a genome is covered by a sample (a rough command-line sketch follows the settings below). Settings which influence how this filtering step is performed are:

sourmash_fraction: The number of hashes to compute, as a fraction of the input k-mers. By default this is set to 100, meaning that 1/100 of the input k-mers are used to compute the MinHash signature. Increasing this setting will reduce the disk-space requirements but may also reduce performance.

sourmash_min_cov: The minimum coverage, estimated from the sourmash filtering step, that a genome must have in order to pass to the next step. By default this is set to 0.1, meaning that a genome must be covered to at least 10% in at least one of the samples.
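For reference, here is roughly what this step does, expressed with the standalone sourmash command line. This is a sketch only: it assumes sourmash v4 syntax, the k-mer size of 31 is an assumption, the file names are placeholders, and the workflow runs the equivalent steps internally.

# build a scaled MinHash signature for a candidate genome (scaled=100 matches sourmash_fraction=100)
sourmash sketch dna -p k=31,scaled=100 genome.fa -o genome.sig
# build a signature for the preprocessed sample reads
sourmash sketch dna -p k=31,scaled=100 sample_R1.fastq.gz -o sample.sig
# estimate how much of the genome's k-mer content is contained in the sample
sourmash search --containment genome.sig sample.sig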

Bowtie2 alignments

Finally, genomes that pass both filters are indexed using bowtie2 and the (preprocessed) reads are mapped to this set of genomes. The resulting bam-files are used to calculate coverage of genomes across samples.
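If you want a quick look at per-genome coverage in one of the resulting bam files yourself, outside the workflow, something like samtools can be used (this assumes samtools version 1.10 or later is installed; the bam path is a placeholder):

# print a per-reference table of depth and breadth of coverage
samtools coverage path/to/sample.bam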

Output

Under the results/examples/centrifuge_classify/report/ directory you will find a bowtie2 sub-folder with indexed alignment files in bam format, as well as two plots: Samplewise coverage, showing genome coverage in each sample, and Mean genome coverage, showing the average coverage per genome.

In addition, the bam files can be loaded directly into Pavian (see instructions here).

Note that you will have to increase the `maxUploadSize` of the Pavian app in order to load most bam files. If you installed Pavian via R, you can run the app as:

pavian::runApp(port=5000, maxUploadSize=500*1024^2)