How to run the workflow

Once you’ve modified the configuration file to your liking and created a sample_list for your samples the most basic way to run the workflow is with the command:

snakemake —-use-conda --configfile <myconfig.yaml> --cores 4

Here --cores specifies how many CPU cores to use in parallell.

Useful snakemake options

Other useful options that you can specify to snakemake on the command line include:

  • -r or --reason: print the reason for execution of each rule
  • -p or --printshellcmds: print shell commands that will be executed
  • --nolock: do not lock the working directory
  • --unlock: unlock working directory
  • --ri or --rerun-incomplete: re-run all jobs where output may be incomplete
  • --jobs [N] or -j [N]: Use at most N CPU cores/jobs in parallel. If -j is used with the SLURM profile to submit jobs in a compute cluster infrastructure, -j specifies the maximum number of jobs to submit to the queue.

For a full list of snakemake command line options, see here.

Targets

If you don’t want to run the full workflow from start to finish in one go you may specify one or several ‘targets’ on the commandline. These targets correspond to rules in the workflow that effectively work as intermediary checkpoints, allowing you to break up your workflow runs into smaller parts. This can be useful if you are trying out different parameters or want to get a better overview of the output. Useful targets are:

  • qc: Runs preprocessing (as specified in the configuration file) and generates a sample_report.html
  • assemble: Assembles samples according to sample list and config file, and generates some assembly statistics
  • quantify: Quantifies open reading frames called on assembled contigs, producing RPKM-normalized and raw counts
  • annotate: Annotates open reading frames called on assembled contigs using settings defined in the config file. Also quantifies genes and features, producing normalized and raw counts
  • taxonomy: Assigns taxonomy to assembled contigs and open reading frames called on those contigs
  • bin: Performs genome binning of assembled contigs and generates some statistics of the binned genomes
  • classify: Read-based (e.g. Kraken/Centrifuge/MetaPhlAn) classification of preprocessed reads

To use these targets add them to the snakemake command line call. For instance, to run only the preprocessing part:

snakemake --use-conda --configfile config.yaml -j 4 qc

Targets may also be combined, so if you want to generate assemblies and run read-based classification you can do:

snakemake --use-conda --configfile config.yaml -j 4 assemble classify

Reports

After the workflow has completed you can generate a report with summarized statistics of the run. Depending on the run, the report will also include links to output files produced (_e.g._ tables, plots and html files). To produce a report, run:

snakemake --report report.html

Warning

When generating the report you must call snakemake the same way you did when you ran the workflow itself otherwise snakemake will report a WorkflowError: because the expected output is not present.

As an example, say you have a config file config.yaml specifying to run preprocessing and assembly of your samples and you run the workflow as such:

snakemake --use-conda -j 4 --configfile config.yaml

When the workflow is finished you can then generate a report by running:

snakemake --use-conda -j 4 --configfile config.yaml --report report.html

Examples

Here are a few examples of how to run the workflow. They are written in a structure showing the relevant configuration parameters, the command(s) to run and the expected output. All examples assume you have a configuration file called config.yaml with the appropriate parameters, but you may of course use any config file name you want. A suggestion is to make a copy of the default config file and make your changes in the copy.

Assembly-based analysis

Assemble reads with Megahit

Configuration

assembly:
# run Megahit assembler?
megahit: True
# Use Metaspades instead of Megahit for assembly?
metaspades: False

megahit:
  # maximum threads for megahit
  threads: 20
  # keep intermediate contigs from Megahit?
  keep_intermediate: False
  # extra settings passed to Megahit
  extra_settings: "--min-contig-len 300 --prune-level 3"

Command:

snakemake --use-conda --configfile config.yaml -j 4 -p assemble

Output

results
|- assembly/
|  |- <assembly1>/final_contigs.fa   the fasta file with assembled contigs
|  |- ...
|  |- <assemblyN>/final_contigs.fa
|- report/
|  |- assembly/
|  |  |- assembly_stats.txt          table of assembly statistics
|  |  |- assembly_size_dist.txt      file with sizes of assemblies contained at different contig lengths
|  |  |- assembly_stats.pdf          a plot of general assembly statistics
|  |  |- assembly_size_dist.pdf      a plot of the size distribution of the assembly
|  |  |- alignment_frequency.pdf     a plot of the overall alignment frequency after mapping reads to assembled contigs

To use the Metaspades assembler, simply change your config file to:

assembly:
    metaspades: True
    megahit: False

Protein-level annotations

Open reading frames called on assembled contigs can be annotated using eggnog-mapper, pfam_scan and rgi (Resistance Gene Identifier). If you are running the workflow on the Uppmax compute cluster you can use centrally installed databases for the first two of these, see more under the section Running the workflow on Uppmax.

Configuration

Using these settings in your config file runs all three tools to annotate protein sequences in your assemblies.

annotation:
  # run eggnog-mapper to infer KEGG orthologs, pathways and modules?
  eggnog: True
  # run PFAM-scan to infer protein families from PFAM?
  pfam: True
  # run Resistance gene identifier?
  rgi: True

Command:

snakemake --use-conda --configfile config.yaml -j 4 -p annotate

Read-based analysis

Metaphlan

The workflow runs the recently released version 3 of Metaphlan. MetaPhlAn aligns reads to a set of core marker genes and estimates abundances of taxonomic clades in your samples.

Configuration

classification:
    metaphlan: True

Command:

snakemake --use-conda --configfile config.yaml -j 4 -p classify

Output

results
|- metaphlan/               raw, per sample output from metaphlan
|
|- report/
|- metaphlan/
|  |- metaphlan.tsv         clade relative abundances per sample
|  |- metaphlan.pdf         clustermap of relative abundance summed to <metaphlan_plot_rank>
|  |- metaphlan.html        Krona interactive plot (Linux only)

Kraken2

There are pre-built kraken databases available at https://benlangmead.github.io/aws-indexes/k2. To make use of e.g. the Greengenes prebuilt database, copy its HTTPS url and edit your config file to contain:

kraken:
  standard_db: False
  prebuilt: "16S_Greengenes"
  prebuilt_url: "<HTTPS url>" # <- Add the URL here

Configuration

classification:
    kraken: True

Command:

snakemake --use-conda --configfile config.yaml -j 4 -p classify

Output

results
|- kraken/                  raw, per sample output from kraken2
|
|- report/
|- kraken/
|  |- kraken.krona.html     Krona interactive plot (Linux only)