Running the workflow on Uppmax
If you will be running the workflow on the Uppmax HPC clusters, here are some helpful tips.
Setting up database files
All resources required to run this workflow are automatically downloaded as needed when the workflow is executed. However, some of these (often large) files are already installed in a central location on the system at /sw/data. This means that you can reuse them for your workflow runs, saving time and reducing overall disk usage on the system.
First create a resources/ directory inside the directory where you will be running the workflow:
mkdir resources
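Each database you link in the sections below gets its own sub-directory here. As an illustration, if you follow the eggNOG, Pfam and standard Kraken sections, the layout would end up looking something like this:
resources/
├── eggnog-mapper/
├── pfam/
└── kraken/
    └── standard/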
eggNOG database
The 5.0 version of the eggNOG database is installed in a central location on Uppmax under /sw/data/eggNOG/5.0. To make use of this database, simply symlink the files in that directory into resources/eggnog-mapper:
mkdir resources/eggnog-mapper
ln -s /sw/data/eggNOG/5.0/* resources/eggnog-mapper
head -1 /sw/data/eggNOG/eggNOG_data-5.0_install-README.md > resources/eggnog-mapper/eggnog.version
touch resources/eggnog-mapper/download.log
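To sanity-check the setup you can verify that the links resolve and that the version file has content:
ls -lL resources/eggnog-mapper/
cat resources/eggnog-mapper/eggnog.version
ls -lL dereferences the symlinks, so it will report an error if a link target is missing.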
Pfam database
To use the Pfam database from the central location, create a pfam sub-directory under resources/ and link the necessary files:
mkdir resources/pfam
ln -s /sw/data/Pfam/31.0/Pfam-A.hmm* resources/pfam/
cat /sw/data/Pfam/31.0/Pfam.version > resources/pfam/Pfam-A.version
This links the files for release 31.0. Check the directories under /sw/data/Pfam/ to see the available release versions.
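To see which releases are available, list the central directory and substitute your chosen version for 31.0 in the commands above:
ls /sw/data/Pfam/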
Kraken database
For kraken there are a number of databases installed under /sw/data/Kraken2. Snapshots of the standard, nt, rdp, silva and greengenes indices are installed on a monthly basis. To use the latest version of the standard index, do the following:
1. Create a sub-directory and link the index files:
mkdir -p resources/kraken/standard
ln -s /sw/data/Kraken2/latest/*.k2d resources/kraken/standard/
2. From a reproducibility perspective, it's essential to keep track of when the index was created, so generate a version file inside your kraken directory by running the following (see the note after this list for how this works):
file /sw/data/Kraken2/latest | egrep -o "[0-9]{8}\-[0-9]{6}" > resources/kraken/standard/kraken.version
3. Modify your config file so that it contains:
classification:
  kraken: True
kraken:
  # generate the standard kraken database?
  standard_db: True
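A note on latest: it is a link to a dated snapshot directory, which is why the version file command in step 2 extracts a timestamp from the output of file. If you would rather pin your run to a specific snapshot than follow latest, list the available snapshots and link the dated directory directly (the timestamp below is a placeholder):
ls /sw/data/Kraken2/
ln -s /sw/data/Kraken2/20210101-120000/*.k2d resources/kraken/standard/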
Non-standard databases
Using any of the other non-standard databases from the central location is also a simple process, e.g. for the latest SILVA index:
mkdir -p resources/kraken/silva
ln -s /sw/data/Kraken2/latest_silva/*.k2d resources/kraken/silva/
file /sw/data/Kraken2/latest_silva | egrep -o "[0-9]{8}\-[0-9]{6}" > resources/kraken/silva/kraken.version
Then update your config file with:
kraken:
  standard_db: False
  custom: "resources/kraken/silva"
Centrifuge database
There are several centrifuge indices on Uppmax located at /sw/data/Centrifuge-indices/20180720/, but keep in mind that they are from 2018.
The available indices are p_compressed, p_compressed+h+v, p+h+v, and p+v (see the centrifuge manual for information on what these contain). To use the p_compressed index, run:
mkdir -p resources/centrifuge/p_compressed/
ln -s /sw/data/Centrifuge-indices/20180720/p_compressed.*.cf resources/centrifuge/p_compressed/
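No version file is generated in this case since the snapshot date (20180720) is already part of the path, but if you want one for your own records you can simply write the date to a file (the file name here is arbitrary):
echo "20180720" > resources/centrifuge/p_compressed/centrifuge.version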
Then update your config file to contain:
classification:
  centrifuge: True
centrifuge:
  custom: "resources/centrifuge/p_compressed/p_compressed"
Note that custom should point to the index prefix (directory plus file basename) rather than to a directory, since a centrifuge index consists of several p_compressed.*.cf files sharing that prefix.
GTDB
To use the centrally installed Genome Taxonomy Database (GTDB) release on Uppmax, do:
mkdir -p resources/gtdb
ln -s /sw/data/GTDB/R04-RS89/rackham/release89/* resources/gtdb/
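As with Pfam, you can list the central directory to see which GTDB releases are installed, and adjust the release in the path above accordingly:
ls /sw/data/GTDB/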
Then make sure your config file contains:
binning:
  gtdbtk: True
nr
Uppmax provides monthly snapshots of the nr non-redundant database. While the formatted file cannot be used directly with the nbis-meta workflow, you can save time by making use of the already downloaded fasta file.
To use the latest snapshot of nr for taxonomic annotation of contigs, do:
mkdir resources/nr
ln -s /sw/data/diamond_databases/Blast/latest/download/nr.gz resources/nr/nr.fasta.gz
file /sw/data/diamond_databases/Blast/latest | egrep -o "[0-9]{8}-[0-9]{6}" > resources/nr/nr.version
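Since the link points to a regular gzipped fasta file, a quick way to confirm that everything resolves is to print the first sequence header:
zcat resources/nr/nr.fasta.gz | head -1
The same check works for the UniRef90 setup below.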
Then update the taxonomy section in your config file to use the nr database:
taxonomy:
  database: "nr"
UniRef90
The UniRef90 database is clustered at 90% sequence identity and Uppmax provides downloaded fasta files that can be used directly with the workflow.
To use the latest snapshot of UniRef90 for taxonomic annotation of contigs, do:
mkdir resources/uniref90
ln -s /sw/data/diamond_databases/UniRef90/latest/download/uniref90.fasta.gz resources/uniref90/uniref90.fasta.gz
file /sw/data/diamond_databases/UniRef90/latest | egrep -o "[0-9]{4}_[0-9]{2}" > resources/uniref90/uniref90.version
Then update the taxonomy section in your config file to use the uniref90 database:
taxonomy:
  database: "uniref90"
Configure workflow for the SLURM Workload Manager
The workflow comes with the SLURM snakemake profile pre-installed. All you have to do is modify the config/cluster.yaml file and insert your cluster account ID:
__default__:
  account: staff # <-- exchange staff with your SLURM account id
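If you are unsure of your account ID, the projinfo command on Uppmax lists the projects you are a member of; use one of the project IDs from its output as the account value:
projinfo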
Then you can run the workflow with --profile slurm from the root of the workflow directory, e.g.:
snakemake --use-conda --profile slurm -j 100 --configfile myconfig.yaml
Here the -j 100 flag means that snakemake can have at most 100 jobs in the queue at the same time for this run.
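Keep in mind that the main snakemake process has to stay alive for the entire run in order to submit and monitor the cluster jobs. For long runs, consider starting it inside a tmux or screen session on a login node, or submitting it as a job of its own. A sketch of the latter, where the account, partition settings and run time are placeholders to adapt to your project:
sbatch -A snic2021-22-123 -p core -n 1 -t 5-00:00:00 \
  --wrap "snakemake --use-conda --profile slurm -j 100 --configfile myconfig.yaml"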