Getting started

The configuration file

The workflow comes with a configuration file config/config.yaml. You can either modify this file directly or create a new config file and point snakemake to it with snakemake --configfile <your-config-file>.yaml. The configuration is validated with a schema found under workflow/schemas/config.schema.yaml which is also used to set default values. Specifying a configfile with --configfile extends and overwrites the default settings so that any configfile you specify only needs to contain the parameters you want to change.

For example, the default config file begins with:

sample_list: config/samples.tsv

paths:
  results: "results"

A new config file, named e.g. conf_test.yaml, and containing only the lines:

paths:
    results: "test"

will then use test/ as output directory for results when the workflow is run as:

snakemake --use-conda --configfile conf_test.yaml -j 4

The sample list

The workflow requires you to supply a file with some information on your samples. The path to this file is specified with the sample_list: parameter in the configuration file.

As a minimum the file must contain the columns sample, unit and fq1. Paired-end samples also require the fq2 column for the mate 2 read file.

A very basic sample file may look like this:

sample unit fq1 fq2
sample1 1 examples/data/sample1_1_R1.fastq.gz examples/data/sample1_1_R2.fastq.gz
sample2 1 examples/data/sample2_11_R1.fastq.gz examples/data/sample2_11_R2.fastq.gz
sample3 1 examples/data/sample3_21_R1.fastq.gz  

The columns fq1 and fq2 (for paired-end data) specify the paths to fastq files which will be used as input to the workflow.

Specifying assemblies

You may also include a column named assembly with names for assemblies to create. The assembly field can be comma-separated entries specifying several assemblies per sample/unit combination.

For instance, with a sample file like this:

sample unit assembly fq1 fq2
sample1 1 sample1,all examples/data/sample1_1_R1.fastq.gz examples/data/sample1_1_R2.fastq.gz
sample2 1 sample2,all examples/data/sample2_11_R1.fastq.gz examples/data/sample2_11_R2.fastq.gz
sample3 1 sample3,all examples/data/sample3_21_R1.fastq.gz  

a total of 4 assemblies will be generated: - sample1, sample2 and sample3 with input only from each sample respectively, and - all with input from all samples