Preprocessing reads¶
Read preprocessing can be run as a single part of the workflow using the command:
snakemake --configfile <yourconfigfile> preprocess
Input reads can be trimmed using either Trimmomatic or Cutadapt.
Trimmomatic¶
The settings specific to Trimmomatic are:
trimmomatic:
Set to True
to preprocess reads using Trimmomatic.
trimmomatic_home:
The directory where Trimmomatic stores the .jar file and adapter sequences. If you don’t know it
leave it blank to let the pipeline attempt to locate it.
trim_adapters:
Set to ‘True’ to perform adapter trimming.
pe_adapter_params:
The adapter trim settings for paired end reads. This is what follows the ‘ILLUMINACLIP’ flag.
The default “2:30:15” will look for seeds with a maximum of 2 mismatches, and clip if extended seeds reach a score
of 30 for paired-end reads or 10.
pe_pre_adapter_params:
Trim settings to be performed prior to adapter trimming. See the
Trimmomatic manual for
possible settings. As an example, to trim the first 10 bp from the start of reads set this to HEADCROP:10
.
pe_post_adapter_params:
Trim settings to be performed after adapter trimming. To for instance set a 50 bp
threshold on the minimum lenghts of reads after all trimming is done, set this to MINLEN:50
.
The se_adapter_params:
and se_post_adapter_params:
settings are the same as above but for single-end reads.
Cutadapt¶
cutadapt:
Set to True
to run preprocessing with cutadapt.
Note
Trimmomatic has priority in the preprocessing so if both Trimmomatic and Cutadapt are set to True, only Trimmomatic will be run.
adapter_sequence:
Adapter sequence for trimming. By default the workflow uses the Illumina TruSeq Universal Adapter.
rev_adapter_sequence:
3’ adapter to be removed from second read in a pair.
cutadapt_error_rate:
Maximum allowed error rate as value between 0 and 1. Defaults to 0.1. Increasing this value removes more adapters.
Phix filtering¶
phix_filter
: Set to True to filter out sequences mapping to the PhiX genome.
Fastuniq¶
fastuniq:
Set to True to run de-duplication of paired reads using Fastuniq.
Note
Fastuniq only runs with paired-end reads so if your data contains single-end samples the sequences will just be propagated downstream without Fastuniq processing.
SortMeRNA¶
SortMeRNA finds rRNA reads by aligning to several rRNA databases. It can output aligning, rRNA, reads and non-aligning, non_rRNA, reads to different output files allowing you to filter your sequences.
sortmerna:
Set to True
to filter your raw reads with SortMeRNA.
sortmerna_keep:
Sortmerna produces files with reads aligning to rRNA (‘rRNA’ extension)
and not aligning to rRNA (‘non_rRNA’) extension. With the sortmerna_keep
setting
you specify which set of sequences you want to use for downstream analyses
(‘non_rRNA’ or ‘rRNA’)
sortmerna_remove_filtered:
Set to True to remove the filtered reads (i.e. the reads NOT specified
in ‘keep:’)
sortmerna_dbs:
Databases to use for rRNA filtering. Can include:
- rfam-5s-database-id98.fasta
- rfam-5.8s-database-id98.fasta
- silva-arc-16s-id95.fasta
- silva-arc-23s-id98.fasta
- silva-bac-16s-id90.fasta
- silva-bac-23s-id98.fasta
- silva-euk-18s-id95.fasta
- silva-euk-28s-id98.fasta
sortmerna_paired_strategy:
How to handle read-pairs where mates are classified differently. If set to
paired_in
both reads in a pair are put into the ‘rRNA’ bin if one of them aligns (i.e. more strict)
while paired_out
puts both reads in the ‘other’ bin.
sortmerna_params:
Extra parameters to use for the sortmerna step.
Markduplicates¶
This is technically post-processing but to remove duplicates prior to producing read counts of ORFs called on assembled
contigs you can set markduplicates:True
. The picard_jar
and picard_path
can most often be left blank as the workflow will automatically identify
these paths in your conda environment. However, if you run into trouble with this step try searching for
picard.jar
and its directory.