Paired-end sequencing

This tutorial describes a standard micca pipeline for the analysis of overlapping paired-end Illumina data. This pipeline is intended for different platforms, such as Illumina MiSeq and Illumina HiSeq. Although this tutorial explains how to apply the pipeline to 16S rRNA amplicons, it can be adapted to others markers gene/spacers, e.g. Internal Transcribed Spacer (ITS) or 28S.

Dataset download

This paired-end 16S rRNA dataset contains 3 samples in FASTQ format (V3-V4 region, 341-Forward 5’-CCTACGGGNGGCWGCAG-3’, 805-Reverse 5’-GACTACNVGGGTWTCTAATCC-3’). The 300-bp paired-end sequencing was carried out on an Illumina MiSeq.

Open a terminal, download the data and prepare the working directory:

wget ftp://ftp.fmach.it/metagenomics/micca/examples/pairedend.tar.gz
tar -zxvf pairedend.tar.gz
cd pairedend

Merge paired-end sequences

Now the paired sequences must be merged to obtain consensus sequences (sometimes called assembly). This operation can be performed with the mergepairs command. After the merging of the paired reads, the mergepairs command merges the different samples in a single file where sample names are appended to the sequence identifier, as in merge and split. Passing the forward files only reverse file names will be constructed by replacing the string _R1 in the forward file name with _R2 (typical in Illumina file names, see options -p/--pattern and -e/--repl).

Since the sequenced region is about of 464-bp (805-341) and the reads are of 300-bp, the overlap region is quite large (~136 bp), as rule of thumb we set a minimum overlap length of 100 and maximum number of allowed mismatches of about 1/3, say 32:

micca mergepairs -i fastq/*_R1*.fastq -o merged.fastq -l 100 -d 32

Note

Starting from micca 1.6.0 staggered read pairs (staggered pairs are pairs where the 3’ end of the reverse read has an overhang to the left of the 5’ end of the forward read) will be merged by default. To override this feature (and therefore to discard staggered alignments) set the -n/--nostagger option.

Note

mergepairs works with FASTQ files only.

Primer trimming

Segments which match PCR primers should be now removed. For Illumina paired-end (already merged) reads, we recommend to trim both forward and reverse primers and discard reads that do not contain the forward OR the reverse primer. Moreover, sequence preceding (for the forward) or succeding (for the reverse) primers should be removed:

_images/readpaired.png

These operations can be performed with the trim command:

micca trim -i merged.fastq -o trimmed.fastq -w CCTACGGGNGGCWGCAG -r GACTACNVGGGTWTCTAATCC -W -R -c

The option -W/--duforward and -R/--dureverse ensures that reads that do not contain the forward or the reverse primer will be discarded. With the option -c/--searchrc the command searches reverse complement primers too.

Quality filtering

Producing high-quality OTUs requires high-quality reads. filter filters sequences according to the maximum allowed expected error (EE) rate % (see Quality filtering strategy in micca). We recommend values <=1%.

For paired-end reads, we recommend to merge pairs first, then quality filter using a maximum EE threshold with no length truncation.

Warning

Parameters for the filter command should be chosen using the tool filterstats.

Choosing parameters for filtering

The command filterstats reports the fraction of reads that would pass for each specified maximum expected error (EE) rate %:

micca filterstats -i trimmed.fastq -o filterstats

Open the PNG file filterstats/stats_plot.png:

_images/filterstatspaired.png

In this case we are interested in the plot on top (minimum length filtering only). A truncation length of 400 and a maximum error rate of 0.5% seems to be a good compromise between the expected error rate and the number of reads remaining. Inspecting the file filterstats/minlen_stats.txt, you can see that more than 73% reads will pass the filter:

L       0.25    0.5     0.75    1.0     1.25    1.5
...
399  54.801  73.016  83.476  90.107  94.312  96.917
400  54.799  73.013  83.473  90.104  94.309  96.914
401  54.781  72.993  83.452  90.080  94.285  96.890
...

Note

To obtain general sequencing statistics, run stats.

Filter sequences

Now we can run the filter command with the selected parameters:

micca filter -i trimmed.fastq -o filtered.fasta -e 0.5 -m 400

Note

The maximum number of allowed Ns after truncation can be also specified in filterstats and in filter.