classify

usage: micca classify [-h] -i FILE -o FILE [-m {cons,rdp,otuid}] [-r FILE]
                    [-x FILE] [--cons-id CONS_ID]
                    [--cons-maxhits CONS_MAXHITS]
                    [--cons-minfrac CONS_MINFRAC]
                    [--cons-mincov CONS_MINCOV] [--cons-strand {both,plus}]
                    [--cons-threads THREADS]
                    [--rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}]
                    [--rdp-maxmem GB] [--rdp-minconf RDP_MINCONF]

micca classify assigns taxonomy for each sequence in the input file
and provides three methods for classification:

* VSEARCH-based consensus classifier (cons): input sequences are
searched in the reference database with VSEARCH
(https://github.com/torognes/vsearch). For each query sequence the
method retrieves up to 'cons-maxhits' hits (i.e. reference sequences
with identity >= 'cons-id'). Then, the most specific taxonomic label that is
associated with at least 'cons-minfrac' of the hits is
assigned. The method is similar to the UCLUST-based consensus
taxonomy assigner presented in doi: 10.7287/peerj.preprints.934v2
and available in QIIME.

* RDP classifier (rdp): only RDP classifier version >= 2.8 is
supported (doi:10.1128/AEM.00062-07). In order to use this
classifier RDP must be installed (download at
http://sourceforge.net/projects/rdp-classifier/files/rdp-classifier/)
and the RDPPATH environment variable set. The available
databases (--rdp-gene) are:

- 16S (16srrna)
- Fungal LSU (28S) (fungallsu)
- Warcup ITS (fungalits_warcup, doi: 10.3852/14-293)
- UNITE ITS (fungalits_unite)

For more information about the RDP classifier go to
http://rdp.cme.msu.edu/classifier/classifier.jsp

* OTU ID classifier (otuid): simply performs sequence ID matching
against the reference taxonomy file. This is the recommended strategy
when closed-reference clustering (--method closedref in micca-otu)
was performed. The OTU ID classifier requires a tab-delimited file
where the first column contains the current OTU ids and the second
column the reference taxonomy ids (see otuids.txt in micca-otu), e.g.:

REF1[TAB]1110191
REF2[TAB]1104777
REF3[TAB]1078527
...
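
The OTU ID matching described above amounts to a two-step lookup: OTU id to reference id, then reference id to lineage. The following is a minimal Python sketch of that idea, not micca's own code. The function name `classify_by_otuid`, the made-up lineages, and the "Unclassified" label for ids missing from the reference taxonomy are assumptions for illustration.

```python
def classify_by_otuid(otuid_rows, ref_tax_rows):
    """Map each OTU id to the lineage of its reference taxonomy id."""
    # Reference taxonomy: reference id -> lineage string.
    ref_tax = dict(row.rstrip("\n").split("\t", 1) for row in ref_tax_rows)
    result = {}
    for row in otuid_rows:
        otu_id, ref_id = row.rstrip("\n").split("\t")
        # Assumption: ids absent from the reference are labelled "Unclassified".
        result[otu_id] = ref_tax.get(ref_id, "Unclassified")
    return result


# OTU ids file, as in the example above.
otuids = ["REF1\t1110191\n", "REF2\t1104777\n"]
# Reference taxonomy rows with made-up lineages, for illustration only.
ref_tax = [
    "1110191\tk__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;\n",
    "1078527\tk__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__;f__;g__;\n",
]

for otu_id, lineage in classify_by_otuid(otuids, ref_tax).items():
    print(otu_id, lineage, sep="\t")
```
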

The input reference taxonomy file (--ref-tax) should be a
tab-delimited file where each row is in one of the following forms:

1. SEQID[TAB]k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;
2. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales;;;
3. SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
4. SEQID[TAB]D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__;D_5__;
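
To make the four row forms concrete, here is a hedged Python sketch that reduces each of them to the same list of taxon names. It is not micca's parser: the function `parse_taxonomy_row`, the prefix pattern, and the choice to strip rank prefixes and drop empty trailing ranks are assumptions made for illustration.

```python
import re

# Rank prefixes such as "k__" (form 1) or "D_0__" (form 4).
RANK_PREFIX = re.compile(r"^(?:[A-Za-z]|D_\d+)__")


def parse_taxonomy_row(row):
    """Split one SEQID[TAB]lineage row into (seqid, list of taxon names)."""
    seqid, lineage = row.rstrip("\n").split("\t", 1)
    names = [RANK_PREFIX.sub("", field.strip()) for field in lineage.split(";")]
    # Drop unclassified (empty) ranks at the end of the lineage.
    while names and not names[-1]:
        names.pop()
    return seqid, names


rows = [
    "SEQID\tk__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__;g__;",
    "SEQID\tBacteria;Firmicutes;Clostridia;Clostridiales;;;",
    "SEQID\tBacteria;Firmicutes;Clostridia;Clostridiales",
    "SEQID\tD_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__;D_5__;",
]
for row in rows:
    print(parse_taxonomy_row(row))
```

Under these assumptions all four rows yield the same four-rank lineage.
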

Compatible reference databases are Greengenes
(http://greengenes.secondgenome.com/downloads), QIIME-formatted SILVA
(https://www.arb-silva.de/download/archive/qiime/) and UNITE
(https://unite.ut.ee/repository.php).

The output file is a tab-delimited file where each row is in the
format:

SEQID[TAB]Bacteria;Firmicutes;Clostridia;Clostridiales
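
Because the output is a plain two-column file, it is easy to post-process. The sketch below, which is not part of micca, counts how many sequences carry each label at a given rank. The function `count_taxa_at_rank`, the "Unassigned" label for lineages that stop above the requested rank, and the sample ids are assumptions for illustration.

```python
import io
from collections import Counter


def count_taxa_at_rank(tax_file, rank_index):
    """Count labels at position `rank_index` (0 = highest rank) of each lineage."""
    counts = Counter()
    for line in tax_file:
        line = line.rstrip("\n")
        if not line:
            continue
        _seqid, lineage = line.split("\t", 1)
        names = lineage.split(";")
        # Lineages shorter than the requested rank are counted as "Unassigned".
        label = names[rank_index] if rank_index < len(names) else "Unassigned"
        counts[label] += 1
    return counts


# Stand-in for an output file opened with open("tax.txt").
tax_txt = io.StringIO(
    "OTU1\tBacteria;Firmicutes;Clostridia;Clostridiales\n"
    "OTU2\tBacteria;Firmicutes\n"
)
print(count_taxa_at_rank(tax_txt, rank_index=2))
```
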

optional arguments:
-h, --help            show this help message and exit

arguments:
-i FILE, --input FILE
                        input FASTA file (for 'cons' and 'rdp') or a tab-
                        delimited OTU ids file (for 'otuid') (required).
-o FILE, --output FILE
                        output taxonomy file (required).
-m {cons,rdp,otuid}, --method {cons,rdp,otuid}
                        classification method (default cons)
-r FILE, --ref FILE   reference sequences in FASTA format, required for
                        'cons' classifier.
-x FILE, --ref-tax FILE
                        tab-separated reference taxonomy file, required for
                        'cons' and 'otuid' classifiers.

VSEARCH-based consensus classifier specific options:
--cons-id CONS_ID     sequence identity threshold (0.0 to 1.0, default 0.9).
--cons-maxhits CONS_MAXHITS
                        number of hits to consider (>=1, default 3).
--cons-minfrac CONS_MINFRAC
                        for each taxonomic rank, a specific taxon will be
                        assigned if it is present in at least MINFRAC of the
                        hits (0.0 to 1.0, default 0.5).
--cons-mincov CONS_MINCOV
                        reject sequence if the fraction of alignment to the
                        reference sequence is lower than MINCOV. This
                        parameter prevents low-coverage alignments at the end
                        of the sequences (default 0.75).
--cons-strand {both,plus}
                        search both strands or the plus strand only (default
                        both).
--cons-threads THREADS
                        number of threads to use (1 to 256, default 1).
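
How the hits and --cons-minfrac interact can be sketched in a few lines of Python. This is a minimal illustration of a rank-by-rank consensus rule, assuming the hits have already been retrieved and filtered by identity; it is not micca's implementation, and the function name `consensus_taxonomy` and the example lineages are hypothetical.

```python
from collections import Counter


def consensus_taxonomy(hit_lineages, minfrac=0.5):
    """Return the deepest lineage supported by at least `minfrac` of the hits.

    `hit_lineages` holds one lineage per hit, each a list of taxon names
    ordered from the highest rank to the lowest.
    """
    n_hits = len(hit_lineages)
    depth = max((len(lineage) for lineage in hit_lineages), default=0)
    consensus = []
    for rank in range(depth):
        # Only hits that agree with the consensus so far vote at this rank.
        votes = Counter(
            lineage[rank]
            for lineage in hit_lineages
            if len(lineage) > rank and lineage[:rank] == consensus
        )
        if not votes:
            break
        taxon, count = votes.most_common(1)[0]
        # Stop at the first rank where no taxon reaches the required fraction.
        if count / n_hits < minfrac:
            break
        consensus.append(taxon)
    return consensus


# Three hypothetical hits for one query (as with the default maxhits of 3).
hits = [
    ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales"],
    ["Bacteria", "Firmicutes", "Clostridia"],
    ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales"],
]
print(";".join(consensus_taxonomy(hits, minfrac=0.5)))
```

In this example two of the three hits agree down to Clostridia, so the lineage is reported to class level; only one hit supports Clostridiales, which is below the 0.5 fraction.
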

RDP Classifier/Database specific options:
--rdp-gene {16srrna,fungallsu,fungalits_warcup,fungalits_unite}
                        marker gene/region
--rdp-maxmem GB       maximum memory size for the java virtual machine in GB
                        (default 2)
--rdp-minconf RDP_MINCONF
                        minimum confidence value to assign taxonomy to a
                        sequence (default 0.8)
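
The help text does not spell out how the confidence threshold is applied. One common reading, shown here purely as an assumption, is that a lineage is kept down to the last rank whose confidence is still at or above the threshold. The function `truncate_by_confidence` and the confidence values are hypothetical.

```python
def truncate_by_confidence(ranked_assignments, minconf=0.8):
    """Keep taxa from the top rank down while confidence >= minconf."""
    lineage = []
    for taxon, confidence in ranked_assignments:
        # Assumption: the first rank below the threshold ends the lineage.
        if confidence < minconf:
            break
        lineage.append(taxon)
    return lineage


# Made-up per-rank confidences for a single sequence.
assignment = [
    ("Bacteria", 1.0),
    ("Firmicutes", 0.97),
    ("Clostridia", 0.85),
    ("Clostridiales", 0.62),
]
print(";".join(truncate_by_confidence(assignment, minconf=0.8)))
```
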

Examples

Classification of 16S sequences using the consensus classifier and
Greengenes:

    micca classify -m cons -i input.fasta -o tax.txt \
    --ref greengenes_2013_05/rep_set/97_otus.fasta \
    --ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt

Classification of ITS sequences using the RDP classifier and the
UNITE database:

    micca classify -m rdp --rdp-gene fungalits_unite -i input.fasta \
    -o tax.txt

OTU ID matching after the closed reference OTU picking protocol:

    micca classify -m otuid -i otuids.txt -o tax.txt \
    --ref-tax greengenes_2013_05/taxonomy/97_otu_taxonomy.txt