Classification
`ganon classify` will match single and/or paired-end sets of reads against one or more databases.
By default, parameters are optimized for taxonomic profiling, meaning that fewer reads will be classified but with a higher sensitivity. For example:
ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32
Output files:
- `results.rep`: plain report of the run, used to further generate tree-like reports
- `results.tre`: tree-like report with cumulative abundances by taxonomic ranks (can be re-generated with `ganon report`)
By default, `ganon classify` only writes report files. To get files with the classification of each read, use `--output-one` and/or `--output-all`. More information about output files here.
Note
ganon performs taxonomic profiling and/or binning (one taxonomic assignment for each read) at a taxonomic, strain or sequence level. Some guidelines are listed below; choose the parameters according to your application.
Profiling
`ganon classify` is set up by default to perform taxonomic profiling. It uses:
- strict thresholds: `--rel-cutoff 0.75` and `--rel-filter 0.1`
- `--min-count 0.00005` (0.005%) to exclude very low abundant taxa
- `--report-type abundance` to generate taxonomic abundances, correcting for genome sizes (more info here)
Binning
To achieve better results for taxonomic binning or sequence classification, `ganon classify` can be configured with `--binning`, which is the same as:
- less strict thresholds: `--rel-cutoff 0.25 --rel-filter 0`
- `--min-count 0` reports all taxa with at least one read assigned to them
- `--report-type reads` will report sequence abundances instead of taxonomic abundances (more info here)
Tip
Database parameters in `ganon build` can also influence your results. A lower `--max-fp` (e.g. 0.1, 0.001) and a higher `--kmer-size` (e.g. `23`, `27`) will improve the sensitivity of your results at the cost of a larger database and higher memory usage.
Reads with multiple matches
There are two ways to resolve reads with multiple matches in `ganon classify`, plus an option to leave them unresolved:
- `--multiple-matches em` (default): uses an Expectation-Maximization algorithm, re-assigning reads with multiple matches to the single most probable target (defined by `--level` in the build procedure).
- `--multiple-matches lca`: uses the Lowest Common Ancestor algorithm, re-assigning reads with multiple matches to higher common ancestors in the taxonomic tree.
- `--multiple-matches skip`: will not resolve multi-matching reads
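To make the Lowest Common Ancestor idea concrete, here is a minimal sketch on a toy taxonomy. It is an illustration only, not ganon's implementation: the tree, the node names and the helper functions are all hypothetical.

```python
# Toy taxonomy stored as child -> parent links.
# All node names are hypothetical and only serve the illustration.
PARENT = {
    "root": None,
    "family_X": "root",
    "genus_A": "family_X",
    "species_A1": "genus_A",
    "species_A2": "genus_A",
    "genus_B": "family_X",
    "species_B1": "genus_B",
}


def lineage(node):
    """Return the path from a node up to the root, starting with the node."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lowest_common_ancestor(nodes):
    """Return the deepest node shared by the lineages of all given nodes."""
    nodes = list(nodes)
    common = set(lineage(nodes[0]))
    for node in nodes[1:]:
        common &= set(lineage(node))
    # Walking up from the first node, the first shared node is the deepest one.
    for candidate in lineage(nodes[0]):
        if candidate in common:
            return candidate


# A read matching two species of the same genus is moved up to that genus.
print(lowest_common_ancestor(["species_A1", "species_A2"]))  # genus_A
# A read matching species from different genera is moved up to the family.
print(lowest_common_ancestor(["species_A1", "species_B1"]))  # family_X
```

The more distant the matched targets are in the tree, the higher (and less specific) the rank the read ends up at.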
Tip
- The Expectation-Maximization can be performed independently with `ganon reassign` using the output files `.rep` and `.all`.
- Reports can be generated independently with `ganon report` using the output file `.rep`
Note
`--multiple-matches lca` paired with `--report-type abundance` or `dist` will distribute read counts with multiple matches to one most probable target (defined by `--level` in the build procedure), instead of a higher taxonomic rank. In this case the distribution is simply based on the number of taxa with unique matches and is not as precise as the EM algorithm, but it will run faster since the per-read re-assignment can be skipped.
Classifying more reads
By default ganon will classify fewer reads in favour of sensitivity. To classify more reads, use less strict `--rel-cutoff` and `--rel-filter` values (e.g. `0.25` and `0`, respectively). More details here.
Multiple and Hierarchical classification
`ganon classify` can be performed against multiple databases at the same time. The databases can also be provided in a hierarchical order.
Multiple database classification can be performed by providing several inputs for `--db-prefix`. They are required to be built with the same `--kmer-size` and `--window-size` values. Multiple databases are considered as one (as if built together) and redundancy in content (same reference in two or more databases) is allowed.
To classify reads in a hierarchical order, `--hierarchy-labels` should be provided. When using multiple hierarchical levels, output files will be generated for each level (use `--output-single` to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. `--rel-cutoff`) while others are set for each hierarchical level (e.g. `--rel-filter`).
Examples
Classification against 3 databases (as if they were one) using the same cutoff:
ganon classify --db-prefix db1 db2 db3 \
--rel-cutoff 0.75 \
--single-reads reads.fq.gz
Classification against 3 databases (as if they were one) using a different cutoff for each:
ganon classify --db-prefix db1 db2 db3 \
--rel-cutoff 0.2 0.3 0.1 \
--single-reads reads.fq.gz
In this example, reads are going to be classified first against db1 and db2. Reads without a valid match will be further classified against db3. `--hierarchy-labels` are strings and are going to be sorted to define the hierarchy order, disregarding input order:
ganon classify --db-prefix db1 db2 db3 \
--hierarchy-labels 1_first 1_first 2_second \
--single-reads reads.fq.gz
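The grouping described above can be sketched as follows. This is an illustration under the assumption that labels are compared as plain strings; `hierarchy_levels` is a hypothetical helper, not part of ganon.

```python
from itertools import groupby


def hierarchy_levels(db_prefixes, hierarchy_labels):
    """Group databases by label; levels are ordered by the sorted labels.

    Assumption: plain string sorting, which may differ in detail from
    what ganon does internally.
    """
    if len(db_prefixes) != len(hierarchy_labels):
        raise ValueError("one hierarchy label per database is expected")
    # Sort by label only; the sort is stable, so databases keep their
    # relative input order within a level.
    pairs = sorted(zip(hierarchy_labels, db_prefixes), key=lambda p: p[0])
    return [
        (label, [db for _, db in group])
        for label, group in groupby(pairs, key=lambda p: p[0])
    ]


print(hierarchy_levels(["db1", "db2", "db3"],
                       ["1_first", "1_first", "2_second"]))
# [('1_first', ['db1', 'db2']), ('2_second', ['db3'])]

# The order in which the databases are given does not change the levels.
print(hierarchy_levels(["db3", "db1", "db2"],
                       ["2_second", "1_first", "1_first"]))
# [('1_first', ['db1', 'db2']), ('2_second', ['db3'])]
```

Because the labels are sorted as strings, a prefix such as `1_`, `2_` is a simple way to make the intended order explicit.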
In this example, classification will be performed with a different `--rel-cutoff` for each database. For each hierarchy level (`1_first` and `2_second`) a different `--rel-filter` will be used:
ganon classify --db-prefix db1 db2 db3 \
--hierarchy-labels 1_first 1_first 2_second \
--rel-cutoff 1 0.5 0.25 \
--rel-filter 0.1 0.5 \
--single-reads reads.fq.gz
Parameter details
reads (--single-reads, --paired-reads)
ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The max. read length accepted is 65535 (accounting for both reads in paired mode).
cutoff and filter (--rel-cutoff, --rel-filter)
ganon has two main parameters to control the number and strictness of matches between reads and references: `--rel-cutoff` and `--rel-filter`.
Every read can be classified against none, one or more references. ganon will report only matches after `cutoff` and `filter` thresholds are applied, based on the number of shared k-mers between sequences (use `--rel-cutoff 0` and `--rel-filter 1` to deactivate them).
The `cutoff` is the first to be applied. It sets the min. percentage of k-mers of a read to be shared with a reference to consider a match. Next the `filter` is applied to the remaining matches. `filter` thresholds are relative to the best and worst scoring match after `cutoff` and control the percentage of additional matches (if any) that should be reported, sorted from best to worst. `filter` won't change the total number of matched reads but will change the number of unique or multi-matched reads.
`cutoff` can be interpreted as the lower bound to discard spurious matches and `filter` as the fine tuning to control what to keep.
In summary:
- `--rel-cutoff` controls the strictness of the matching algorithm.
  - lower values -> more read matches
  - higher values -> fewer read matches
- `--rel-filter` controls how many matches each read will have, from best to worst
  - lower values -> more unique matching reads
  - higher values -> more multi-matching reads
For example, using a hypothetical number of k-mer matches, a certain read with 82 k-mers has the following matches with the 5 references (`ref1..5`), sorted by number of shared k-mers:
reference | shared k-mers |
---|---|
ref1 | 82 |
ref2 | 68 |
ref3 | 44 |
ref4 | 25 |
ref5 | 20 |
With `--rel-cutoff 0.25`, the following matches will be discarded:
reference | shared k-mers | --rel-cutoff 0.25 |
---|---|---|
ref1 | 82 | |
ref2 | 68 | |
ref3 | 44 | |
ref4 | 25 | |
~~ref5~~ | ~~20~~ | X |
since the `--rel-cutoff` threshold is `82 * 0.25 = 20.5`, rounded up to `21` (ceiling is applied).
Next, with `--rel-filter 0.5`, the following matches will be discarded:
reference | shared k-mers | --rel-cutoff 0.25 | --rel-filter 0.5 |
---|---|---|---|
ref1 | 82 | | |
ref2 | 68 | | |
~~ref3~~ | ~~44~~ | | X |
~~ref4~~ | ~~25~~ | | X |
~~ref5~~ | ~~20~~ | X | |
since `82` is the best match and `25` is the worst remaining match, the filter will keep the top remaining matches, based on the shared k-mers threshold `82 - ((82-25)*0.5) = 53.5`, rounded up to `54` (ceiling is applied). `ref1` and `ref2` are reported as matches.
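The arithmetic of this example can be reproduced with a short sketch. It only mirrors the two formulas given in the text; `apply_cutoff_and_filter` is a hypothetical helper and not ganon's actual code.

```python
import math


def apply_cutoff_and_filter(read_kmers, shared, rel_cutoff, rel_filter):
    """Return the references kept after applying cutoff, then filter.

    read_kmers: number of k-mers in the read
    shared:     dict mapping reference name -> number of shared k-mers
    """
    # Cutoff: minimum number of shared k-mers, relative to the read size.
    min_shared = math.ceil(read_kmers * rel_cutoff)
    kept = {ref: n for ref, n in shared.items() if n >= min_shared}
    if not kept:
        return {}
    # Filter: threshold relative to the best and worst remaining matches.
    best, worst = max(kept.values()), min(kept.values())
    threshold = math.ceil(best - (best - worst) * rel_filter)
    return {ref: n for ref, n in kept.items() if n >= threshold}


shared = {"ref1": 82, "ref2": 68, "ref3": 44, "ref4": 25, "ref5": 20}

print(apply_cutoff_and_filter(82, shared, rel_cutoff=0.25, rel_filter=0.5))
# {'ref1': 82, 'ref2': 68}

# --rel-cutoff 0 and --rel-filter 1 keep every match.
print(apply_cutoff_and_filter(82, shared, rel_cutoff=0, rel_filter=1))
# {'ref1': 82, 'ref2': 68, 'ref3': 44, 'ref4': 25, 'ref5': 20}
```

With `rel_filter=0` the threshold equals the best score, so only the top-scoring reference(s) remain, which is why lower filter values yield more unique matches.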
Tip
The actual number of unique k-mers in a read is used as an upper bound to calculate the thresholds. The same applies when using `--window-size` and minimizers.
Note
A different `--rel-cutoff` can be set for every database in a multiple or hierarchical database classification. A different `--rel-filter` can be set for every level of a hierarchical database classification.
Note
Reads that remain with only one reference match (after `cutoff` and `filter` are applied) are considered a unique match.
False positive of a query (--fpr-query)
ganon uses Bloom Filters, probabilistic data structures that may return false positive results. The base false positive rate of a ganon index is controlled by `--max-fp` when building the database. However, this value is the expected false positive rate for each k-mer. In practice, a sequence (several k-mers) will have a much smaller false positive rate. ganon calculates the false positive rate of a query as suggested by (Solomon and Kingsford, 2016). `--fpr-query` controls the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduced by the properties of the data structure.
By default, `--fpr-query 1e-5` is used and it is applied after `--rel-cutoff` and `--rel-filter`. Values between `1e-3` and `1e-10` are recommended. This threshold becomes more important when building smaller databases with higher `--max-fp`, ensuring that the false positive rate stays under control. In this case, however, the sensitivity of results may decrease.
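To see why a whole sequence has a much smaller false positive rate than a single k-mer, consider a simplified binomial model. This is an assumption made for illustration: it treats k-mer false positives as independent events and may differ from ganon's exact computation. `query_false_positive` is a hypothetical helper, and the per-k-mer rate of 0.05 is an arbitrary example value.

```python
from math import comb


def query_false_positive(n_kmers, min_shared, kmer_fp):
    """Probability that at least `min_shared` of `n_kmers` k-mers are
    false positives, assuming independent k-mers (binomial tail)."""
    return sum(
        comb(n_kmers, i) * kmer_fp**i * (1 - kmer_fp) ** (n_kmers - i)
        for i in range(min_shared, n_kmers + 1)
    )


# Hypothetical values: a read with 82 k-mers that needs at least
# 21 shared k-mers to match, with a per-k-mer false positive of 0.05.
per_kmer = 0.05
per_query = query_false_positive(82, 21, per_kmer)
print(per_query < per_kmer)  # True
```

Under this model, requiring more shared k-mers lowers the chance that a match is produced by false positives alone, which is the effect a query-level threshold controls.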
Note
The false positive of a query was first proposed in: Solomon, Brad, and Carl Kingsford. “Fast Search of Thousands of Short-Read Sequencing Experiments.” Nature Biotechnology 34, no. 3 (2016): 1–6. https://doi.org/10.1038/nbt.3442.