Databases

ganon automates the download, update and build of databases based on the NCBI RefSeq and GenBank genome repositories with the ganon build and ganon update commands, for example:

ganon build -g archaea bacteria -d arc_bac -c -t 30

This will download archaeal and bacterial complete genomes from RefSeq and build a database with 30 threads. Some days later, the database can be updated to include the newest genomes with:

ganon update -d arc_bac -t 30
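The build-then-update workflow above can be wrapped in a small helper that decides which command applies. This is a minimal sketch, not part of ganon: the function name next_ganon_cmd is hypothetical, and it assumes that an existing {prefix}_files/ directory indicates that a previous ganon build was run for that prefix. It only prints the command (dry run).

```shell
#!/bin/sh
# Print the ganon command to run next for a database prefix.
# Assumption: a previous "ganon build" left a "<prefix>_files/" directory behind.
next_ganon_cmd() {
    prefix=$1
    threads=$2
    if [ -d "${prefix}_files" ]; then
        # Database was built before: fetch new genomes only
        echo "ganon update -d ${prefix} -t ${threads}"
    else
        # First run: download and build from scratch
        echo "ganon build -g archaea bacteria -d ${prefix} -c -t ${threads}"
    fi
}

# Dry run: prints the command without executing it
next_ganon_cmd arc_bac 30
```

The printed line can be reviewed and then executed by hand or piped to sh.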

Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom command.

Info

We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get the most out of your data.

RefSeq and GenBank

NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered. The choice of those filters can drastically change the outcome of results.

Commonly used sub-sets

RefSeq (2023-03-14) # assemblies # species Size* ganon build
All genomes 295219 52781 160
ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs
All genomes - 1 assembly/species 52781 52781 128
ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater "-A 'species:1'" --db-prefix abfv_rs_t1s
Complete genomes 44121 19715 35
ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg
Complete genomes - 1 assembly/species 19715 19715 29
ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater "-A 'species:1'" --db-prefix abfv_rs_cg_t1s
Representative genomes 18073 18073 69
ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg
GenBank (2023-03-14) # assemblies # species Size* ganon build
All genomes 1595845 99505 -
ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb
All genomes - 1 assembly/species 99505 99505 300
ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater "-A 'species:1'" --db-prefix abfv_gb_t1s
Complete genomes 92917 34815 42
ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg
Complete genomes - 1 assembly/species 34815 34815 34
ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater "-A 'species:1'" --db-prefix abfv_gb_cg_t1s

Info

Data obtained on 2023-03-14 for the archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers have most likely grown. The commands provided will download up-to-date assemblies and will therefore require slightly more resources.

GTDB R214 # assemblies # species Size* ganon build
All genomes 402709 85205 260
ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --db-prefix ab_gtdb
All genomes - 1 assembly/species 85205 85205 213
ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s

Info

GTDB covers only bacteria and archaea groups and has assemblies from RefSeq and GenBank.

* Size in GB. ganon requires up to 2x the database size in memory to build it. The memory required to use it in classification is approximately the same as the database size.

  • As a rule of thumb, the more the better, so choose the most comprehensive sub-set possible given your computational resources.
  • It is possible to build databases with a fixed size/RAM usage. Beware that smaller filters will increase the false positive rate when classifying. Other approaches can reduce the size/RAM requirements with some trade-offs.
  • Alternatively, you can build one database for each organism group separately and use them in ganon classify in any order or even stack them hierarchically. This way, combinations of multiple databases are possible, extending use cases.
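The memory rule from the size footnote (up to 2x the database size to build, about 1x to classify) can be turned into a quick check before picking a sub-set. This is a rough sketch only; the function names are hypothetical and the sizes are the whole-GB values from the tables above.

```shell
#!/bin/sh
# Rough memory estimate from a database size in GB (the "Size*" column).
# Rule from the footnote: building needs up to 2x the size, classifying about 1x.
ganon_mem_estimate() {
    size_gb=$1
    echo "build_gb=$((size_gb * 2)) classify_gb=${size_gb}"
}

# Succeeds if a database of size_gb can be built with ram_gb of memory.
fits_for_build() {
    size_gb=$1
    ram_gb=$2
    [ $((size_gb * 2)) -le "$ram_gb" ]
}

# Example: RefSeq complete genomes (35 GB in the table) on a 64 GB machine
ganon_mem_estimate 35
if fits_for_build 35 64; then
    echo "build should fit in 64 GB"
else
    echo "build may not fit in 64 GB"
fi
```

Since the table sizes date from 2023-03-14, some headroom on top of these estimates is advisable.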

Further examples of commonly used databases can be found here.

Specific organisms or taxonomic groups

It is also possible to generate databases for specific organisms or taxonomic branches with -a/--taxid, for example:

ganon build --source refseq --taxid 562 317 --threads 48 --db-prefix coli_syringae

will download and build a database for all Escherichia coli (taxid:562) and Pseudomonas syringae (taxid:317) assemblies from RefSeq.

More filter options

ganon uses genome_updater to manage downloads. Further specific options and filters can be provided with the parameter -u/--genome-updater, for example:

ganon build -g bacteria -t 48 -d bac_refseq --genome-updater "-A 'genus:3' -E 20230101"

will download the top 3 bacterial assemblies for each genus with date before 2023-01-01. For more information about genome_updater parameters, please check the repository.
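The value of --genome-updater is a quoted string that itself contains quotes, which is easy to get wrong. The sketch below composes that string from its parts and prints the full command (dry run). The helper name gu_options is hypothetical; only the -A and -E options shown above are used.

```shell
#!/bin/sh
# Compose the value passed to --genome-updater.
# -A (top assemblies per rank) and -E (date before) are the options used above.
gu_options() {
    rank=$1
    top=$2
    before=${3:-}          # optional date, format YYYYMMDD
    opts="-A '${rank}:${top}'"
    if [ -n "$before" ]; then
        opts="${opts} -E ${before}"
    fi
    echo "$opts"
}

# Dry run: print the full command with the nested quotes in place
echo "ganon build -g bacteria -t 48 -d bac_refseq --genome-updater \"$(gu_options genus 3 20230101)\""
```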

GTDB

By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb.

Filtering by taxonomic entries also works with GTDB, for example:

ganon build --db-prefix fuso_gtdb --taxid "f__Fusobacteriaceae" --source refseq genbank --taxonomy gtdb --threads 12

Update (ganon update)

Default ganon databases generated with ganon build can be updated with ganon update. This procedure will download new files and re-generate the ganon database with the updated entries.

For example, a database generated with the following command:

ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 12

will contain all archaeal complete genomes from NCBI RefSeq at the time of running. Some days later, the database can be updated, fetching only new sequences added to the NCBI repository with the command:

ganon update --db-prefix arc_cg_rs --threads 12

Tip

To not overwrite the current database and create a new one with the updated files, use the --output-db-prefix parameter.

Reproducibility

If you use ganon with default databases and want to re-generate them later or keep track of their content for reproducibility purposes, you can save the assembly_summary.txt file located inside the {output_prefix}_files/ directory. To re-download the exact same snapshot of files used, one could use genome_updater, for example:

genome_updater.sh -e assembly_summary.txt -f "genomic.fna.gz" -o recovered_files -m -t 12 
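Saving the summary file can be scripted so that every build leaves a verifiable record. The following is a minimal sketch under the assumption that {output_prefix}_files/assembly_summary.txt exists as described above; the function name snapshot_summary and the target directory layout are hypothetical.

```shell
#!/bin/sh
# Archive the assembly_summary.txt of a database together with a checksum.
# Assumption: <prefix>_files/assembly_summary.txt exists after the build.
snapshot_summary() {
    prefix=$1
    outdir=$2
    src="${prefix}_files/assembly_summary.txt"
    [ -f "$src" ] || { echo "missing: $src" >&2; return 1; }
    mkdir -p "$outdir" || return 1
    cp "$src" "$outdir/assembly_summary.txt" || return 1
    # cksum is POSIX; swap in sha256sum if a stronger checksum is preferred
    ( cd "$outdir" && cksum assembly_summary.txt > assembly_summary.txt.cksum )
}

# Example call (hypothetical paths):
#   snapshot_summary arc_cg_rs snapshots/arc_cg_rs
```

The archived copy can later be passed to genome_updater.sh -e as in the command above.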

Reducing database size

Filter type (IBF and HIBF)

The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times (article). However, the HIBF takes a little longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type parameter in ganon build and ganon build-custom.

Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive rate when using the HIBF. The default value for high sensitivity is 0.1% (--filter-type hibf --max-fp 0.001).

Hint

  • For large unbalanced reference sets, lots of reads to query -> HIBF (default)
  • For quick database build and more flexibility -> IBF

False positive rate

A higher --max-fp value will generate a smaller database but with a higher number of false positive matches on classification. More details. Values between 0.001 (0.1%) and 0.3 (30%) are generally used.

Hint

When using higher --max-fp values, more false positive results may be generated. These can be filtered with the --fpr-query parameter in ganon classify.
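Because --max-fp takes a fraction rather than a percentage, a small conversion helper avoids off-by-a-factor mistakes. This sketch is illustrative only: the function name max_fp_percent is hypothetical, and the range check simply mirrors the 0.001 to 0.3 interval mentioned above; it is not a limit enforced by ganon.

```shell
#!/bin/sh
# Convert a --max-fp fraction to a percentage and flag values outside
# the commonly used 0.001-0.3 range.
max_fp_percent() {
    awk -v fp="$1" 'BEGIN {
        note = (fp < 0.001 || fp > 0.3) ? " (outside the commonly used range)" : ""
        printf "%g%%%s\n", fp * 100, note
    }'
}

max_fp_percent 0.001
max_fp_percent 0.2
```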

k-mer and window size

Define how much unique information is stored in the database. More details

  • The smaller the --kmer-size, the less unique the k-mers will be, reducing database size but also sensitivity in classification.
  • The bigger the --window-size, the less information needs to be stored, resulting in smaller databases but with decreased classification accuracy.
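The effect of the window size can be estimated with a back-of-the-envelope calculation. The sketch below assumes that one minimizer is stored per window and uses the textbook approximation that, for random sequences, about 2 / (w - k + 2) of all k-mers are selected. Both the formula and the example values k=19 and w=31 are assumptions made for illustration, not values taken from this page or from ganon's source code.

```shell
#!/bin/sh
# Expected fraction of k-mers kept as minimizers for random sequences:
# density ~ 2 / (w - k + 2). Textbook approximation, not ganon's own code.
minimizer_density() {
    awk -v k="$1" -v w="$2" 'BEGIN { printf "%.4f\n", 2 / (w - k + 2) }'
}

# Relative reduction (in %) of stored minimizers when going from w1 to w2.
window_reduction_pct() {
    awk -v k="$1" -v w1="$2" -v w2="$3" 'BEGIN {
        d1 = 2 / (w1 - k + 2)
        d2 = 2 / (w2 - k + 2)
        printf "%.1f\n", (1 - d2 / d1) * 100
    }'
}

# Assumed example values: k=19, window 31 versus window 35
minimizer_density 19 31
window_reduction_pct 19 31 35
```

Real databases will deviate from this estimate, since genomes are not random sequences and the filter adds its own overhead.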

Top assemblies

RefSeq and GenBank are highly biased toward a few organisms. This means that some species are highly represented in number of assemblies compared to others. This can not only bias analysis but also brings redundancy to the database. Choosing a certain number of top assemblies can mitigate those issues. Database sizes can also be drastically reduced without this redundancy, but "strain-level" analyses are then not possible. We recommend using top assemblies for larger and comprehensive reference sets (like the ones listed above) and using the full set of assemblies for specific clade analysis.

Example

  • ganon build --top 1 will select one assembly for each taxonomic leaf (NCBI taxonomy still has strain, sub-species, ...)
  • ganon build --genome-updater "-A 'species:1'" will select one assembly for each species
  • ganon build --genome-updater "-A 'genus:3'" will select three assemblies for each genus

Split databases

ganon allows classification with multiple databases in one level or in a hierarchy (More details). This means that databases can be built separately and used in any combination as desired. There are usually some benefits of doing so:

  • Smaller databases when building by organism group, for example: one for bacteria, another for viruses, ... since average genome sizes are quite different.
  • Easier to maintain and update.
  • Extend use cases and avoid misclassification due to contaminated databases.
  • Use databases as quality control, for example: remove reads matching one database of host or vectors (check out ganon report --skip-hierarchy).
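Building one database per organism group, as suggested above, amounts to repeating the same build command with a different group and prefix. The sketch below prints those commands (dry run). The function name split_build_cmds and the "<group>_rs_cg" naming scheme are hypothetical; the ganon flags are the ones used elsewhere on this page.

```shell
#!/bin/sh
# Print one build command per organism group (dry run).
# The db-prefix naming scheme "<group>_rs_cg" is a hypothetical convention.
split_build_cmds() {
    threads=$1
    shift
    for group in "$@"; do
        echo "ganon build --source refseq --organism-group ${group} --complete-genomes --threads ${threads} --db-prefix ${group}_rs_cg"
    done
}

split_build_cmds 48 archaea bacteria fungi viral
```

Each resulting database can then be updated on its own schedule with ganon update.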

Fixed size and Mode (only for --filter-type ibf)

A fixed size for the database filter can be defined with --filter-size when using --filter-type ibf. The smaller the filter size, the higher the false positive chances on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. More details.

--mode offers 5 different categories to build a database controlling the trade-off between size and classification speed.

  • avg: Balanced mode
  • smaller or smallest: create smaller databases with slower classification speed
  • fast or fastest: create bigger databases with faster classification speed

Warning

If --filter-size is used, smaller and smallest refer to the false positive rate and not to the database size (which is fixed).
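When scripting builds with a fixed filter size, it can help to check the --mode value before starting a long download. This is a minimal sketch that prints the command instead of running it; the function names are hypothetical, and the filter size is assumed to be given in MB, as in the example table of this section.

```shell
#!/bin/sh
# Validate a --mode value against the five categories listed above.
is_valid_mode() {
    case "$1" in
        avg|smaller|smallest|fast|fastest) return 0 ;;
        *) return 1 ;;
    esac
}

# Print an IBF build command with a fixed filter size (dry run).
# Assumption: the filter size is given in MB.
ibf_fixed_cmd() {
    prefix=$1
    size=$2
    mode=$3
    is_valid_mode "$mode" || { echo "invalid mode: $mode" >&2; return 1; }
    echo "ganon build --source refseq --organism-group archaea --complete-genomes --filter-type ibf --filter-size ${size} --mode ${mode} --db-prefix ${prefix}"
}

ibf_fixed_cmd arc_rs_cg_fs256 256 avg
```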

Example

Besides the benefits of using HIBF and specific sub-sets of big repositories shown on the default databases table, examples of other reduction strategies with IBF can be seen below:

RefSeq archaeal complete genomes from 2023-05-05

Strategy Size (MB) Smaller Trade-off
default 318 - -
ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --db-prefix arc_rs_cg --filter-type ibf
--mode smallest 301 5% Slower classification
ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --mode smallest --db-prefix arc_rs_cg_smallest --filter-type ibf
--filter-size 256 256 19% Higher false positive on classification
ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --filter-size 256 --db-prefix arc_rs_cg_fs256 --filter-type ibf
--window-size 35 249 21% Less sensitive classification
ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --window-size 35 --db-prefix arc_rs_cg_ws35 --filter-type ibf
--max-fp 0.2 190 40% Higher false positive on classification
ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --max-fp 0.2 --db-prefix arc_rs_cg_fp0.2 --filter-type ibf

Note

This is an illustrative example and the reduction proportions for different configurations may be quite different.
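The "Smaller" column of the table can be recomputed from the sizes, which is handy when comparing your own builds. The sketch below assumes the percentages are truncated to whole numbers, which reproduces the values listed; the function name reduction_pct is hypothetical.

```shell
#!/bin/sh
# Size reduction relative to a baseline, truncated to a whole percent.
reduction_pct() {
    awk -v base="$1" -v size="$2" 'BEGIN { printf "%d%%\n", int((1 - size / base) * 100) }'
}

# Sizes in MB from the table, compared against the 318 MB default
for size in 301 256 249 190; do
    echo "${size} MB: $(reduction_pct 318 "$size") smaller"
done
```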