Custom databases:)
Besides the automated download and build (ganon build
) ganon provides a highly customizable build procedure (ganon build-custom
) to create databases from local sequence files. The usage of this procedure depends on the configuration of your files:
- Filename like
GCA_002211645.1_ASM221164v1_genomic.fna.gz
: genomic fasta files in the NCBI standard, with assembly accession in the beginning of the filename. Provide the files with the--input
parameter. ganon will try to retrieve all necessary information to build the database. - Headers like
>NC_006297.1 Bacteroides fragilis YCH46 ...
: sequence headers are in the NCBI standard, with sequence accession in after>
and with a space afterwards (or line break). Provide the files with the--input
parameter and set--input-target sequence
. ganon will try to retrieve all necessary information to build the database. - For non-standard filenames and headers, follow this
Warning
--input-target sequence
will be slower to build and will use more disk space, since files have be re-written separately for each sequence. More information about building by file or sequence can be found here.
The --level
is a important parameter that will define the (max.) classification level for the database (more infos):
--level file
orsequence
-> default behavior (depending on--input-target
), use file/sequence as classification target--level assembly
-> will retrieve assembly related to the file/sequence, use assembly as classification target--level leaves
orspecies
,genus
,... -> group input by taxonomy, use tax. nodes at the rank chosen as classification target
More infos about other parameters here.
Non-standard files/headers with --input-file
:)
Alternatively to the automatic input methods, it is possible to manually define the input with either standard or non-standard filenames, accessions and headers to build custom databases with --input-file
. This file should contain the following fields (tab-separated):
file [<tab> target <tab> node <tab> specialization <tab> specialization_name].
file
: relative or full path to the sequence filetarget
: any unique text to name the file, to be used in the taxonomynode
: taxonomic node (e.g. taxid) to link entry with taxonomyspecialization
: creates a specialized taxonomic level with a custom name, allowing files to be groupedspecialization_name
: a name for the specialization, to be used in the taxonomy
Warning
the target
and specialization
fields (2nd and 4th col) cannot be the same as the node
(3rd col)
Below you find example of --input-file
. Note they are slightly different depending on the --input-target
chosen. They need to be tab-separated to be properly parsed (tsv).
Examples of --input-file
using the default --input-target file
:)
List of files:)
sequences.fasta
others.fasta
No taxonomic information is provided so --taxonomy skip
should be set. The classification against the generated database will be performed at file level (--level file
), since that is the only available information given.
List of files with alternative names:)
sequences.fasta sequences
others.fasta others
Just like above, but with a specific name to be used for each file.
Files and taxonomy:)
sequences.fasta sequences 562
others.fasta others 623
The classification max. level against this database will depend on the value set for --level
:
--level file
-> use the file (named with target) with node as parent--level leaves
orspecies
,genus
,... -> files are grouped by taxonomy
Files, taxonomy and specialization:)
sequences.fasta sequences 562 ID44444 Escherichia coli TW10119
others.fasta others 623 ID55555 Shigella flexneri 1a
The classification max. level against this database will depend on the value set for --level
:
--level custom
-> use the specialization (named with specialization_name) with node as parent--level file
-> use the file (named with target) as a tax. node as parent--level leaves
orspecies
,genus
,... -> files are grouped by taxonomy
Examples of --input-file
using --input-target sequence
:)
To provide a tabular information for every sequence in your files, you need to use the target
field (2nd col.) of the --input-file
to input sequence headers. For example:
Sequences and taxonomy:)
sequences.fasta NZ_CP054001.1 562
sequences.fasta NZ_CP117955.1 623
others.fasta header1 666
others.fasta header2 666
The classification max. level against this database will depend on the value set for --level
:
--level sequence
-> use the sequence header with node as parent--level assembly
-> will attempt to retrieve the assembly related to the sequence with node as parent--level leaves
orspecies
,genus
,... -> files are grouped by taxonomy
Sequences, taxonomy and specialization:)
sequences.fasta NZ_CP054001.1 562 ID44444 Escherichia coli TW10119
sequences.fasta NZ_CP117955.1 623 ID55555 Shigella flexneri 1a
others.fasta header1 666 StrainA My Strain
others.fasta header2 666 StrainA My Strain
The classification max. level against this database will depend on the value set for --level
:
--level custom
-> use the specialization (named with specialization_name) with node as parent--level sequence
-> use the sequence header with node as parent--level leaves
orspecies
,genus
,... -> files are grouped by taxonomy
Examples:)
Below you will find some examples from commonly used repositories for metagenomics analysis with ganon build-custom
:
HumGut:)
Collection of >30000 genomes from healthy human metagenomes. Article/Website.
# Download sequence files
wget --quiet --show-progress "http://arken.nmbu.no/~larssn/humgut/HumGut.tar.gz"
tar xf HumGut.tar.gz
# Download taxonomy and metadata files
wget "https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp"
wget "https://arken.nmbu.no/~larssn/humgut/ncbi_names.dmp"
wget "https://arken.nmbu.no/~larssn/humgut/HumGut.tsv"
# Generate --input-file from metadata
tail -n+2 HumGut.tsv | awk -F"\t" '{print "fna/"$21"\t"$1"\t"$2}' > HumGut_ganon_input_file.tsv
# Build ganon database
ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp --db-prefix HumGut --level strain --threads 32
Similarly using GTDB taxonomy files:
# Download taxonomy files
wget "https://arken.nmbu.no/~larssn/humgut/gtdb_nodes.dmp"
wget "https://arken.nmbu.no/~larssn/humgut/gtdb_names.dmp"
# Build ganon database
ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files gtdb_nodes.dmp gtdb_names.dmp --db-prefix HumGut_gtdb --level strain --threads 32
Note
There is no need to use ganon's gtdb integration here since GTDB files in NCBI format are available
Plasmid, Plastid and Mitochondrion from RefSeq:)
Extra repositories from RefSeq release not included as default databases. Website.
# Download sequence files
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/"
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/"
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/"
ganon build-custom --input plasmid.* plastid.* mitochondrion.* --db-prefix ppm --level species --threads 8 --input-target sequence
UniVec, UniVec_core:)
"UniVec is a non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process." Website. Useful to screen for vector and linker/adapter contamination. UniVec_core is a sub-set of the UniVec selected to reduce the false positive hits from real biological sources.
# UniVec
wget -O "UniVec.fasta" --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec"
echo -e "UniVec.fasta\tUniVec\t81077" > UniVec_ganon_input_file.tsv
ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8 --skip-genome-size
# UniVec_Core
wget -O "UniVec_Core.fasta" --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core"
echo -e "UniVec_Core.fasta\tUniVec_Core\t81077" > UniVec_Core_ganon_input_file.tsv
ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8 --skip-genome-size
Note
All UniVec entries in the examples are categorized as Artificial Sequence (NCBI txid:81077). Some are completely artificial but others may be derived from real biological sources. More information in this link.
MGnify genome catalogues (MAGs):)
"Genome catalogues are biome-specific collections of metagenomic-assembled and isolate genomes". Article/Website/FTP.
Currently available genome catalogues (2024-02-09): chicken-gut
cow-rumen
honeybee-gut
human-gut
human-oral
human-vaginal
marine
mouse-gut
non-model-fish-gut
pig-gut
zebrafish-fecal
List currently available entries curl --silent --list-only ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/
Example on how to download and build the human-oral
catalog:
# Download metadata
wget "https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0.1/genomes-all_metadata.tsv"
# Download sequence files with 12 threads
tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl --silent ${1}| gzip -d | sed -e "1,/##FASTA/ d" | gzip > ${0}.fna.gz'
# Generate ganon input file
tail -n+2 genomes-all_metadata.tsv | cut -f 1,15 | tr ';' '\t' | awk -F"\t" '{tax="1";for(i=NF;i>1;i--){if(length($i)>3){tax=$i;break;}};print $1".fna.gz\t"$1"\t"tax}' > ganon_input_file.tsv
# Build ganon database
ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v101 --taxonomy gtdb --level leaves --threads 8
Note
MGnify genomes catalogues will be build with GTDB taxonomy.
Pathogen detection FDA-ARGOS:)
A collection of >1400 "microbes that include biothreat microorganisms, common clinical pathogens and closely related species". Article/Website/BioProject.
# Download sequence files
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
grep "strain=FDAARGOS" assembly_summary_refseq.txt > fdaargos_assembly_summary.txt
genome_updater.sh -e fdaargos_assembly_summary.txt -f "genomic.fna.gz" -o download -m -t 12
# Build ganon database
ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --ncbi-file-info download/assembly_summary.txt --level assembly --threads 32
Note
The example above uses genome_updater to download files
BLAST databases (nt env_nt nt_prok ...):)
Current available nucleotide databases (2024-02-09): 16S_ribosomal_RNA
18S_fungal_sequences
28S_fungal_sequences
Betacoronavirus
env_nt
human_genome
ITS_eukaryote_sequences
ITS_RefSeq_Fungi
LSU_eukaryote_rRNA
LSU_prokaryote_rRNA
mito
mouse_genome
nt
nt_euk
nt_others
nt_prok
nt_viruses
patnt
pdbnt
ref_euk_rep_genomes
ref_prok_rep_genomes
refseq_rna
refseq_select_rna
ref_viroids_rep_genomes
ref_viruses_rep_genomes
SSU_eukaryote_rRNA
tsa_nt
List currently available entries curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep "nucl-metadata.json" | sed 's/-nucl-metadata.json/, /g' | sort
Warning
Some BLAST databases are very big and may require extreme computational resources to build. You may need to use some reduction strategies.
The example shows how to download, parse and build a ganon database from BLAST database files. It does so by splitting the database into taxonomic specific files, to speed-up the build process:
# Define BLAST db
db="16S_ribosomal_RNA"
threads=8
# Download BLAST db - re-run this command many times until all finish (no more output)
curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep "^${db}\..*tar.gz$" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress "https://ftp.ncbi.nlm.nih.gov/blast/db/{}"
# OPTIONAL Download and check MD5
wget -O - -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\.*tar.gz.md5" > "${db}.md5"
find -name "${db}.*tar.gz" -type f -printf '%P\n' | xargs -P ${threads:-1} -I{} md5sum {} > "${db}_downloaded.md5"
diff -sy <(sort -k 2,2 "${db}.md5") <(sort -k 2,2 "${db}_downloaded.md5") # Should print "Files /dev/fd/xx and /dev/fd/xx are identical"
# Extract BLAST db files, if successful, remove .tar.gz
find -name "${db}.*tar.gz" -type f -printf '%P\n' | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > "${db}_extracted_files.txt"
# Create folder to write sequence files (split into 10 sub-folders)
seq 0 9 | xargs -i mkdir -p "${db}"/{}
# This command extracts sequences from the blastdb and writes them into taxid specific files
# It also generates the --input-file for ganon with the fields: filepath <tab> file <tab> taxid
blastdbcmd -entry all -db "${db}" -outfmt "%a %T %s" | \
awk -v db="$(realpath ${db})" '{file=db"/"substr($2,1,1)"/"$2".fna"; print ">"$1"\n"$3 >> file; print file"\t"$2".fna\t"$2}' | \
sort | uniq > "${db}_ganon_input_file.tsv"
# Build ganon database
ganon build-custom --input-file "${db}_ganon_input_file.tsv" --db-prefix "${db}" --threads ${threads} --level leaves
# Delete extracted files and auxiliary files
cat "${db}_extracted_files.txt" | xargs rm
rm "${db}_extracted_files.txt" "${db}.md5" "${db}_downloaded.md5"
# Delete sequences and input_file
rm -rf "${db}" "${db}_ganon_input_file.tsv"
Note
blastdbcmd
is a command from BLAST+ software suite (tested version 2.14.0) and should be installed separately.
Files from genome_updater:)
To create a ganon database from files previously downloaded with genome_updater:
ganon build-custom --input output_folder_genome_updater/version/ --input-recursive --db-prefix mydb --ncbi-file-info output_folder_genome_updater/assembly_summary.txt --level assembly --threads 32
Parameter details:)
False positive and size (--max-fp, --filter-size):)
ganon indices are based on bloom filters and can have false positive matches. This can be controlled with --max-fp
parameter. The lower the --max-fp
, the less chances of false positives matches on classification, but the larger the database size will be. For example, with --max-fp 0.01
the database will be build so any target (defined by --level
) will have 1 in a 100 change of reporting a false k-mer match. The false positive of the query (all k-mers of a read) will be way lower, but directly affected by this value.
Alternatively, one can set a specific size for the final index with --filter-size
. When using this option, please observe the theoretic false positive of the index reported at the end of the building process.
minimizers (--window-size, --kmer-size):)
in ganon build
, when --window-size
> --kmer-size
minimizers are used. That means that for a every window, a single k-mer will be selected. It produces smaller database files and requires substantially less memory overall. It may increase building times but will have a huge benefit for classification times. Sensitivity and precision can be reduced by small margins. If --window-size
= --kmer-size
, all k-mers are going to be used to build the database.
Target file or sequence (--input-target):)
This is a parameter that defines how ganon will parse your input files:
- --input-target file
(default) will consider every file provided with --input
a single unit (e.g. multi-fasta files are considered one input, sequence headers ignored).
- --input-target sequence
will use every sequence as a unit. For this, ganon will first decompose every sequence in the input files provided with --input
into a separated file. This will take longer and use more disk space.
--input-target file
is the default behavior and most efficient way to build databases. --input-target sequence
should only be used when the input sequences are not separated by file (e.g. a single big FASTA file) or when classification at sequence level is desired.
Build level (--level):)
The --level
parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp
is going to be guaranteed at the --level
chosen.
In ganon build
the default value is species
. In ganon build-custom
the level will be the same as --input-target
, meaning that classification will be done either at file
or sequence
level.
Alternatively, --level assembly
will link the file or sequence target information with assembly accessions retrieved from NCBI. --level leaves
or --level species
(or genus
, family
, ...) will link the targets with taxonomic information and prune the tree at the chosen level. --level custom
will use specialization (4th col.) defined in the --input-file
.
Genome sizes (--genome-size-files):)
Ganon will automatically download auxiliary files to define an approximate genome size for each entry in the taxonomic tree. For --taxonomy ncbi
the species_genome_size.txt.gz is used. For --taxonomy gtdb
the *_metadata.tar.gz files are used. Those files can be directly provided with the --genome-size-files
argument.
Genome sizes of parent nodes are calculated as the average of the respective children nodes. Other nodes without direct assigned genome sizes will use the closest parent with a pre-calculated genome size. The genome sizes are stored in the ganon database.
Retrieving info (--ncbi-sequence-info, --ncbi-file-info):)
Further taxonomy and assembly linking information has to be collected to properly build the database. --ncbi-sequence-info
and --ncbi-file-info
allow customizations on this step.
When --input-target sequence
, --ncbi-sequence-info
argument allows the use of NCBI e-utils webservices (eutils
) or downloads accession2taxid files to extract target information (options nucl_gb
nucl_wgs
nucl_est
nucl_gss
pdb
prot
dead_nucl
dead_wgs
dead_prot
). By default, ganon uses eutils
up-to 50000 input sequences, otherwise it downloads nucl_gb
nucl_wgs
from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. Previously downloaded files can be directly provided with this argument.
When --input-target file
, --ncbi-file-info
uses assembly_summary.txt
from https://ftp.ncbi.nlm.nih.gov/genomes/ to extract target information (options refseq
genbank
refseq_historical
genbank_historical
. Previously downloaded files can be directly provided with this argument.
If you are using outdated, removed or inactive assembly or sequence files and accessions from NCBI, make sure to include dead_nucl
dead_wgs
for --ncbi-sequence-info
or refseq_historical
genbank_historical
for --ncbi-file-info
. eutils
option does not work with outdated accessions.