Build an index
The build subcommand builds a custom index to predict resistance from.
$ drprg build -a annotation.gff3 -i catalogue.tsv -f ref.fa -o outdir
Required
Catalogue
Sometimes also called a "panel". This is the catalogue of mutations that confer
resistance (or susceptibility). See the Catalogue docs for details of the format
this file must take. Provided via the -i/--panel option.
Reference genome
A FASTA file of the reference genome that will act as the "skeleton" for the reference
graph. Provided via the -f/--fasta option.
Annotation
The annotation is a GFF3 file for the species you are building the index for.
Provided via the -a/--gff option.
VCF
A VCF to build the reference graph from. See VCF for a detailed description
of this file. Provided via the -b/--vcf option. This option is mutually exclusive
with --prebuilt-prg; if no VCF is provided, a prebuilt reference graph must be given instead.
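Because -b/--vcf and -d/--prebuilt-prg are mutually exclusive, a wrapper script may want to check that exactly one of them is supplied before calling drprg. The following is a minimal sketch only: graph_source_args is a hypothetical helper, not part of drprg, and the file names in the comment are placeholders.

```shell
# Hypothetical helper (not part of drprg): choose exactly one graph source.
# $1 = path to a VCF (may be empty), $2 = prebuilt PRG directory (may be empty).
graph_source_args() {
    gs_vcf="$1"
    gs_prebuilt="$2"
    if [ -n "$gs_vcf" ] && [ -n "$gs_prebuilt" ]; then
        echo "error: --vcf and --prebuilt-prg are mutually exclusive" >&2
        return 1
    elif [ -n "$gs_vcf" ]; then
        echo "--vcf $gs_vcf"
    elif [ -n "$gs_prebuilt" ]; then
        echo "--prebuilt-prg $gs_prebuilt"
    else
        echo "error: provide either a VCF or a prebuilt PRG" >&2
        return 1
    fi
}

# Possible use (placeholder file names, paths without spaces):
#   drprg build -a annotation.gff3 -i catalogue.tsv -f ref.fa -o outdir \
#       $(graph_source_args calls.vcf.gz "")
```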
Optional
Expert rules
These describe a set of rules - rather than a single mutation - that cause resistance.
See Expert rules for a detailed description of the rules. Provided via
the -r/--rules option.
Padding
Number of bases to add to the start and end of each gene (locus) in the reference graph.
Set to 100 by default. Provided via the -P/--padding option.
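As an illustration of the arithmetic only, the sketch below pads a gene interval on both sides. The coordinates are made up, padded_interval is a hypothetical helper, and how DrPRG treats genes closer than the padding to the end of a contig is not covered here.

```shell
# Hypothetical helper: print the padded interval for a gene.
# $1 = gene start, $2 = gene end, $3 = padding (bases added on each side).
padded_interval() {
    pi_start=$(( $1 - $3 ))
    pi_end=$(( $2 + $3 ))
    echo "$pi_start $pi_end"
}

# A made-up gene at 1000-2500 with the default padding of 100:
padded_interval 1000 2500 100   # prints: 900 2600
```

With a padding of 100, each locus in the reference graph is therefore 200 bases longer than the gene itself.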
Prebuilt reference graph
Advanced users only. This is for those who know how to build their own reference graphs and would prefer to do so by hand rather than let DrPRG do it for them. The value passed to this option must be a directory containing:
- A reference graph file named dr.prg
- A directory of multiple sequence alignments (msas/) that the reference graph was built from. This directory must contain a FASTA alignment file for each gene, named <gene>.fa.
- An optional pandora index. If not present, one will be created.
The reference graph is expected to contain the reference sequence (with padding) for each gene according to the annotation and reference sequence provided, and must be in the forward strand orientation.
Provided via the -d/--prebuilt-prg option.
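The directory layout described above can be checked before running the build. This is a sketch under stated assumptions: check_prebuilt_prg is a hypothetical helper (not a drprg command), the gene names are supplied by the caller, and the optional pandora index is not checked.

```shell
# Hypothetical helper: verify the layout of a prebuilt PRG directory.
# $1 = directory; remaining arguments = gene names expected in msas/.
check_prebuilt_prg() {
    cp_dir="$1"
    shift
    if [ ! -f "$cp_dir/dr.prg" ]; then
        echo "missing reference graph: $cp_dir/dr.prg" >&2
        return 1
    fi
    if [ ! -d "$cp_dir/msas" ]; then
        echo "missing MSA directory: $cp_dir/msas/" >&2
        return 1
    fi
    # Every gene needs an alignment named <gene>.fa
    for cp_gene in "$@"; do
        if [ ! -f "$cp_dir/msas/$cp_gene.fa" ]; then
            echo "missing alignment: $cp_dir/msas/$cp_gene.fa" >&2
            return 1
        fi
    done
    echo "ok"
}

# Possible use (placeholder directory and gene names):
#   check_prebuilt_prg my_prg geneA geneB && drprg build ... -d my_prg
```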
Version
The version name for the index. Set to the current date (YYYYMMDD) by default.
Provided via the --version option.
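If you want to record the version in a script, for example to reuse the same name elsewhere, a date string in the same YYYYMMDD form can be produced with the standard date utility. This is a small sketch; index_version is a hypothetical variable name and the file names in the comment are placeholders.

```shell
# Produce a date string in YYYYMMDD form, matching the default version format.
index_version=$(date +%Y%m%d)
echo "$index_version"

# Possible use (placeholder file names):
#   drprg build -a annotation.gff3 -i catalogue.tsv -f ref.fa -o outdir \
#       --version "$index_version"
```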
Quick usage
$ drprg build -h
Build an index to predict resistance from
Usage: drprg build [OPTIONS] --gff <FILE> --panel <FILE> --fasta <FILE>
Options:
-v, --verbose Use verbose output
-t, --threads <INT> Maximum number of threads to use [default: 1]
-P, --padding <INT> Number of bases of padding to add to start and end of each gene [default: 100]
-I, --no-fai Don't index --fasta if an index doesn't exist
-C, --no-csi Don't index --vcf if an index doesn't exist
--version <VERSION> Version to use for the index [default: 20230405]
-h, --help Print help (see more with '--help')
Input/Output:
-a, --gff <FILE> Annotation file that will be used to gather information about genes in catalogue
-i, --panel <FILE> Panel/catalogue to build index for
-f, --fasta <FILE> Reference genome in FASTA format (must be indexed with samtools faidx)
-b, --vcf <FILE> An indexed VCF to build the index PRG from. If not provided, then a prebuilt PRG must be given. See `--prebuilt-prg`
-o, --outdir <DIR> Directory to place output [default: .]
-d, --prebuilt-prg <DIR> A prebuilt PRG to use
-r, --rules <FILE> "Expert rules" to be applied in addition to the catalogue
Full usage
$ drprg build --help
Build an index to predict resistance from
Usage: drprg build [OPTIONS] --gff <FILE> --panel <FILE> --fasta <FILE>
Options:
-p, --pandora <FILE>
Path to pandora executable. Will try in src/ext or $PATH if not given
-v, --verbose
Use verbose output
-m, --makeprg <FILE>
Path to make_prg executable. Will try in src/ext or $PATH if not given
-t, --threads <INT>
Maximum number of threads to use
Use 0 to select the number automatically
[default: 1]
-M, --mafft <FILE>
Path to MAFFT executable. Will try in src/ext or $PATH if not given
-B, --bcftools <FILE>
Path to bcftools executable. Will try in src/ext or $PATH if not given
-P, --padding <INT>
Number of bases of padding to add to start and end of each gene
[default: 100]
-l, --match-len <INT>
Minimum number of consecutive characters which must be identical for a match in make_prg
[default: 5]
-N, --max-nesting <INT>
Maximum nesting level when constructing the reference graph with make_prg
[default: 5]
-k, --pandora-k <INT>
Kmer size to use for pandora
[default: 15]
-w, --pandora-w <INT>
Window size to use for pandora
[default: 11]
-I, --no-fai
Don't index --fasta if an index doesn't exist
-C, --no-csi
Don't index --vcf if an index doesn't exist
--version <VERSION>
Version to use for the index
[default: 20230405]
-h, --help
Print help (see a summary with '-h')
Input/Output:
-a, --gff <FILE>
Annotation file that will be used to gather information about genes in catalogue
-i, --panel <FILE>
Panel/catalogue to build index for
-f, --fasta <FILE>
Reference genome in FASTA format (must be indexed with samtools faidx)
-b, --vcf <FILE>
An indexed VCF to build the index PRG from. If not provided, then a prebuilt PRG must be given. See `--prebuilt-prg`
-o, --outdir <DIR>
Directory to place output
[default: .]
-d, --prebuilt-prg <DIR>
A prebuilt PRG to use.
Only build the panel VCF and reference sequences - not the PRG. This directory MUST contain a PRG file named `dr.prg`, along, with a directory called `msas/` that contains an MSA fasta file for each gene `<gene>.fa`. There can optionally also be a pandora index file, but if not, the indexing will be performed by drprg. Note: the PRG is expected to contain the reference sequence for each gene according to the annotation and reference genome given (along with padding) and must be in the forward strand orientation.
-r, --rules <FILE>
"Expert rules" to be applied in addition to the catalogue.
CSV file with blanket rules that describe resistance (or susceptibility). The columns are <variant type>,<gene>,<start>,<end>,<drug(s)>. See the docs for a detailed explanation.