Build an index

The build subcommand is used to build a custom index to predict from.

$ drprg build -a annotation.gff3 -i catalogue.tsv -f ref.fa -o outdir

Required

Catalogue

Also referred to as a "panel" sometimes. This is the catalogue of mutations that confer resistance (or susceptibility). See the Catalogue docs for more details of what format this file must take. Provided via the -i/--panel option.

Reference genome

A FASTA file of the reference genome that will act as the "skeleton" for the reference graph. Provided via the -f/--fasta option.

Annotation

The annotation is a GFF3 file for the species you are building the index for. Provided via the -a/--gff option.

VCF

A VCF to build the reference graph from. See VCF for a detailed description of this file. Provided via the -b/--vcf option. This option is mutually exclusive with --prebuilt-prg.

Optional

Expert rules

These describe a set of rules - rather than a single mutation - that cause resistance. See Expert rules for a detailed description of the rules. Provided via the -r/--rules option.

Padding

Number of bases to add to the start and end of each gene (locus) in the reference graph. Set to 100 by default. Provided via the -P/--padding option.

Prebuilt reference graph

Advanced users only. For those who know how to build their own reference graphs, and would prefer to do that themselves by hand, rather than let DrPRG do it for them. The value passed to this option must be a directory containing:

  • A reference graph file named dr.prg
  • A directory of multiple sequence alignments (/msas/) that the reference graph was built from. There must be a fasta file alignment for each gene in this directory <gene>.fa.
  • An optional pandora index. If not present, one will be created.

The reference graph is expected to contain the reference sequence ( with padding) for each gene according to the annotation and reference sequence provided.

Provided with the -d/--prebuilt-prg option.

Version

The version name for the index. Set to the current date (YYYYMMDD) by default. Provided via the --version option.

Quick usage

$ drprg build -h
Build an index to predict resistance from

Usage: drprg build [OPTIONS] --gff <FILE> --panel <FILE> --fasta <FILE>

Options:
  -v, --verbose            Use verbose output
  -t, --threads <INT>      Maximum number of threads to use [default: 1]
  -P, --padding <INT>      Number of bases of padding to add to start and end of each gene [default: 100]
  -I, --no-fai             Don't index --fasta if an index doesn't exist
  -C, --no-csi             Don't index --vcf if an index doesn't exist
      --version <VERSION>  Version to use for the index [default: 20230405]
  -h, --help               Print help (see more with '--help')

Input/Output:
  -a, --gff <FILE>          Annotation file that will be used to gather information about genes in catalogue
  -i, --panel <FILE>        Panel/catalogue to build index for
  -f, --fasta <FILE>        Reference genome in FASTA format (must be indexed with samtools faidx)
  -b, --vcf <FILE>          An indexed VCF to build the index PRG from. If not provided, then a prebuilt PRG must be given. See `--prebuilt-prg`
  -o, --outdir <DIR>        Directory to place output [default: .]
  -d, --prebuilt-prg <DIR>  A prebuilt PRG to use
  -r, --rules <FILE>        "Expert rules" to be applied in addition to the catalogue

Full usage

$ drprg build --help
Build an index to predict resistance from

Usage: drprg build [OPTIONS] --gff <FILE> --panel <FILE> --fasta <FILE>

Options:
  -p, --pandora <FILE>
          Path to pandora executable. Will try in src/ext or $PATH if not given

  -v, --verbose
          Use verbose output

  -m, --makeprg <FILE>
          Path to make_prg executable. Will try in src/ext or $PATH if not given

  -t, --threads <INT>
          Maximum number of threads to use

          Use 0 to select the number automatically

          [default: 1]

  -M, --mafft <FILE>
          Path to MAFFT executable. Will try in src/ext or $PATH if not given

  -B, --bcftools <FILE>
          Path to bcftools executable. Will try in src/ext or $PATH if not given

  -P, --padding <INT>
          Number of bases of padding to add to start and end of each gene

          [default: 100]

  -l, --match-len <INT>
          Minimum number of consecutive characters which must be identical for a match in make_prg

          [default: 5]

  -N, --max-nesting <INT>
          Maximum nesting level when constructing the reference graph with make_prg

          [default: 5]

  -k, --pandora-k <INT>
          Kmer size to use for pandora

          [default: 15]

  -w, --pandora-w <INT>
          Window size to use for pandora

          [default: 11]

  -I, --no-fai
          Don't index --fasta if an index doesn't exist

  -C, --no-csi
          Don't index --vcf if an index doesn't exist

      --version <VERSION>
          Version to use for the index

          [default: 20230405]

  -h, --help
          Print help (see a summary with '-h')

Input/Output:
  -a, --gff <FILE>
          Annotation file that will be used to gather information about genes in catalogue

  -i, --panel <FILE>
          Panel/catalogue to build index for

  -f, --fasta <FILE>
          Reference genome in FASTA format (must be indexed with samtools faidx)

  -b, --vcf <FILE>
          An indexed VCF to build the index PRG from. If not provided, then a prebuilt PRG must be given. See `--prebuilt-prg`

  -o, --outdir <DIR>
          Directory to place output

          [default: .]

  -d, --prebuilt-prg <DIR>
          A prebuilt PRG to use.

          Only build the panel VCF and reference sequences - not the PRG. This directory MUST contain a PRG file named `dr.prg`, along, with a directory called `msas/` that contains an MSA fasta file for each gene `<gene>.fa`. There can optionally also be a pandora index file, but if not, the indexing will be performed by drprg. Note: the PRG is expected to contain the reference sequence for each gene according to the annotation and reference genome given (along with padding) and must be in the forward strand orientation.

  -r, --rules <FILE>
          "Expert rules" to be applied in addition to the catalogue.

          CSV file with blanket rules that describe resistance (or susceptibility). The columns are <variant type>,<gene>,<start>,<end>,<drug(s)>. See the docs for a detailed explanation.