👩‍⚕Dr. PRG - Drug resistance Prediction with Reference Graphs️👨‍⚕️

Full documentation: https://mbh.sh/drprg/

As the name suggests, Dr. PRG (pronounced "Doctor P-R-G") is a tool for predicting drug resistance from sequencing data. It can be used for any species, provided an index is available for that species. The documentation outlines which species have prebuilt indices and also a guide for how to create your own.

Quick Installation

conda install -c bioconda drprg

Linux is currently the only supported platform; however, there is a Docker container that can be used on other platforms.

See the installation guide for more options.

Quick usage

Download the latest M. tuberculosis prebuilt index

drprg index --download mtb

Predict resistance from an Illumina fastq

drprg predict -x mtb -i reads.fq --illumina -o outdir/

Help

$ drprg -h
Drug Resistance Prediction with Reference Graphs

Usage: drprg [OPTIONS] <COMMAND>

Commands:
  build    Build an index to predict resistance from
  predict  Predict drug resistance
  index    Download and interact with indices
  help     Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose        Use verbose output
  -t, --threads <INT>  Maximum number of threads to use [default: 1]
  -h, --help           Print help (see more with '--help')
  -V, --version        Print version

Citation

Hall MB, Lima L, Coin LJM, Iqbal Z (2023) Drug resistance prediction for Mycobacterium tuberculosis with reference graphs. Microbial Genomics 9:001081. doi: 10.1099/mgen.0.001081

@article{hall_drug_2023,
	title = {Drug resistance prediction for {Mycobacterium} tuberculosis with reference graphs},
	volume = {9},
	copyright = {All rights reserved},
	issn = {2057-5858},
	url = {https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.001081},
	doi = {10.1099/mgen.0.001081},
	number = {8},
	journal = {Microbial Genomics},
	author = {Hall, Michael B. and Lima, Leandro and Coin, Lachlan J. M. and Iqbal, Zamin},
	year = {2023},
	pages = {001081},
}

Installation

Linux is currently the only supported platform (due to pandora). For other platforms, the Docker container is the only option.

Conda

conda install -c bioconda drprg

Container

A Docker container is available for all commits/branches/versions. To view the available tags, visit https://quay.io/repository/mbhall88/drprg?tab=tags

For example, to use the latest commit on the main branch, the URI is

$ TAG="latest"
$ URI="quay.io/mbhall88/drprg:$TAG"

Docker

To run drprg using the above container with Docker

$ docker pull "$URI"
$ docker run -it "$URI" drprg --help

Singularity

To run drprg using the above container with Singularity

$ singularity exec "docker://$URI" drprg --help

Prebuilt binary

If you use the prebuilt binary, you must have the external dependecies installed separately.

curl -sSL drprg.mbh.sh | sh
# or with wget
wget -nv -O - drprg.mbh.sh | sh

You can also pass options to the script like so

$ curl -sSL drprg.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of drprg, if drprg is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/drprg/releases]

        -h, --help
                Display this help message

Cargo

If installing via cargo, you must have the external dependecies installed separately.

$ cargo install drprg

Local

Minimum supported Rust version: 1.65.0

$ cargo build --release
$ target/release/drprg -h

Dependencies

drprg relies on:

You can install the dependencies using the provided justfile

# all dependencies
$ just deps

# pandora only
$ just pandora

# make_prg only
$ just makeprg

# mafft only
$ just mafft

# bcftools only
$ just bcftools

By default, the external dependencies will be downloaded to src/ext. This can be changed by specifying a path to EXTDIR when installing the external dependencies.

$ just deps EXTDIR="some/other/dir"

Download prebuilt index

The index subcommand can be used to download or list available indices.

To download the latest index version for all species

drprg index --download
# is the same as
drprg index --download all

If you only want the latest for a particular species

drprg index --download mtb
# is the same as
drprg index --download mtb@latest

Available indices

If you want to see what indices are available in the first place

drprg index --list
+--------------+---------+----------+------------+
| Name         | Species | Version  | Downloaded |
+--------------+---------+----------+------------+
| mtb@20230308 | mtb     | 20230308 | N          |
+--------------+---------+----------+------------+

You can specify a version with the following syntax

drprg index --download mtb@<version>

where <version> is the version you want. If no version is provided, latest is used.

To see more information about each prebuilt index, head over to https://github.com/mbhall88/drprg-index.

Output directory

By default, indices are stored in $HOME/.drprg/, nested as species/species-version. So downloading version 20230308 of species mtb will produce an index at $HOME/.drprg/mtb/mtb-20230308/.

You can change the default output directory ($HOME/.drprg/) with the --outdir option. If you do this, you'll need to pass the full path to the index when using predict.

Full usage

$ drprg index --help
Download and interact with indices

Usage: drprg index [OPTIONS] [NAME]

Arguments:
  [NAME]
          The name/path of the index to download

          [default: all]

Options:
  -d, --download
          Download a prebuilt index

  -v, --verbose
          Use verbose output

  -l, --list
          List all available (and downloaded) indices

  -t, --threads <INT>
          Maximum number of threads to use

          Use 0 to select the number automatically

          [default: 1]

  -o, --outdir <DIR>
          Index directory

          Use this if your indices are not in a default location, or you want to download them to a non-default location

          [default: $HOME/.drprg/]

  -F, --force
          Overwrite any existing indices

  -h, --help
          Print help (see a summary with '-h')

Predict

The predict subcommand is used to predict resistance for a sample from an index.

At its simplest

drprg predict -i reads.fq -x mtb -o outdir

drprg is a bit "new-age" in that it assumes the reads are Nanopore. If they're Illumina, use the -I/--illumina option.

See Prediction Output documentation for a detailed description of what results/output files and formats to expect.

Required

Index

The index is provided via the -x/--index option. It can either be a path to an index, or the name of a downloaded index. As with the index subcommmand, you can specify a version if you don't want to use the latest.

Input reads

A fastq (or fasta) file of the reads you want to predict resistance from - provided via the -i/--input option. If you have paired reads in two files, simply combine them and pass the combined file - interleave order doesn't matter. For example

cat r1.fq r2.fq > combined.fq
drprg predict -i combined.fq ...

gzip-compressed files are also accepted.

Optional

Sample name

Identifier to use for your output files. By default, it will be set to the file name prefix (e.g. name for a fastq named name.fq.gz). Provided via the -s/--sample option.

Minimum allele frequency

Provided via the -f/--maf option. If an alternate allele has at least this fraction of the depth, a minor resistance ("r") prediction is made. By default, this is set to 1.0 for Nanopore data (i.e. minor allele detection is off) and 0.1 when using the --illumina option. For example, if a variant is called as the reference allele for Illumina reads, but an alternate allele has more than 10% of the depth on that position, a minor resistance call is made for the alternate allele.

Ignore synonymous

Using the -S/--ignore-synonymous option will prevent synonymous mutations from appearing as unknown resistance calls. However, any synonymous mutations in the catalogue will still be considered.

Quick usage

$ drprg predict -h
Predict drug resistance

Usage: drprg predict [OPTIONS] --index <DIR> --input <FILE>

Options:
  -v, --verbose        Use verbose output
  -t, --threads <INT>  Maximum number of threads to use [default: 1]
  -h, --help           Print help (see more with '--help')

Input/Output:
  -x, --index <DIR>      Name of a downloaded index or path to an index
  -i, --input <FILE>     Reads to predict resistance from
  -o, --outdir <DIR>     Directory to place output [default: .]
  -s, --sample <SAMPLE>  Identifier to use for the sample
  -I, --illumina         Sample reads are from Illumina sequencing

Filter:
  -S, --ignore-synonymous     Ignore unknown (off-catalogue) variants that cause a synonymous substitution
  -f, --maf <FLOAT[0.0-1.0]>  Minimum allele frequency to call variants [default: 1]

Full usage

$ drprg predict --help
Predict drug resistance

Usage: drprg predict [OPTIONS] --index <DIR> --input <FILE>

Options:
  -p, --pandora <FILE>
          Path to pandora executable. Will try in src/ext or $PATH if not given

  -v, --verbose
          Use verbose output

  -m, --makeprg <FILE>
          Path to make_prg executable. Will try in src/ext or $PATH if not given

  -t, --threads <INT>
          Maximum number of threads to use

          Use 0 to select the number automatically

          [default: 1]

  -M, --mafft <FILE>
          Path to MAFFT executable. Will try in src/ext or $PATH if not given

  -h, --help
          Print help (see a summary with '-h')

Input/Output:
  -x, --index <DIR>
          Name of a downloaded index or path to an index

  -i, --input <FILE>
          Reads to predict resistance from

          Both fasta and fastq are accepted, along with compressed or uncompressed.

  -o, --outdir <DIR>
          Directory to place output

          [default: .]

  -s, --sample <SAMPLE>
          Identifier to use for the sample

          If not provided, this will be set to the input reads file path prefix

  -I, --illumina
          Sample reads are from Illumina sequencing

Filter:
  -S, --ignore-synonymous
          Ignore unknown (off-catalogue) variants that cause a synonymous substitution

  -f, --maf <FLOAT[0.0-1.0]>
          Minimum allele frequency to call variants

          If an alternate allele has at least this fraction of the depth, a minor resistance ("r") prediction is made. Set to 1 to disable. If --illumina is passed, the default is 0.1

          [default: 1]

      --debug
          Output debugging files. Mostly for development purposes

  -d, --min-covg <INT>
          Minimum depth of coverage allowed on variants

          [default: 3]

  -D, --max-covg <INT>
          Maximum depth of coverage allowed on variants

          [default: 2147483647]

  -b, --min-strand-bias <FLOAT>
          Minimum strand bias ratio allowed on variants

          For example, setting to 0.25 requires >=25% of total (allele) coverage on both strands for an allele.

          [default: 0.01]

  -g, --min-gt-conf <FLOAT>
          Minimum genotype confidence (GT_CONF) score allow on variants

          [default: 0]

  -L, --max-indel <INT>
          Maximum (absolute) length of insertions/deletions allowed

  -K, --min-frs <FLOAT>
          Minimum fraction of read support

          For example, setting to 0.9 requires >=90% of coverage for the variant to be on the called allele

          [default: 0]

Prediction Output

Inside the predict output directory (-o/--outdir) you'll find a collection of files and directories. We outline the files that are likely to be of interest to most users.

Prediction JSON

This is <sample>.drprg.json. The value given to the -s/--sample option dictates the name - i.e. <sample>.drprg.json.

This is the only file most users will want/need to interact with. It contains the resistance prediction for each drug in the index's catalogue.

Example

This is a trimmmed (toy) example JSON output for a sample

{
  "genes": {
    "absent": [
      "ahpC"
    ],
    "present": [
      "embA",
      "embB",
      "ethA",
      "fabG1",
      "gid",
      "gyrA",
      "gyrB",
      "inhA",
      "katG",
      "pncA",
      "rpoB",
      "rrs"
    ]
  },
  "sample": "toy",
  "susceptibility": {
    "Amikacin": {
      "evidence": [
        {
          "gene": "rrs",
          "residue": "DNA",
          "variant": "A1401X",
          "vcfid": "b815ed3f"
        }
      ],
      "predict": "F"
    },
    "Ethambutol": {
      "evidence": [
        {
          "gene": "embB",
          "residue": "PROT",
          "variant": "M306I",
          "vcfid": "a290b118"
        }
      ],
      "predict": "r"
    },
    "Ethionamide": {
      "evidence": [
        {
          "gene": "ethA",
          "residue": "PROT",
          "variant": "A381P",
          "vcfid": "169f75d4"
        }
      ],
      "predict": "U"
    },
    "Isoniazid": {
      "evidence": [
        {
          "gene": "fabG1",
          "residue": "DNA",
          "variant": "G-17T",
          "vcfid": "de9b689e"
        },
        {
          "gene": "katG",
          "residue": "PROT",
          "variant": "S315T",
          "vcfid": "acaa8ca2"
        }
      ],
      "predict": "R"
    },
    "Levofloxacin": {
      "evidence": [],
      "predict": "S"
    }
  },
  "version": {
    "drprg": "0.1.1",
    "index": "20230308"
  }
}

The keys of the JSON are

genes: This contains a list of genes in the index reference graph which are present and absent
sample: The value passed to the -s/--sample option
susceptibility: The keys of this entry are the drugs in the index catalogue. Each drug's entry contains evidence supporting the value in the predict section.

Predict

The predict entry for a drug is the resistance prediction for the sample. Possible values are

S: susceptible. This is the "default" prediction. If no mutations are detected for the sample, it is assumed to be susceptible
F: failed. Genotyping failed for one or more mutations for this drug. See the prediction VCF for more information
U: unknown. One or more mutations that are not present in the index catalogue were detected in a gene associated with this drug
R: resistant. One or more mutations from the index catalogue that confer resistance were detected
u or r: The same as the uppercase versions, but the mutation(s) were detected in a minor allele.

Evidence

This is a list of the mutations supporting the prediction. The residue is one of DNA or PROT indicating whether the mutation describes a nucleotide or amino acid change, respectively.

The variant is of the form <ref><pos><alt>; where <ref> is the reference sequence at position <pos> and <alt> is the nucleotide/amino acid the reference is changed to. See the catalogue docs for more information.

The vcfid is the value in the VCF ID column for this mutation, making it easier to find a mutation in the VCF.

Prediction VCF

This is <sample>.drprg.bcf. As this file is a BCF, you will need to use bcftools to view it - e.g. bcftools view sample.drprg.bcf.

You should only need to interact with this file if you want further information about the exact evidence supporting a mutation being called, why a mutation was called as failed (F), or why a mutation wasn't called. For those mutations in the JSON, you can easily look them up using the vcfid, which can be found in the ID (third) column of the BCF. Or you can just use grep. For example, to look up the Isoniazid S315T mutation in katG from the example you can use

$ bcftools view sample.drprg.bcf | grep acaa8ca2
katG    1044    acaa8ca2        GC      AC,CA,CC        .       PASS    VC=PH_SNPs;GRAPHTYPE=SIMPLE;PDP=0,0.0123457,0,0.987654;VARID=katG_S315T;PREDICT=R       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      3:0.1.1,42:0,0,0,38:0.1.1,42:0,0,0,38:0.1.1,127:0,0,0,116:1,1,1,0:-523.019,-514.096,-523.019,-7.87925:506.217

All INFO and FORMAT fields are defined in the header of the BCF file. We recommend reading the VCF/BCF specifications for help with how to interpret data in a VCF file.

Build an index

The build subcommand is used to build a custom index to predict from.

$ drprg build -a annotation.gff3 -i catalogue.tsv -f ref.fa -o outdir

Required

Catalogue

Also referred to as a "panel" sometimes. This is the catalogue of mutations that confer resistance (or susceptibility). See the Catalogue docs for more details of what format this file must take. Provided via the -i/--panel option.

Reference genome

A FASTA file of the reference genome that will act as the "skeleton" for the reference graph. Provided via the -f/--fasta option.

Annotation

The annotation is a GFF3 file for the species you are building the index for. Provided via the -a/--gff option.

VCF

A VCF to build the reference graph from. See VCF for a detailed description of this file. Provided via the -b/--vcf option. This option is mutually exclusive with --prebuilt-prg.

Optional

Expert rules

These describe a set of rules - rather than a single mutation - that cause resistance. See Expert rules for a detailed description of the rules. Provided via the -r/--rules option.

Padding

Number of bases to add to the start and end of each gene (locus) in the reference graph. Set to 100 by default. Provided via the -P/--padding option.

Prebuilt reference graph

Advanced users only. For those who know how to build their own reference graphs, and would prefer to do that themselves by hand, rather than let DrPRG do it for them. The value passed to this option must be a directory containing:

A reference graph file named dr.prg
A directory of multiple sequence alignments (/msas/) that the reference graph was built from. There must be a fasta file alignment for each gene in this directory <gene>.fa.
An optional pandora index. If not present, one will be created.

The reference graph is expected to contain the reference sequence ( with padding) for each gene according to the annotation and reference sequence provided.

Provided with the -d/--prebuilt-prg option.

Version

The version name for the index. Set to the current date (YYYYMMDD) by default. Provided via the --version option.

Quick usage

$ drprg build -h
Build an index to predict resistance from

Usage: drprg build [OPTIONS] --gff <FILE> --panel <FILE> --fasta <FILE>

Options:
  -v, --verbose            Use verbose output
  -t, --threads <INT>      Maximum number of threads to use [default: 1]
  -P, --padding <INT>      Number of bases of padding to add to start and end of each gene [default: 100]
  -I, --no-fai             Don't index --fasta if an index doesn't exist
  -C, --no-csi             Don't index --vcf if an index doesn't exist
      --version <VERSION>  Version to use for the index [default: 20230405]
  -h, --help               Print help (see more with '--help')

Input/Output:
  -a, --gff <FILE>          Annotation file that will be used to gather information about genes in catalogue
  -i, --panel <FILE>        Panel/catalogue to build index for
  -f, --fasta <FILE>        Reference genome in FASTA format (must be indexed with samtools faidx)
  -b, --vcf <FILE>          An indexed VCF to build the index PRG from. If not provided, then a prebuilt PRG must be given. See `--prebuilt-prg`
  -o, --outdir <DIR>        Directory to place output [default: .]
  -d, --prebuilt-prg <DIR>  A prebuilt PRG to use
  -r, --rules <FILE>        "Expert rules" to be applied in addition to the catalogue

Full usage

$ drprg build --help
Build an index to predict resistance from

Usage: drprg build [OPTIONS] --gff <FILE> --panel <FILE> --fasta <FILE>

Options:
  -p, --pandora <FILE>
          Path to pandora executable. Will try in src/ext or $PATH if not given

  -v, --verbose
          Use verbose output

  -m, --makeprg <FILE>
          Path to make_prg executable. Will try in src/ext or $PATH if not given

  -t, --threads <INT>
          Maximum number of threads to use

          Use 0 to select the number automatically

          [default: 1]

  -M, --mafft <FILE>
          Path to MAFFT executable. Will try in src/ext or $PATH if not given

  -B, --bcftools <FILE>
          Path to bcftools executable. Will try in src/ext or $PATH if not given

  -P, --padding <INT>
          Number of bases of padding to add to start and end of each gene

          [default: 100]

  -l, --match-len <INT>
          Minimum number of consecutive characters which must be identical for a match in make_prg

          [default: 5]

  -N, --max-nesting <INT>
          Maximum nesting level when constructing the reference graph with make_prg

          [default: 5]

  -k, --pandora-k <INT>
          Kmer size to use for pandora

          [default: 15]

  -w, --pandora-w <INT>
          Window size to use for pandora

          [default: 11]

  -I, --no-fai
          Don't index --fasta if an index doesn't exist

  -C, --no-csi
          Don't index --vcf if an index doesn't exist

      --version <VERSION>
          Version to use for the index

          [default: 20230405]

  -h, --help
          Print help (see a summary with '-h')

Input/Output:
  -a, --gff <FILE>
          Annotation file that will be used to gather information about genes in catalogue

  -i, --panel <FILE>
          Panel/catalogue to build index for

  -f, --fasta <FILE>
          Reference genome in FASTA format (must be indexed with samtools faidx)

  -b, --vcf <FILE>
          An indexed VCF to build the index PRG from. If not provided, then a prebuilt PRG must be given. See `--prebuilt-prg`

  -o, --outdir <DIR>
          Directory to place output

          [default: .]

  -d, --prebuilt-prg <DIR>
          A prebuilt PRG to use.

          Only build the panel VCF and reference sequences - not the PRG. This directory MUST contain a PRG file named `dr.prg`, along, with a directory called `msas/` that contains an MSA fasta file for each gene `<gene>.fa`. There can optionally also be a pandora index file, but if not, the indexing will be performed by drprg. Note: the PRG is expected to contain the reference sequence for each gene according to the annotation and reference genome given (along with padding) and must be in the forward strand orientation.

  -r, --rules <FILE>
          "Expert rules" to be applied in addition to the catalogue.

          CSV file with blanket rules that describe resistance (or susceptibility). The columns are <variant type>,<gene>,<start>,<end>,<drug(s)>. See the docs for a detailed explanation.

VCF to build reference graph

The input VCF (-b/--vcf) for build has some very particular requirements. If you aren't very familiar with the formal VCF specifications, please read them first.

This VCF file is what drprg uses to construct the reference graph from. Variants from it are applied to the skeleton reference genome. It represents the population variation you would like to build into the reference graph. For a greater understanding of these concepts, we recommend reading this paper.

To add variants from multiple samples, simply use a multi-sample VCF.

The CHROM field should be the gene name. As such, the POS is with respect to the gene. It's important to ensure the POS takes into account the padding that will be used when running build. A helper script for converting a VCF from reference genome coordinates to gene coordinates can be found here. In addition, drprg will raise an error if the coordinates are incorrect with respect to the annotation and reference genome you provide.

Example from Mycobacterium tuberculosis

We provide an example of how the VCF for the work on M. tuberculosis (MTB) was generated.

Call variants for population

The first step is to select the population of samples you want variation from. For the MTB reference graph, we took a subset of the 15,211 global MTB isolates published by the CRyPTIC consortium. Briefly, the 15,211 samples were individually variant-called using clockwork and then joint-genotyped using minos. The "wide" VCF obtained at the end contains genotypes for all 15,211 samples against all variants discovered across the population.

Note: the reference genome you call these variants against must be the same the skeleton reference.

(Optional) subset population

In our running example, we have 15,211 samples in our VCF. This is far too much to be useful for building a reference graph. This paper illustrates the problem nicely. In addition, Figure 3.2 and Sections 3.6.4 and 3.6.5 in this thesis show that using too many samples does not provide any benefit to variant precision or recall, but costs increased memory usage and CPU time.

To create a representative subset, we split the VCF file into separate lineage VCFs, where all samples in a VCF are of the same lineage. We randomly chose 20 samples from each lineage 1 through 4, as well as 20 samples from all other lineages combined. In addition, we included 17 clinical samples representing MTB global diversity (lineages 1-6) to give a total of 117 samples.

We generated the subsets by listing all samples in the VCF file

bcftools query --list-samples > samples.txt

and then splitting this list by lineage based on lineage classifications. If we have a list of Lineage 1 samples in a text file, we can get 20 random samples from it with

shuf -n 20 L1.txt > L1.subsampled.txt

we then combine all of these text files of sample names into a single text file, extract the 117 samples, and filter the remaining variants with

bcftools view -S samples.subsampled.txt wide.vcf.gz |
  bcftools view -a -U -c=1:nref -o MTB.vcf.gz

-a trims ALT alleles not seen in the genotype fields, -U excludes sites without a called genotype, and -c=1:nref removes positions there is no non-reference alleles.

You can find the resulting VCF (MTB.vcf.gz) we generated here.

(Optional) Manually add "orphan" mutations

We also manually added ~50 mutations that were commonly found in minor frequencies but were not contained in the population VCF.

This is done by creating a file of mutations of the form <gene>_<mutation> - this is effectively column 1 and 2 from the catalogue separated by an underscore. For example

rpoB_I491F
rplC_C154R
ddn_L49P
gid_Q125*
rrs_g1484t
embB_L74R

the difference between this and the catalogue though is that DNA mutations should have the nucleotides in lowercase (e.g. rrs_g1484t). A VCF can be generated for this list of mutations using this script.

python create_orphan_mutations.py -r ref.fa -m orphan_mutations.txt \
  -g annotation.gff -o orphan.vcf

We then merge this VCF of "orphan" mutations into the population VCF with

bcftools merge MTB.vcf.gz orphan.vcf |
  bcftools norm -c e -f reference.fa -o merged.vcf

-c e just tells bcftools norm to error if the REF alleles in the VCFs do not match the reference.

Extract catalogue genes from VCF

DrPRG expects the VCF CHROM field to be the name of the gene, rather than the reference contig/chromosome name - which is what we currently have. As such, the POS must be with respect to the gene. It's important to ensure the POS takes into account the padding that will be used when running build.

A helper script for converting a VCF from reference genome coordinates to gene coordinates can be found here. In addition, drprg will raise an error if the coordinates are incorrect with respect to the annotation and reference genome you provide. By providing this script with the catalogue/panel you will use in build, it will extract only those variants that fall within the genes in your catalogue.

python extract_panel_genes_from_vcf.py --padding 100 \
  -i catalogue.tsv -g annotation.gff --vcf merged.vcf -o final.vcf

final.vcf can now be used as the input VCF for build.

Catalogue of mutations

Also referred to as a "panel" sometimes. This is the catalogue of mutations that confer resistance (or susceptibility). Provided via the -i/--panel option.

This file is a tab-delimited (TSV) file with four columns

The gene name of the mutation
The mutation in the form <ref><pos><alt>, where <ref> is the reference nucleotide or amino acid, <pos> is the 1-based position in the gene/protein, and <alt> is the nucleotide or amino acid the reference is changed to. <pos> is with respect to the type of mutation - i.e. if the mutation is a amino acid change, the <pos> must be position within the protein (codon position). Promoter mutations can be specied with a negative symbol - e.g. -10 means 10 nucleotide before the first position in the gene. One-letter codes are to be used for amino acids.
The residue type - DNA (nucleotide change) or PROT (amino acid change).
Comma-separated list of the drug(s) the mutation confers resistance to. Use NONE if the mutation is not associated with resistance.

Example

pncA    TCG196TAG   DNA Pyrazinamide
pncA    T142R   PROT    Pyrazinamide
tlyA    S159*   PROT    Capreomycin
embB    M306N   PROT    Ethambutol
rpoB    C1275CCA    DNA Rifampicin
gid R137P   PROT    Streptomycin
gid R118S   PROT    Streptomycin
tlyA    G123GC  DNA Capreomycin
pncA    G97D    PROT    Pyrazinamide
fabG1   C-15X   DNA Ethionamide,Isoniazid

Expert rules

These are blanket rules that describe resistance (or susceptibility). The file is a CSV with each row representing a rule and is passed to drprg build via the --rules option. The format of each row is

vartype,gene,start,end,drug

vartype: the variant type of the rule. Supported types are:
- frameshift - Any insertion or deletion whose length is not a multiple of three
- missense - A DNA change that results in a different amino acid
- nonsense - A DNA change that results in a stop codon instead of an amino acid
- absence - Gene is absent
gene: the name of the gene the rule applies to
start: An optional start position for the rule to apply from. The position is in codon coordinates where the rule applies to amino acid changes and is 1-based inclusive. If not provided, the start of the gene is inferred. If you want to include the upstream (promoter) region of the gene, use negative coordinates.
end: An optional end position for the rule to apply to. The position is in codon coordinates where the rule applies to amino acid changes and is 1-based inclusive. If not provided, the end of the gene is inferred.
drug: A semi-colon-delimited (;) list of drugs the rule impacts. If the rule confers susceptibility, use NONE for this column.

If there are certain rules you need for your species-of-interest, raise an issue, and we can look at implementing it.

Example

This is an example of the M. tuberculosis expert rules file used in our paper.

missense,rpoB,426,452,Rifampicin
nonsense,rpoB,426,452,Rifampicin
frameshift,rpoB,1276,1356,Rifampicin
nonsense,katG,,,Isoniazid
frameshift,katG,,,Isoniazid
absence,katG,,,Isoniazid
nonsense,ethA,,,Ethionamide
frameshift,ethA,,,Ethionamide
absence,ethA,,,Ethionamide
nonsense,gid,,,Streptomycin
frameshift,gid,,,Streptomycin
absence,gid,,,Streptomycin
nonsense,pncA,,,Pyrazinamide
frameshift,pncA,,,Pyrazinamide
absence,pncA,,,Pyrazinamide
missense,katG,315,315,Isoniazid
missense,gid,125,125,Streptomycin
missense,rpoB,425,425,Rifampicin
missense,gid,136,136,Streptomycin

The row

frameshift,pncA,,,Pyrazinamide

says that a frameshift anywhere within the pncA gene will cause resistance to Pyrazinamide

nonsense,rpoB,426,452,Rifampicin
frameshift,rpoB,1276,1356,Rifampicin

these two rules illustrate the context of the start and end coordinates. In the first row, we say that any nonsense mutation between 426 and 452 in rpoB causes resistance to Rifampicin. As nonsense mutations only apply to amino acid changes, the coordinates are in codon-space. Whereas the second row describes a frameshift, which only applies to nucleotides; therefore, 1276 and 1356 are in bases-space (i.e. the 1276th nucleotide/base). (As an aside, these two rules both apply to the same region - the RRDR)

missense,katG,315,315,Isoniazid

describes any missense mutation at position 315 in katG causing isoniazid resistance.

Contributing

If you would like to contribute to Dr. PRG, feel free to open a pull request with suggested changes, or even raise an issue to discuss potential contributions.

If you would like to contribute a prebuilt index for a species, head on over to https://github.com/mbhall88/drprg-index

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

0.1.1 - 2023-04-06

Fixed

CI issue that meant no binaries were built

0.1.0 - 2023-04-06

First release! Everything is new!