👩‍⚕️ Dr. PRG - Drug resistance Prediction with Reference Graphs 👨‍⚕️
Full documentation: https://mbh.sh/drprg/
As the name suggests, Dr. PRG (pronounced "Doctor P-R-G") is a tool for predicting drug resistance from sequencing data. It can be used for any species, provided an index is available for that species. The documentation lists the species with prebuilt indices and includes a guide to building your own.
Quick Installation
conda install -c bioconda drprg
Linux is currently the only supported platform; however, there is a Docker container that can be used on other platforms.
See the installation guide for more options.
Quick usage
Download the latest M. tuberculosis prebuilt index
drprg index --download mtb
Predict resistance from an Illumina fastq
drprg predict -x mtb -i reads.fq --illumina -o outdir/
Help
$ drprg -h
Drug Resistance Prediction with Reference Graphs
Usage: drprg [OPTIONS] <COMMAND>
Commands:
build Build an index to predict resistance from
predict Predict drug resistance
index Download and interact with indices
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose Use verbose output
-t, --threads <INT> Maximum number of threads to use [default: 1]
-h, --help Print help (see more with '--help')
-V, --version Print version
Citation
Hall MB, Lima L, Coin LJM, Iqbal Z (2023) Drug resistance prediction for Mycobacterium tuberculosis with reference graphs. Microbial Genomics 9:001081. doi: 10.1099/mgen.0.001081
@article{hall_drug_2023,
title = {Drug resistance prediction for {Mycobacterium} tuberculosis with reference graphs},
volume = {9},
copyright = {All rights reserved},
issn = {2057-5858},
url = {https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.001081},
doi = {10.1099/mgen.0.001081},
number = {8},
journal = {Microbial Genomics},
author = {Hall, Michael B. and Lima, Leandro and Coin, Lachlan J. M. and Iqbal, Zamin},
year = {2023},
pages = {001081},
}
Installation
Linux is currently the only supported platform (due to pandora). For other platforms, the Docker container is the only option.
Conda
conda install -c bioconda drprg
Container
A Docker container is available for all commits/branches/versions. To view the available tags, visit https://quay.io/repository/mbhall88/drprg?tab=tags
For example, to use the latest commit on the main branch, the URI is
$ TAG="latest"
$ URI="quay.io/mbhall88/drprg:$TAG"
Docker
To run drprg using the above container with Docker
$ docker pull "$URI"
$ docker run -it "$URI" drprg --help
Singularity
To run drprg using the above container with Singularity
$ singularity exec "docker://$URI" drprg --help
Prebuilt binary
If you use the prebuilt binary, you must have the external dependencies installed separately.
curl -sSL drprg.mbh.sh | sh
# or with wget
wget -nv -O - drprg.mbh.sh | sh
You can also pass options to the script like so
$ curl -sSL drprg.mbh.sh | sh -s -- --help
install.sh [option]
Fetch and install the latest version of drprg, if drprg is already
installed it will be updated to the latest version.
Options
-V, --verbose
Enable verbose output for the installer
-f, -y, --force, --yes
Skip the confirmation prompt during installation
-p, --platform
Override the platform identified by the installer
-b, --bin-dir
Override the bin installation directory [default: /usr/local/bin]
-a, --arch
Override the architecture identified by the installer [default: x86_64]
-B, --base-url
Override the base URL used for downloading releases [default: https://github.com/mbhall88/drprg/releases]
-h, --help
Display this help message
Cargo
If installing via cargo, you must have the external dependencies installed separately.
$ cargo install drprg
Local
Minimum supported Rust version: 1.65.0
$ cargo build --release
$ target/release/drprg -h
Dependencies
drprg relies on pandora, make_prg, MAFFT, and bcftools.
You can install the dependencies using the provided justfile
# all dependencies
$ just deps
# pandora only
$ just pandora
# make_prg only
$ just makeprg
# mafft only
$ just mafft
# bcftools only
$ just bcftools
By default, the external dependencies will be downloaded to src/ext. This can be changed by specifying a path to EXTDIR when installing the external dependencies.
$ just deps EXTDIR="some/other/dir"
Download prebuilt index
The index subcommand can be used to download or list available indices.
To download the latest index version for all species
drprg index --download
# is the same as
drprg index --download all
If you only want the latest for a particular species
drprg index --download mtb
# is the same as
drprg index --download mtb@latest
Available indices
If you want to see what indices are available in the first place
drprg index --list
+--------------+---------+----------+------------+
| Name | Species | Version | Downloaded |
+--------------+---------+----------+------------+
| mtb@20230308 | mtb | 20230308 | N |
+--------------+---------+----------+------------+
You can specify a version with the following syntax
drprg index --download mtb@<version>
where <version> is the version you want. If no version is provided, latest is used.
To see more information about each prebuilt index, head over to https://github.com/mbhall88/drprg-index.
Output directory
By default, indices are stored in $HOME/.drprg/, nested as species/species-version. So downloading version 20230308 of species mtb will produce an index at $HOME/.drprg/mtb/mtb-20230308/.
You can change the default output directory ($HOME/.drprg/) with the --outdir option. If you do this, you'll need to pass the full path to the index when using predict.
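As a rough illustration of the naming scheme above, the following Python sketch maps a name@version string to the directory where the index would be expected. The helper names parse_index_spec and index_dir are hypothetical and are not part of drprg. The sketch also does not know which version latest points to, so it refuses to guess.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_index_spec(spec: str) -> Tuple[str, str]:
    """Split 'species@version'; a missing version is treated as 'latest'."""
    species, _, version = spec.partition("@")
    if not species:
        raise ValueError(f"no species in index spec: {spec!r}")
    return species, version or "latest"


def index_dir(spec: str, root: Optional[Path] = None) -> Path:
    """Expected index location, nested as <root>/<species>/<species>-<version>."""
    if root is None:
        # Default location described in the documentation: $HOME/.drprg/
        root = Path.home() / ".drprg"
    species, version = parse_index_spec(spec)
    if version == "latest":
        # 'latest' is only an alias; resolve it first (e.g. with `drprg index --list`).
        raise ValueError("resolve 'latest' to a concrete version first")
    return root / species / f"{species}-{version}"
```

For example, index_dir("mtb@20230308") points at the mtb/mtb-20230308 directory under $HOME/.drprg/. If you downloaded with --outdir, pass that directory as root.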
Full usage
$ drprg index --help
Download and interact with indices
Usage: drprg index [OPTIONS] [NAME]
Arguments:
[NAME]
The name/path of the index to download
[default: all]
Options:
-d, --download
Download a prebuilt index
-v, --verbose
Use verbose output
-l, --list
List all available (and downloaded) indices
-t, --threads <INT>
Maximum number of threads to use
Use 0 to select the number automatically
[default: 1]
-o, --outdir <DIR>
Index directory
Use this if your indices are not in a default location, or you want to download them to a non-default location
[default: $HOME/.drprg/]
-F, --force
Overwrite any existing indices
-h, --help
Print help (see a summary with '-h')
Predict
The predict subcommand is used to predict resistance for a sample from an index.
At its simplest
drprg predict -i reads.fq -x mtb -o outdir
drprg is a bit "new-age" in that it assumes the reads are Nanopore. If they're Illumina, use the -I/--illumina option.
See Prediction Output documentation for a detailed description of what results/output files and formats to expect.
Required
Index
The index is provided via the -x/--index option. It can either be a path to an index, or the name of a downloaded index. As with the index subcommand, you can specify a version if you don't want to use the latest.
Input reads
A fastq (or fasta) file of the reads you want to predict resistance from - provided via the -i/--input option. If you have paired reads in two files, simply combine them and pass the combined file - interleave order doesn't matter. For example
cat r1.fq r2.fq > combined.fq
drprg predict -i combined.fq ...
gzip-compressed files are also accepted.
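If you prefer to combine the files from a script rather than with cat, a minimal Python sketch follows. The function names are hypothetical. It copies bytes unchanged, which also works for gzip files because concatenated gzip members form a valid gzip stream. It refuses to mix compressed and uncompressed inputs, since that would produce an unreadable file.

```python
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def is_gzipped(path) -> bool:
    """Check the two-byte gzip magic number at the start of the file."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def combine_reads(inputs, output) -> None:
    """Concatenate read files byte-for-byte, like `cat r1.fq r2.fq > combined.fq`."""
    inputs = list(inputs)
    if len({is_gzipped(p) for p in inputs}) > 1:
        raise ValueError("inputs mix gzip-compressed and uncompressed files")
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Calling combine_reads(["r1.fq", "r2.fq"], "combined.fq") gives the same file as the cat command above.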
Optional
Sample name
Identifier to use for your output files. By default, it will be set to the file name prefix (e.g. name for a fastq named name.fq.gz). Provided via the -s/--sample option.
Minimum allele frequency
Provided via the -f/--maf option. If an alternate allele has at least this fraction of the depth, a minor resistance ("r") prediction is made. By default, this is set to 1.0 for Nanopore data (i.e. minor allele detection is off) and 0.1 when using the --illumina option. For example, if a variant is called as the reference allele for Illumina reads, but an alternate allele has at least 10% of the depth at that position, a minor resistance call is made for the alternate allele.
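The threshold logic can be pictured with a small Python sketch. This is an illustration of the rule as described above, not drprg's implementation: the function name, the use of a simple per-allele depth list, and the way the fraction is computed are all assumptions.

```python
def minor_alleles(depths, called, maf):
    """Return indices of uncalled alternate alleles with depth fraction >= maf.

    depths: read depth per allele, where index 0 is the reference allele
    called: index of the allele that was genotyped
    maf:    minimum allele frequency; 1.0 disables minor allele detection
    """
    total = sum(depths)
    if maf >= 1.0 or total == 0:
        return []
    return [
        i
        for i, depth in enumerate(depths)
        if i not in (0, called) and depth / total >= maf
    ]
```

With depths of 85 (reference) and 15 (alternate), a reference call and a threshold of 0.1, the alternate allele is reported. With the Nanopore default of 1.0, nothing is reported.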
Ignore synonymous
Using the -S/--ignore-synonymous option will prevent synonymous mutations from appearing as unknown resistance calls. However, any synonymous mutations in the catalogue will still be considered.
Quick usage
$ drprg predict -h
Predict drug resistance
Usage: drprg predict [OPTIONS] --index <DIR> --input <FILE>
Options:
-v, --verbose Use verbose output
-t, --threads <INT> Maximum number of threads to use [default: 1]
-h, --help Print help (see more with '--help')
Input/Output:
-x, --index <DIR> Name of a downloaded index or path to an index
-i, --input <FILE> Reads to predict resistance from
-o, --outdir <DIR> Directory to place output [default: .]
-s, --sample <SAMPLE> Identifier to use for the sample
-I, --illumina Sample reads are from Illumina sequencing
Filter:
-S, --ignore-synonymous Ignore unknown (off-catalogue) variants that cause a synonymous substitution
-f, --maf <FLOAT[0.0-1.0]> Minimum allele frequency to call variants [default: 1]
Full usage
$ drprg predict --help
Predict drug resistance
Usage: drprg predict [OPTIONS] --index <DIR> --input <FILE>
Options:
-p, --pandora <FILE>
Path to pandora executable. Will try in src/ext or $PATH if not given
-v, --verbose
Use verbose output
-m, --makeprg <FILE>
Path to make_prg executable. Will try in src/ext or $PATH if not given
-t, --threads <INT>
Maximum number of threads to use
Use 0 to select the number automatically
[default: 1]
-M, --mafft <FILE>
Path to MAFFT executable. Will try in src/ext or $PATH if not given
-h, --help
Print help (see a summary with '-h')
Input/Output:
-x, --index <DIR>
Name of a downloaded index or path to an index
-i, --input <FILE>
Reads to predict resistance from
Both fasta and fastq are accepted, along with compressed or uncompressed.
-o, --outdir <DIR>
Directory to place output
[default: .]
-s, --sample <SAMPLE>
Identifier to use for the sample
If not provided, this will be set to the input reads file path prefix
-I, --illumina
Sample reads are from Illumina sequencing
Filter:
-S, --ignore-synonymous
Ignore unknown (off-catalogue) variants that cause a synonymous substitution
-f, --maf <FLOAT[0.0-1.0]>
Minimum allele frequency to call variants
If an alternate allele has at least this fraction of the depth, a minor resistance ("r") prediction is made. Set to 1 to disable. If --illumina is passed, the default is 0.1
[default: 1]
--debug
Output debugging files. Mostly for development purposes
-d, --min-covg <INT>
Minimum depth of coverage allowed on variants
[default: 3]
-D, --max-covg <INT>
Maximum depth of coverage allowed on variants
[default: 2147483647]
-b, --min-strand-bias <FLOAT>
Minimum strand bias ratio allowed on variants
For example, setting to 0.25 requires >=25% of total (allele) coverage on both strands for an allele.
[default: 0.01]
-g, --min-gt-conf <FLOAT>
Minimum genotype confidence (GT_CONF) score allow on variants
[default: 0]
-L, --max-indel <INT>
Maximum (absolute) length of insertions/deletions allowed
-K, --min-frs <FLOAT>
Minimum fraction of read support
For example, setting to 0.9 requires >=90% of coverage for the variant to be on the called allele
[default: 0]
Prediction Output
Inside the predict output directory (-o/--outdir) you'll find a collection of files and directories. We outline the files that are likely to be of interest to most users.
Prediction JSON
This is <sample>.drprg.json, where <sample> is the value given to the -s/--sample option.
This is the only file most users will want/need to interact with. It contains the resistance prediction for each drug in the index's catalogue.
Example
This is a trimmed (toy) example JSON output for a sample
{
"genes": {
"absent": [
"ahpC"
],
"present": [
"embA",
"embB",
"ethA",
"fabG1",
"gid",
"gyrA",
"gyrB",
"inhA",
"katG",
"pncA",
"rpoB",
"rrs"
]
},
"sample": "toy",
"susceptibility": {
"Amikacin": {
"evidence": [
{
"gene": "rrs",
"residue": "DNA",
"variant": "A1401X",
"vcfid": "b815ed3f"
}
],
"predict": "F"
},
"Ethambutol": {
"evidence": [
{
"gene": "embB",
"residue": "PROT",
"variant": "M306I",
"vcfid": "a290b118"
}
],
"predict": "r"
},
"Ethionamide": {
"evidence": [
{
"gene": "ethA",
"residue": "PROT",
"variant": "A381P",
"vcfid": "169f75d4"
}
],
"predict": "U"
},
"Isoniazid": {
"evidence": [
{
"gene": "fabG1",
"residue": "DNA",
"variant": "G-17T",
"vcfid": "de9b689e"
},
{
"gene": "katG",
"residue": "PROT",
"variant": "S315T",
"vcfid": "acaa8ca2"
}
],
"predict": "R"
},
"Levofloxacin": {
"evidence": [],
"predict": "S"
}
},
"version": {
"drprg": "0.1.1",
"index": "20230308"
}
}
The keys of the JSON are
- genes: This contains a list of genes in the index reference graph which are present and absent
- sample: The value passed to the -s/--sample option
- susceptibility: The keys of this entry are the drugs in the index catalogue. Each drug's entry contains evidence supporting the value in the predict section.
Predict
The predict entry for a drug is the resistance prediction for the sample. Possible values are
- S: susceptible. This is the "default" prediction. If no mutations are detected for the sample, it is assumed to be susceptible
- F: failed. Genotyping failed for one or more mutations for this drug. See the prediction VCF for more information
- U: unknown. One or more mutations that are not present in the index catalogue were detected in a gene associated with this drug
- R: resistant. One or more mutations from the index catalogue that confer resistance were detected
- u or r: The same as the uppercase versions, but the mutation(s) were detected in a minor allele.
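To show how these codes might be consumed downstream, here is a short Python sketch that loads a prediction JSON and lists every drug that is not plainly susceptible. It relies only on the susceptibility and predict keys shown in the example above. The function names and the wording of the descriptions are our own.

```python
import json

# Meaning of each code, paraphrased from the list above.
PREDICTION_MEANING = {
    "S": "susceptible",
    "F": "failed",
    "U": "unknown",
    "R": "resistant",
    "u": "unknown (minor allele)",
    "r": "resistant (minor allele)",
}


def load_report(path):
    """Read a <sample>.drprg.json file into a dictionary."""
    with open(path) as fh:
        return json.load(fh)


def non_susceptible(report):
    """Map each drug to its prediction code, skipping plain 'S' calls."""
    return {
        drug: entry["predict"]
        for drug, entry in report["susceptibility"].items()
        if entry["predict"] != "S"
    }


def describe(report):
    """One tab-separated line per drug: name, code, and meaning."""
    return [
        f"{drug}\t{entry['predict']}\t"
        f"{PREDICTION_MEANING.get(entry['predict'], 'unrecognised code')}"
        for drug, entry in sorted(report["susceptibility"].items())
    ]
```

For the toy example above, non_susceptible would keep Amikacin (F), Ethambutol (r), Ethionamide (U) and Isoniazid (R), and drop Levofloxacin.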
Evidence
This is a list of the mutations supporting the prediction. The residue is one of DNA or PROT indicating whether the mutation describes a nucleotide or amino acid change, respectively.
The variant is of the form <ref><pos><alt>; where <ref> is the reference sequence at position <pos> and <alt> is the nucleotide/amino acid the reference is changed to. See the catalogue docs for more information.
The vcfid is the value in the VCF ID column for this mutation, making it easier to find a mutation in the VCF.
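As a sketch of how an evidence entry can be unpacked, the Python snippet below splits the variant string into its <ref>, <pos> and <alt> parts and builds a readable description. The regular expression is an assumption based on the examples in this documentation (letters or * for the alleles, an optional minus sign for promoter positions). It is not taken from drprg itself.

```python
import re

# <ref><pos><alt>, e.g. S315T, G-17T, A1401X or S159*
VARIANT_RE = re.compile(r"^(?P<ref>[A-Za-z*]+)(?P<pos>-?\d+)(?P<alt>[A-Za-z*]+)$")
RESIDUE_LABEL = {"DNA": "nucleotide", "PROT": "amino acid"}


def describe_evidence(item):
    """Turn one evidence entry from the prediction JSON into a sentence."""
    match = VARIANT_RE.match(item["variant"])
    if match is None:
        raise ValueError(f"unexpected variant format: {item['variant']!r}")
    kind = RESIDUE_LABEL[item["residue"]]
    return (
        f"{item['gene']}: {kind} {match['ref']} -> {match['alt']} "
        f"at position {int(match['pos'])} (VCF ID {item['vcfid']})"
    )
```

The returned string includes the vcfid, so it can be used directly when searching the prediction VCF.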
Prediction VCF
This is <sample>.drprg.bcf. As this file is a BCF, you will need to use bcftools to view it - e.g. bcftools view sample.drprg.bcf.
You should only need to interact with this file if you want further information about the exact evidence supporting a mutation being called, why a mutation was called as failed (F), or why a mutation wasn't called. For those mutations in the JSON, you can easily look them up using the vcfid, which can be found in the ID (third) column of the BCF. Or you can just use grep. For example, to look up the Isoniazid S315T mutation in katG from the example you can use
$ bcftools view sample.drprg.bcf | grep acaa8ca2
katG 1044 acaa8ca2 GC AC,CA,CC . PASS VC=PH_SNPs;GRAPHTYPE=SIMPLE;PDP=0,0.0123457,0,0.987654;VARID=katG_S315T;PREDICT=R GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 3:0.1.1,42:0,0,0,38:0.1.1,42:0,0,0,38:0.1.1,127:0,0,0,116:1,1,1,0:-523.019,-514.096,-523.019,-7.87925:506.217
All INFO and FORMAT fields are defined in the header of the BCF file. We recommend reading the VCF/BCF specifications for help with how to interpret data in a VCF file.
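If you would rather look up a record programmatically than with grep, the following Python sketch works on the text output of bcftools view. It assumes standard tab-separated VCF columns and only unpacks the INFO column. The function names are hypothetical.

```python
def parse_info(info_field):
    """Split a VCF INFO column into a dict; flags without '=' map to True."""
    info = {}
    if info_field == ".":
        return info
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def find_record(vcf_lines, vcfid):
    """Return the first record whose ID column contains vcfid, else None."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        if vcfid in fields[2].split(";"):
            return {
                "chrom": fields[0],
                "pos": int(fields[1]),
                "ref": fields[3],
                "alt": fields[4].split(","),
                "filter": fields[6],
                "info": parse_info(fields[7]),
            }
    return None
```

The lines can come from running bcftools view sample.drprg.bcf through subprocess.run with capture_output=True and text=True, then calling splitlines() on the captured output. For the katG example above, the returned info dictionary would contain VARID and PREDICT entries.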
Build an index
The build subcommand is used to build a custom index to predict from.
$ drprg build -a annotation.gff3 -i catalogue.tsv -f ref.fa -b variants.vcf.gz -o outdir
Required
Catalogue
Also referred to as a "panel" sometimes. This is the catalogue of mutations that confer resistance (or susceptibility). See the Catalogue docs for more details of what format this file must take. Provided via the -i/--panel option.
Reference genome
A FASTA file of the reference genome that will act as the "skeleton" for the reference graph. Provided via the -f/--fasta option.
Annotation
The annotation is a GFF3 file for the species you are building the index for. Provided via the -a/--gff option.
VCF
A VCF to build the reference graph from. See VCF for a detailed description of this file. Provided via the -b/--vcf option. This option is mutually exclusive with --prebuilt-prg.
Optional
Expert rules
These describe a set of rules - rather than a single mutation - that cause resistance. See Expert rules for a detailed description of the rules. Provided via the -r/--rules option.
Padding
Number of bases to add to the start and end of each gene (locus) in the reference graph. Set to 100 by default. Provided via the -P/--padding option.
Prebuilt reference graph
Advanced users only. This is for those who know how to build their own reference graphs and would prefer to do so by hand rather than let DrPRG do it for them. The value passed to this option must be a directory containing:
- A reference graph file named dr.prg
- A directory of multiple sequence alignments (/msas/) that the reference graph was built from. There must be a fasta alignment file for each gene in this directory (<gene>.fa).
- An optional pandora index. If not present, one will be created.
The reference graph is expected to contain the reference sequence (with padding) for each gene according to the annotation and reference sequence provided.
Provided with the -d/--prebuilt-prg option.
Version
The version name for the index. Set to the current date (YYYYMMDD) by default. Provided via the --version option.
Quick usage
$ drprg build -h
Build an index to predict resistance from
Usage: drprg build [OPTIONS] --gff <FILE> --panel <FILE> --fasta <FILE>
Options:
-v, --verbose Use verbose output
-t, --threads <INT> Maximum number of threads to use [default: 1]
-P, --padding <INT> Number of bases of padding to add to start and end of each gene [default: 100]
-I, --no-fai Don't index --fasta if an index doesn't exist
-C, --no-csi Don't index --vcf if an index doesn't exist
--version <VERSION> Version to use for the index [default: 20230405]
-h, --help Print help (see more with '--help')
Input/Output:
-a, --gff <FILE> Annotation file that will be used to gather information about genes in catalogue
-i, --panel <FILE> Panel/catalogue to build index for
-f, --fasta <FILE> Reference genome in FASTA format (must be indexed with samtools faidx)
-b, --vcf <FILE> An indexed VCF to build the index PRG from. If not provided, then a prebuilt PRG must be given. See `--prebuilt-prg`
-o, --outdir <DIR> Directory to place output [default: .]
-d, --prebuilt-prg <DIR> A prebuilt PRG to use
-r, --rules <FILE> "Expert rules" to be applied in addition to the catalogue
Full usage
$ drprg build --help
Build an index to predict resistance from
Usage: drprg build [OPTIONS] --gff <FILE> --panel <FILE> --fasta <FILE>
Options:
-p, --pandora <FILE>
Path to pandora executable. Will try in src/ext or $PATH if not given
-v, --verbose
Use verbose output
-m, --makeprg <FILE>
Path to make_prg executable. Will try in src/ext or $PATH if not given
-t, --threads <INT>
Maximum number of threads to use
Use 0 to select the number automatically
[default: 1]
-M, --mafft <FILE>
Path to MAFFT executable. Will try in src/ext or $PATH if not given
-B, --bcftools <FILE>
Path to bcftools executable. Will try in src/ext or $PATH if not given
-P, --padding <INT>
Number of bases of padding to add to start and end of each gene
[default: 100]
-l, --match-len <INT>
Minimum number of consecutive characters which must be identical for a match in make_prg
[default: 5]
-N, --max-nesting <INT>
Maximum nesting level when constructing the reference graph with make_prg
[default: 5]
-k, --pandora-k <INT>
Kmer size to use for pandora
[default: 15]
-w, --pandora-w <INT>
Window size to use for pandora
[default: 11]
-I, --no-fai
Don't index --fasta if an index doesn't exist
-C, --no-csi
Don't index --vcf if an index doesn't exist
--version <VERSION>
Version to use for the index
[default: 20230405]
-h, --help
Print help (see a summary with '-h')
Input/Output:
-a, --gff <FILE>
Annotation file that will be used to gather information about genes in catalogue
-i, --panel <FILE>
Panel/catalogue to build index for
-f, --fasta <FILE>
Reference genome in FASTA format (must be indexed with samtools faidx)
-b, --vcf <FILE>
An indexed VCF to build the index PRG from. If not provided, then a prebuilt PRG must be given. See `--prebuilt-prg`
-o, --outdir <DIR>
Directory to place output
[default: .]
-d, --prebuilt-prg <DIR>
A prebuilt PRG to use.
Only build the panel VCF and reference sequences - not the PRG. This directory MUST contain a PRG file named `dr.prg`, along, with a directory called `msas/` that contains an MSA fasta file for each gene `<gene>.fa`. There can optionally also be a pandora index file, but if not, the indexing will be performed by drprg. Note: the PRG is expected to contain the reference sequence for each gene according to the annotation and reference genome given (along with padding) and must be in the forward strand orientation.
-r, --rules <FILE>
"Expert rules" to be applied in addition to the catalogue.
CSV file with blanket rules that describe resistance (or susceptibility). The columns are <variant type>,<gene>,<start>,<end>,<drug(s)>. See the docs for a detailed explanation.
VCF to build reference graph
The input VCF (-b/--vcf) for build has some very particular requirements. If you aren't very familiar with the formal VCF specifications, please read them first.
This VCF file is what drprg uses to construct the reference graph from. Variants from it are applied to the skeleton reference genome. It represents the population variation you would like to build into the reference graph. For a greater understanding of these concepts, we recommend reading this paper.
To add variants from multiple samples, simply use a multi-sample VCF.
The CHROM field should be the gene name. As such, the POS is with respect to the gene. It's important to ensure the POS takes into account the padding that will be used when running build. A helper script for converting a VCF from reference genome coordinates to gene coordinates can be found here. In addition, drprg will raise an error if the coordinates are incorrect with respect to the annotation and reference genome you provide.
Example from Mycobacterium tuberculosis
We provide an example of how the VCF for the work on M. tuberculosis (MTB) was generated.
Call variants for population
The first step is to select the population of samples you want variation from. For the MTB reference graph, we took a subset of the 15,211 global MTB isolates published by the CRyPTIC consortium. Briefly, the 15,211 samples were individually variant-called using clockwork and then joint-genotyped using minos. The "wide" VCF obtained at the end contains genotypes for all 15,211 samples against all variants discovered across the population.
Note: the reference genome you call these variants against must be the same as the skeleton reference.
(Optional) subset population
In our running example, we have 15,211 samples in our VCF. This is far too many to be useful for building a reference graph. This paper illustrates the problem nicely. In addition, Figure 3.2 and Sections 3.6.4 and 3.6.5 in this thesis show that using too many samples does not improve variant precision or recall, but does increase memory usage and CPU time.
To create a representative subset, we split the VCF file into separate lineage VCFs, where all samples in a VCF are of the same lineage. We randomly chose 20 samples from each lineage 1 through 4, as well as 20 samples from all other lineages combined. In addition, we included 17 clinical samples representing MTB global diversity (lineages 1-6) to give a total of 117 samples.
We generated the subsets by listing all samples in the VCF file
bcftools query --list-samples wide.vcf.gz > samples.txt
and then splitting this list by lineage based on lineage classifications. If we have a list of Lineage 1 samples in a text file, we can get 20 random samples from it with
shuf -n 20 L1.txt > L1.subsampled.txt
We then combine all of these text files of sample names into a single text file, extract the 117 samples, and filter the remaining variants with
bcftools view -S samples.subsampled.txt wide.vcf.gz |
bcftools view -a -U -c 1:nref -o MTB.vcf.gz
-a trims ALT alleles not seen in the genotype fields, -U excludes sites without a called genotype, and -c 1:nref removes positions where there are no non-reference alleles.
You can find the resulting VCF (MTB.vcf.gz) we generated here.
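The per-lineage random sampling described in this section can also be sketched in Python instead of using shuf. This is only an illustration. The function name, the dictionary of lineage to sample names, and the fixed seed (added so the selection is reproducible) are assumptions and were not part of the original workflow.

```python
import random


def subsample_by_lineage(lineages, n=20, seed=0):
    """Pick up to n random sample names from each lineage.

    lineages: dict mapping a lineage label to a list of sample names
    n:        number of samples to draw per lineage
    seed:     seed for the random number generator (reproducibility only)
    """
    rng = random.Random(seed)
    chosen = []
    for lineage in sorted(lineages):
        samples = lineages[lineage]
        # Take every sample if the lineage has fewer than n members.
        chosen.extend(rng.sample(samples, min(n, len(samples))))
    return chosen
```

Writing the returned names one per line to samples.subsampled.txt gives a file suitable for the -S option of bcftools view.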
(Optional) Manually add "orphan" mutations
We also manually added ~50 mutations that were commonly found in minor frequencies but were not contained in the population VCF.
This is done by creating a file of mutations of the form <gene>_<mutation> - this is effectively columns 1 and 2 from the catalogue separated by an underscore. For example
rpoB_I491F
rplC_C154R
ddn_L49P
gid_Q125*
rrs_g1484t
embB_L74R
The difference between this and the catalogue, though, is that DNA mutations should have the nucleotides in lowercase (e.g. rrs_g1484t). A VCF can be generated for this list of mutations using this script.
python create_orphan_mutations.py -r ref.fa -m orphan_mutations.txt \
-g annotation.gff -o orphan.vcf
We then merge this VCF of "orphan" mutations into the population VCF with
bcftools merge MTB.vcf.gz orphan.vcf |
bcftools norm -c e -f reference.fa -o merged.vcf
-c e just tells bcftools norm to error if the REF alleles in the VCFs do not match the reference.
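The <gene>_<mutation> naming used for the orphan mutation list earlier in this section can be derived from catalogue rows with a few lines of Python. This sketch assumes you already have the gene, mutation and residue type for each mutation. The function name is hypothetical, and lowercasing the whole mutation string for DNA entries is our reading of the rule described above.

```python
def orphan_name(gene, mutation, residue):
    """Build '<gene>_<mutation>', lowercasing nucleotides for DNA mutations."""
    if residue == "DNA":
        mutation = mutation.lower()
    elif residue != "PROT":
        raise ValueError(f"residue must be DNA or PROT, got {residue!r}")
    return f"{gene}_{mutation}"
```

For example, a catalogue entry for rrs with mutation G1484T and residue DNA becomes rrs_g1484t, whereas a protein change such as rpoB I491F keeps its uppercase amino acid codes.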
Extract catalogue genes from VCF
DrPRG expects the VCF CHROM field to be the name of the gene, rather than the reference contig/chromosome name - which is what we currently have. As such, the POS must be with respect to the gene. It's important to ensure the POS takes into account the padding that will be used when running build.
A helper script for converting a VCF from reference genome coordinates to gene coordinates can be found here. In addition, drprg will raise an error if the coordinates are incorrect with respect to the annotation and reference genome you provide. If you provide this script with the catalogue/panel you will use in build, it will extract only those variants that fall within the genes in your catalogue.
python extract_panel_genes_from_vcf.py --padding 100 \
-i catalogue.tsv -g annotation.gff --vcf merged.vcf -o final.vcf
final.vcf can now be used as the input VCF for build.
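To make the coordinate shift concrete, here is a small Python sketch of how a genome position could be translated into a padded, gene-relative POS. It is not the helper script mentioned above. It assumes 1-based inclusive gene coordinates as in GFF3, forward-strand orientation, and that padding is truncated at the start of the contig. The function name is hypothetical.

```python
from typing import Optional


def genome_to_gene_pos(pos: int, gene_start: int, gene_end: int,
                       padding: int = 100) -> Optional[int]:
    """Convert a 1-based genome position to a 1-based padded gene position.

    Returns None if the position lies outside the padded gene region.
    """
    padded_start = max(1, gene_start - padding)  # cannot go before the contig start
    padded_end = gene_end + padding
    if not padded_start <= pos <= padded_end:
        return None
    return pos - padded_start + 1
```

With a gene spanning 1000 to 2000 and the default padding of 100, the first base of the gene (genome position 1000) becomes POS 101, because the 100 padding bases come first.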
Catalogue of mutations
Also referred to as a "panel" sometimes. This is the catalogue of mutations that confer resistance (or susceptibility). Provided via the -i/--panel option.
This file is a tab-delimited (TSV) file with four columns
- The gene name of the mutation
- The mutation in the form <ref><pos><alt>, where <ref> is the reference nucleotide or amino acid, <pos> is the 1-based position in the gene/protein, and <alt> is the nucleotide or amino acid the reference is changed to. <pos> is with respect to the type of mutation - i.e. if the mutation is an amino acid change, <pos> must be the position within the protein (codon position). Promoter mutations can be specified with a negative position - e.g. -10 means 10 nucleotides before the first position in the gene. One-letter codes are to be used for amino acids.
- The residue type - DNA (nucleotide change) or PROT (amino acid change).
- Comma-separated list of the drug(s) the mutation confers resistance to. Use NONE if the mutation is not associated with resistance.
Example
pncA TCG196TAG DNA Pyrazinamide
pncA T142R PROT Pyrazinamide
tlyA S159* PROT Capreomycin
embB M306N PROT Ethambutol
rpoB C1275CCA DNA Rifampicin
gid R137P PROT Streptomycin
gid R118S PROT Streptomycin
tlyA G123GC DNA Capreomycin
pncA G97D PROT Pyrazinamide
fabG1 C-15X DNA Ethionamide,Isoniazid
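Before running build, it can help to sanity-check a catalogue file. The Python sketch below parses the four tab-separated columns described above and rejects rows that do not fit. It reflects our reading of the format only: the regular expression, the class name and the function name are assumptions, and drprg's own validation may be stricter or differ in detail.

```python
import re
from typing import NamedTuple, Tuple

# <ref><pos><alt>, e.g. T142R, TCG196TAG, S159* or C-15X
VARIANT_RE = re.compile(r"^([A-Za-z*]+)(-?\d+)([A-Za-z*]+)$")


class CatalogueEntry(NamedTuple):
    gene: str
    ref: str
    pos: int
    alt: str
    residue: str
    drugs: Tuple[str, ...]


def parse_catalogue(text: str) -> list:
    """Parse catalogue TSV text into entries, raising ValueError on bad rows."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        columns = line.split("\t")
        if len(columns) != 4:
            raise ValueError(f"line {lineno}: expected 4 columns, got {len(columns)}")
        gene, variant, residue, drugs = columns
        match = VARIANT_RE.match(variant)
        if match is None:
            raise ValueError(f"line {lineno}: bad mutation {variant!r}")
        if residue not in ("DNA", "PROT"):
            raise ValueError(f"line {lineno}: bad residue type {residue!r}")
        # NONE marks a mutation that is not associated with resistance.
        drug_list = () if drugs == "NONE" else tuple(drugs.split(","))
        entries.append(
            CatalogueEntry(gene, match[1], int(match[2]), match[3], residue, drug_list)
        )
    return entries
```

Reading the file with open(path).read() and passing the text to parse_catalogue returns one entry per mutation. A promoter mutation such as C-15X is stored with a negative pos.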
Expert rules
These are blanket rules that describe resistance (or susceptibility). The file is a CSV with each row representing a rule and is passed to drprg build via the --rules option. The format of each row is
vartype,gene,start,end,drug
- vartype: the variant type of the rule. Supported types are:
  - frameshift - Any insertion or deletion whose length is not a multiple of three
  - missense - A DNA change that results in a different amino acid
  - nonsense - A DNA change that results in a stop codon instead of an amino acid
  - absence - Gene is absent
- gene: the name of the gene the rule applies to
- start: An optional start position for the rule to apply from. The position is in codon coordinates where the rule applies to amino acid changes and is 1-based inclusive. If not provided, the start of the gene is inferred. If you want to include the upstream (promoter) region of the gene, use negative coordinates.
- end: An optional end position for the rule to apply to. The position is in codon coordinates where the rule applies to amino acid changes and is 1-based inclusive. If not provided, the end of the gene is inferred.
- drug: A semi-colon-delimited (;) list of drugs the rule impacts. If the rule confers susceptibility, use NONE for this column.
If there are certain rules you need for your species-of-interest, raise an issue, and we can look at implementing them.
Example
This is an example of the M. tuberculosis expert rules file used in our paper.
missense,rpoB,426,452,Rifampicin
nonsense,rpoB,426,452,Rifampicin
frameshift,rpoB,1276,1356,Rifampicin
nonsense,katG,,,Isoniazid
frameshift,katG,,,Isoniazid
absence,katG,,,Isoniazid
nonsense,ethA,,,Ethionamide
frameshift,ethA,,,Ethionamide
absence,ethA,,,Ethionamide
nonsense,gid,,,Streptomycin
frameshift,gid,,,Streptomycin
absence,gid,,,Streptomycin
nonsense,pncA,,,Pyrazinamide
frameshift,pncA,,,Pyrazinamide
absence,pncA,,,Pyrazinamide
missense,katG,315,315,Isoniazid
missense,gid,125,125,Streptomycin
missense,rpoB,425,425,Rifampicin
missense,gid,136,136,Streptomycin
The row
frameshift,pncA,,,Pyrazinamide
says that a frameshift anywhere within the pncA gene will cause resistance to Pyrazinamide.
nonsense,rpoB,426,452,Rifampicin
frameshift,rpoB,1276,1356,Rifampicin
These two rules illustrate the context of the start and end coordinates. In the first row, we say that any nonsense mutation between 426 and 452 in rpoB causes resistance to Rifampicin. As nonsense mutations only apply to amino acid changes, the coordinates are in codon-space. The second row, by contrast, describes a frameshift, which only applies to nucleotides; therefore, 1276 and 1356 are in base-space (i.e. the 1276th nucleotide/base). (As an aside, these two rules both apply to the same region - the RRDR.)
missense,katG,315,315,Isoniazid
describes any missense mutation at position 315 in katG causing isoniazid resistance.
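The rule format can be illustrated with a short Python sketch that parses a rules file and checks whether a position falls inside a rule's range. It is a reading aid rather than drprg's implementation. The class and function names are hypothetical, and the caller is expected to supply the position in the coordinate space that matches the rule type (codons for missense and nonsense, bases for frameshift).

```python
import csv
import io
from typing import NamedTuple, Optional, Tuple

VARTYPES = {"frameshift", "missense", "nonsense", "absence"}


class Rule(NamedTuple):
    vartype: str
    gene: str
    start: Optional[int]
    end: Optional[int]
    drugs: Tuple[str, ...]


def parse_rules(text: str) -> list:
    """Parse expert rules CSV text (vartype,gene,start,end,drug) into Rule objects."""
    rules = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"expected 5 columns, got {len(row)}: {row}")
        vartype, gene, start, end, drugs = row
        if vartype not in VARTYPES:
            raise ValueError(f"unsupported variant type: {vartype!r}")
        rules.append(
            Rule(
                vartype,
                gene,
                int(start) if start else None,
                int(end) if end else None,
                () if drugs == "NONE" else tuple(drugs.split(";")),
            )
        )
    return rules


def in_rule_range(rule: Rule, pos: int) -> bool:
    """True if pos lies within the rule's 1-based inclusive range.

    A missing start means the start of the gene (position 1). A missing end
    means the end of the gene, which this sketch cannot know, so it only
    checks the lower bound in that case.
    """
    lower = 1 if rule.start is None else rule.start
    if pos < lower:
        return False
    return rule.end is None or pos <= rule.end
```

For the rule nonsense,rpoB,426,452,Rifampicin, codon 430 is in range and codon 453 is not. For frameshift,pncA,,,Pyrazinamide, any position from 1 onwards is accepted, but a promoter position such as -3 is not.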
Contributing
If you would like to contribute to Dr. PRG, feel free to open a pull request with suggested changes, or even raise an issue to discuss potential contributions.
If you would like to contribute a prebuilt index for a species, head on over to https://github.com/mbhall88/drprg-index
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
0.1.1 - 2023-04-06
Fixed
- CI issue that meant no binaries were built
0.1.0 - 2023-04-06
- First release! Everything is new!