pafpy
A lightweight library for working with PAF (Pairwise mApping Format) files.
Install
PyPI
pip install pafpy
Conda
conda install -c bioconda pafpy
Locally
If you would like to install locally, the recommended way is using poetry.
git clone https://github.com/mbhall88/pafpy.git
cd pafpy
make install
# to check the library is installed run
poetry run python -c "from pafpy import PafRecord;print(str(PafRecord()))"
# you should see an (unmapped) PAF record printed to the terminal
# you can also run the tests if you like
make test-code
Usage
If there is any functionality you feel is missing or would make pafpy more user-friendly, please raise an issue with a feature request.
Basic
In the basic usage pattern below, we collect the BLAST identity of all primary alignments in our PAF file into a list.
from typing import List
from pafpy import PafFile
path = "path/to/sample.paf"
identities: List[float] = []
with PafFile(path) as paf:
    for record in paf:
        if record.is_primary():
            identity = record.blast_identity()
            identities.append(identity)
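For intuition about the number being collected, BLAST identity is commonly defined as the number of matching bases divided by the number of alignment columns, which correspond to columns 10 and 11 of a PAF line. The sketch below is a standard-library illustration of that definition only; the helper name `blast_identity` and its argument names are hypothetical, and it is not meant to describe how pafpy computes the value internally.

```python
def blast_identity(mlen: int, blen: int) -> float:
    """Matching bases divided by alignment columns (hypothetical helper)."""
    if blen <= 0:
        raise ValueError("alignment block length must be positive")
    return mlen / blen


# values as they would appear in columns 10 and 11 of a PAF line
identity = blast_identity(mlen=1139, blen=1228)
print(f"{identity:.4f}")
```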
Another use case might be that we want to get the identifiers of all records aligned to a specific contig, but only keep the alignments where more than 50% of the query (read) is aligned.
from typing import List
from pafpy import PafFile
path = "path/to/sample.paf"
contig = "chr1"
min_covg = 0.5
identifiers: List[str] = []
with PafFile(path) as paf:
    for record in paf:
        if record.tname == contig and record.query_coverage > min_covg:
            identifiers.append(record.qname)
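The coverage filter above can be read as "aligned span of the query divided by query length". The following is a minimal sketch under that assumption (the text here does not spell out the formula); `query_coverage` as a free function and its parameters are hypothetical names chosen to mirror the PAF columns, with a 0-based start and an exclusive end.

```python
def query_coverage(qlen: int, qstart: int, qend: int) -> float:
    """Fraction of the query covered by the alignment (hypothetical helper).

    Assumes PAF-style coordinates: 0-based start, exclusive end.
    """
    if qlen <= 0:
        raise ValueError("query length must be positive")
    return (qend - qstart) / qlen


min_covg = 0.5
covg = query_coverage(qlen=1239, qstart=65, qend=1239)
keep = covg > min_covg  # True when more than half of the read is aligned
```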
Advanced
Manual open/close
Sometimes a with context manager is not appropriate, and you would like to open the file and pass it around manually. This can be achieved using the PafFile.open() and PafFile.close() methods. As an example, let's say we define a function that takes a PafFile and counts the number of records in it.
from pafpy import PafFile
def count_records(paf: PafFile) -> int:
    """Counts the number of alignments in a PAF file."""
    return sum(1 for _ in paf)
path = "path/to/sample.paf"
paf = PafFile(path).open()
num_records = count_records(paf)
# it's good practice to close the file yourself rather than rely on the garbage collector
paf.close()
assert paf.closed
Admittedly, this is a contrived example, and we could have still used the context manager, but you get the point 😉.
Compressed input
If your PAF file has been compressed with gzip, you don't need to do anything: just pass the filepath as you would any other. Compression is detected by reading the first two bytes of the file, so the filepath doesn't even need a .gz extension.
from pafpy import PafFile
path = "sample.paf.gz"
with PafFile(path) as paf:
    for record in paf:
        # do something with your records
        pass
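To illustrate why the file extension is irrelevant, here is a small standard-library sketch of two-byte sniffing: every gzip stream begins with the bytes 0x1f 0x8b. The helper names `looks_gzipped` and `open_text` are hypothetical and this is not pafpy's own code, only a demonstration of the general technique.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # the two bytes every gzip stream starts with


def looks_gzipped(path: str) -> bool:
    """Hypothetical helper: compare the first two bytes to the gzip magic."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_text(path: str):
    """Open a plain or gzip-compressed file for reading text."""
    if looks_gzipped(path):
        return gzip.open(path, "rt")
    return open(path, "r")


# write a compressed file whose name deliberately lacks a .gz extension
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.paf")
with gzip.open(path, "wt") as handle:
    handle.write("query_name\t1239\t65\t1239\t+\ttarget_name\n")

with open_text(path) as handle:
    first_line = handle.readline()
```

The file is read back as text even though nothing in its name hints at compression.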
Working with file streams/objects
An already-open file can also be used to construct a PafFile object. If a PafFile is constructed with an open file like this, there is no need to open the PafFile.
from pafpy import PafFile
with open("sample.paf") as fileobj:
    paf = PafFile(fileobj)
    # no need to open the paf
    for record in paf:
        # do something with record
        pass
If you want to use stdin as your PAF source, it is strongly recommended you use "-" to construct the PafFile. The library will auto-detect if stdin is compressed and will decompress it accordingly.
from pafpy import PafFile
paf = PafFile("-")
# note: stdin is a stream so doesn't need to be opened
for record in paf:
    # do something with record
    pass
Advanced users: If you must use sys.stdin and not "-", then please pass sys.stdin.buffer (the underlying binary stream).
Fetch individual records
for loops aren't the only way of retrieving records in a file. You can also ask for records manually.
from pafpy import PafFile
path = "path/to/sample.paf"
with PafFile(path) as paf:
    record = next(paf)
    # do something with your lonely record
Working with strands
There is an enum for representing the strand field: Strand. It has a couple of advantages over just using a str, but the main one is readability. Let's count the number of records that mapped to the reverse strand.
from pafpy import PafFile, Strand
path = "path/to/sample.paf"
num_reverse = 0
with PafFile(path) as paf:
    for record in paf:
        if record.strand is Strand.Reverse:
            num_reverse += 1
You can convert strands to and from str quite easily.
from pafpy import Strand
assert Strand("+") is Strand.Forward
assert str(Strand.Reverse) == "-"
assert str(Strand.Unmapped) == "*"
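As a rough picture of how a value-backed enum supports this kind of round trip, the sketch below builds a stand-in with the standard library. `DemoStrand` is a hypothetical class written for illustration and is not pafpy's Strand; it only mirrors the three members and the string conversions shown above. In this sketch, an unrecognised symbol raises ValueError, which is one way an enum can be safer than a bare str.

```python
from enum import Enum


class DemoStrand(Enum):
    """Illustrative stand-in for a strand enum (not pafpy's class)."""

    Forward = "+"
    Reverse = "-"
    Unmapped = "*"

    def __str__(self) -> str:
        # render the member as its PAF symbol rather than "DemoStrand.Forward"
        return str(self.value)


# lookup by value returns the single shared member, so `is` comparisons work
strands = [DemoStrand(symbol) for symbol in "+-+*-"]
num_reverse = sum(1 for s in strands if s is DemoStrand.Reverse)

try:
    DemoStrand("?")
    rejected = False
except ValueError:
    rejected = True  # unknown symbols fail loudly instead of passing through
```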
PAF records
The object you will likely spend the most time with is PafRecord. Refer to the API docs for documentation on all the functions and member variables this class contains.
Let's look at a few use cases. Construction of PafRecords is quite flexible, as we wanted to make it very easy to write unit tests and construct arbitrary records without having to create an entire PAF file to do so.
There are two ways to construct a PafRecord:
- The default constructor: where you specify each member variable manually.
- The PafRecord.from_str() factory constructor: where you create a PafRecord directly from a str.
from pafpy import PafRecord, Strand, Tag
# default constructor
record1 = PafRecord(
    qname="query_name",
    qlen=1239,
    qstart=65,
    qend=1239,
    strand=Strand.Forward,
    tname="target_name",
    tlen=4378340,
    tstart=2555250,
    tend=2556472,
    mlen=1139,
    blen=1228,
    mapq=60,
    tags={"NM": Tag.from_str("NM:i:8")},
)
# from_str factory constructor
line = "query_name\t1239\t65\t1239\t+\ttarget_name\t4378340\t2555250\t2556472\t1139\t1228\t60\tNM:i:8"
record2 = PafRecord.from_str(line)
assert record1 == record2
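To make the column order behind the string form concrete, the sketch below splits a PAF line into its 12 mandatory fields using only the standard library. `MandatoryFields` and `parse_mandatory` are hypothetical names; the field names simply reuse the keyword arguments from the constructor example above. It ignores optional tags and is an illustration of the layout rather than a substitute for PafRecord.from_str().

```python
from typing import NamedTuple


class MandatoryFields(NamedTuple):
    """The 12 mandatory PAF columns, in file order (hypothetical container)."""

    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    mlen: int
    blen: int
    mapq: int


def parse_mandatory(line: str) -> MandatoryFields:
    """Split a tab-separated PAF line and convert the numeric columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(cols)}")
    return MandatoryFields(
        cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4],
        cols[5], int(cols[6]), int(cols[7]), int(cols[8]),
        int(cols[9]), int(cols[10]), int(cols[11]),
    )


line = "query_name\t1239\t65\t1239\t+\ttarget_name\t4378340\t2555250\t2556472\t1139\t1228\t60\tNM:i:8"
fields = parse_mandatory(line)
```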
SAM-like optional fields/tags
Each additional column after the 12th column in a PAF file is a SAM-like tag. The Tag class tries to make working with tags simple. You can extract tags from a PafRecord using PafRecord.get_tag(), or you may like to construct one yourself.
Let's look at some of these options.
from pafpy import PafRecord, Tag
line = "query_name\t1239\t65\t1239\t+\ttarget_name\t4378340\t2555250\t2556472\t1139\t1228\t60\tNM:i:8"
record = PafRecord.from_str(line)
tag = record.get_tag("NM")
assert tag.tag == "NM"
assert tag.type == "i"
assert tag.value == 8
# we can construct tags from scratch with a str
tag = Tag.from_str("tp:A:P")
assert tag.value == "P"
# or with the default constructor
tag = Tag(tag="de", type="f", value=0.2)
One thing to notice is that the value is in the correct type as specified in the tag specs. However, if you define a Tag with the default constructor, you can bypass this. The Tag.from_str() method validates the tag string to ensure it strictly matches the specifications. As such, we recommend using this method when constructing tags. However, there may be cases where you don't want to adhere to this convention, and in those cases, use the default constructor.
from pafpy import Tag, InvalidTagFormat
# type 'i' signifies an integer
tag = Tag(tag="NM", type="i", value="foo")
assert tag.value == "foo"
# if we try and do this with from_str we get an error
tag = None
err_msg = ""
try:
    tag = Tag.from_str("NM:i:foo")
except InvalidTagFormat as err:
    err_msg = str(err)
assert err_msg == "VALUE of tag NM:i:foo is not the expected TYPE"
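The difference between validated and unvalidated construction can be pictured with a small standard-library parser for the TAG:TYPE:VALUE layout. This is a simplified sketch, not pafpy's validation logic: `parse_tag` is a hypothetical function, it handles only four common type characters (i, f, A, Z), and it relies on Python's own int and float parsing, which is more lenient than a strict specification check.

```python
from typing import Tuple, Union

TagValue = Union[int, float, str]

# assumption: only these four SAM-style type characters are handled here
CONVERTERS = {"i": int, "f": float, "A": str, "Z": str}


def parse_tag(text: str) -> Tuple[str, str, TagValue]:
    """Parse TAG:TYPE:VALUE and convert VALUE according to TYPE."""
    parts = text.split(":", 2)  # the value itself may contain colons
    if len(parts) != 3:
        raise ValueError(f"{text!r} is not in TAG:TYPE:VALUE form")
    tag, type_char, raw = parts
    if type_char not in CONVERTERS:
        raise ValueError(f"unsupported type character {type_char!r}")
    if type_char == "A" and len(raw) != 1:
        raise ValueError(f"value {raw!r} does not match type 'A'")
    try:
        value = CONVERTERS[type_char](raw)
    except ValueError:
        raise ValueError(f"value {raw!r} does not match type {type_char!r}") from None
    return tag, type_char, value


nm = parse_tag("NM:i:8")
tp = parse_tag("tp:A:P")

err_msg = ""
try:
    parse_tag("NM:i:foo")
except ValueError as err:
    err_msg = str(err)
```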
Contributing
Contributions are very welcome. Please ensure all of your contributions are made via a pull request from a fork.
Setup
The recommended development environment is through poetry. If you prefer to use something else that is fine, but beware it could lead to different environment behaviour. The poetry.lock file under version control should ensure you are set up with the same environment as anyone else contributing to the library. Most of the standard development tasks are managed through a Makefile. The Makefile assumes that you are using poetry.
After cloning your fork locally and entering the project directory, you can set up the poetry environment with
make install
Formatting
All code is formatted with black and isort.
make fmt
Linting
flake8 handles linting. Please ensure there are no warnings before pushing any work.
make lint
Testing
To test both the code and documentation, run
make test
Code
All unit tests are contained in the tests directory. If you add any code, please ensure it is tested. Tests are handled by pytest. To test just the code, without also testing the documentation, run
make test-code
Docs
The document testing is orchestrated by scripts/mdpydoctest. This script will extract all markdown python code blocks from docstrings in python files and write a test file with a unit test for each snippet. It then runs pytest on this file to ensure all code examples in the docstrings are correct. To test the docs, run
make test-docs
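The extraction step described above can be sketched with a regular expression that pulls out fenced blocks annotated as py or python. This is a simplified standard-library illustration, not the actual scripts/mdpydoctest script: `extract_snippets` is a hypothetical name, and instead of writing a pytest file the sketch simply executes each snippet.

```python
import re
from typing import List

FENCE = "`" * 3  # built programmatically so this example can itself sit inside a fence
BLOCK_RE = re.compile(
    FENCE + r"(?:py|python)[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_snippets(docstring: str) -> List[str]:
    """Return the body of every block annotated as py or python."""
    return [match.group(1) for match in BLOCK_RE.finditer(docstring)]


docstring = "\n".join(
    [
        "Adds one to a number.",
        "",
        "## Example",
        FENCE + "py",
        "x = 1 + 1",
        "assert x == 2",
        FENCE,
        "",
        FENCE + "sh",
        "echo this block is not python and is skipped",
        FENCE,
    ]
)

snippets = extract_snippets(docstring)
for snippet in snippets:
    # run each example in a fresh namespace; a failing assert would raise here
    exec(compile(snippet, "<docstring>", "exec"), {})
```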
Coverage
Please keep the project's code coverage as high as possible. To check the code coverage, run
make coverage
This should show the coverage on the terminal and also open an HTML report in your web browser.
Documentation
The code is documented using markdown docstrings. The convention this project follows is akin to that used by the Rust programming language. The beginning of the docstring should explain what the function does and if it returns anything. This is followed, where relevant, by an example section ## Example. All examples should be valid, self-contained examples that can be copied and pasted into a python shell and executed successfully (assuming the user has pafpy installed). These code snippets must be in a code block annotated as py or python. See the code for examples. If the code being documented can raise an exception, the type(s) of errors should also be documented in an ## Errors section.
The documentation can be served locally in a browser so that you can view changes in real time by running
make serve-docs
and then navigating to the URL printed in the terminal (most likely http://localhost:8080).
The docs can also be built by running
make docs
Then open docs/index.html to view the documentation that will be deployed on pushing to master.
Committing
There is a convenience rule in the Makefile that can be run prior to committing that will run most of the above tasks for you
make precommit
"""A lightweight library for working with [PAF][PAF] (Pairwise mApping Format) files.
# Install
### PyPi
```sh
pip install pafpy
```
### Conda
```sh
conda install -c bioconda pafpy
```
### Locally
If you would like to install locally, the recommended way is using [poetry][poetry].
```sh
git clone https://github.com/mbhall88/pafpy.git
cd pafpy
make install
# to check the library is installed run
poetry run python -c "from pafpy import PafRecord;print(str(PafRecord()))"
# you should see an (unmapped) PAF record printed to the terminal
# you can also run the tests if you like
make test-code
```
.. include:: ../docs/USAGE.md
[poetry]: https://python-poetry.org/
[PAF]: https://github.com/lh3/miniasm/blob/master/PAF.md
[docs]: https://pafpy.xyz
[blast]: https://lh3.github.io/2018/11/25/on-the-definition-of-sequence-identity#blast-identity
[contribute]: https://github.com/mbhall88/pafpy/blob/master/CONTRIBUTING.md
.. include:: ../CONTRIBUTING.md
"""
from pafpy.__version__ import __version__ # noqa: F401
from pafpy.paffile import PafFile # noqa: F401
from pafpy.pafrecord import AlignmentType, MalformattedRecord, PafRecord # noqa: F401
from pafpy.strand import Strand # noqa: F401
from pafpy.tag import InvalidTagFormat, Tag, TagType, UnknownTagTypeChar # noqa: F401
API documentation
pafpy.paffile
This module contains objects for working with PAF files.
The main class of interest here is PafFile. It provides an interface to open/close a PAF file and to iterate over the alignment records within the file.
To use PafFile within your code, import it like so
from pafpy import PafFile
pafpy.pafrecord
This module contains objects for working with single alignment records within a PAF file.
The main class of interest here is PafRecord. It provides a set of member variables relating to each field within the record, as well as some convenience methods for common tasks.
To use PafRecord within your code, import it like so
from pafpy import PafRecord
pafpy.strand
A module containing objects relating to the strand field within a PAF file.
The main class of interest here is Strand. To use it within your code, import it like so
from pafpy import Strand
pafpy.tag
pafpy.utils
This module contains utility functions unlikely to be of use to anyone else.
from pafpy.utils import first_n_bytes, is_compressed