pafpy

A lightweight library for working with PAF (Pairwise mApping Format) files.


Install

PyPI

pip install pafpy

Conda

conda install -c bioconda pafpy

Locally

If you would like to install locally, the recommended way is using poetry.

git clone https://github.com/mbhall88/pafpy.git
cd pafpy
make install
# to check the library is installed run
poetry run python -c "from pafpy import PafRecord;print(str(PafRecord()))"
# you should see an (unmapped) PAF record printed to the terminal
# you can also run the tests if you like
make test-code

Usage

If there is any functionality you feel is missing or would make pafpy more user-friendly, please raise an issue with a feature request.

Basic

In the basic usage pattern below, we collect the BLAST identities of all primary alignments in our PAF file into a list.

from typing import List
from pafpy import PafFile

path = "path/to/sample.paf"

identities: List[float] = []
with PafFile(path) as paf:
    for record in paf:
        if record.is_primary():
            identity = record.blast_identity()
            identities.append(identity)
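
Continuing this example, you could then summarise the collected identities using the standard library (this follow-up is purely illustrative and not part of pafpy):

from statistics import mean

if identities:  # mean() raises StatisticsError on an empty list
    print(f"Mean BLAST identity: {mean(identities):.4f}")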

Another use case might be that we want to get the identifiers of all records aligned to a specific contig, but only keep the alignments where more than 50% of the query (read) is aligned.

from typing import List
from pafpy import PafFile

path = "path/to/sample.paf"

contig = "chr1"
min_covg = 0.5
identifiers: List[str] = []
with PafFile(path) as paf:
    for record in paf:
        if record.tname == contig and record.query_coverage > min_covg:
            identifiers.append(record.qname)

Advanced

Manual open/close

Sometimes a with context manager is not appropriate, and you would like to open the file and pass it around manually. This can be achieved using the PafFile.open() and PafFile.close() methods. As an example, let's say we define a function that takes a PafFile and counts the number of records in it.

from pafpy import PafFile

def count_records(paf: PafFile) -> int:
    """Counts the number of alignments in a PAF file."""
    return sum(1 for _ in paf)

path = "path/to/sample.paf"

paf = PafFile(path).open()
num_records = count_records(paf)
# it's good practice to close the file yourself rather than rely on the garbage collector
paf.close()
assert paf.closed

Admittedly, this is a contrived example, and we could have still used the context manager, but you get the point 😉.
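
For reference, the equivalent using the context manager would be:

from pafpy import PafFile

path = "path/to/sample.paf"

with PafFile(path) as paf:
    num_records = count_records(paf)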

Compressed input

If your PAF file has been compressed with gzip, you don't need to do anything - just pass the filepath as you would any other. Compression is detected by reading the first two bytes of the file, so the filepath doesn't even need a .gz extension.

from pafpy import PafFile

path = "sample.paf.gz"

with PafFile(path) as paf:
    for record in paf:
        ...  # do something with your records

Working with file streams/objects

An already-open file object can also be used to construct a PafFile. When constructed this way, there is no need to open the PafFile yourself.

with open("sample.paf") as fileobj:
    paf = PafFile(fileobj)
    # no need to open the paf
    for record in paf:
        # do something with record

If you want to use stdin as your PAF source, it is strongly recommended you use "-" to construct the PafFile. The library will auto-detect if stdin is compressed and will decompress it accordingly.

paf = PafFile("-")
# note: stdin is a stream so doesn't need to be opened
for record in paf:
    # do something with record

Advanced users: if you must use sys.stdin rather than "-", then pass sys.stdin.buffer.
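
For example, a minimal sketch of reading PAF data piped in on stdin might look like this:

import sys

from pafpy import PafFile

paf = PafFile(sys.stdin.buffer)
for record in paf:
    ...  # do something with record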

Fetch individual records

for loops aren't the only way of retrieving records in a file. You can also ask for records manually.

from pafpy import PafFile

path = "path/to/sample.paf"

with PafFile(path) as paf:
    record = next(paf)
    # do something with your lonely record
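
As with any Python iterator, next() raises StopIteration once all of the records have been consumed. If you would rather not catch the exception, you can pass a default value instead:

from pafpy import PafFile

path = "path/to/sample.paf"

with PafFile(path) as paf:
    # passing a default avoids StopIteration when the file is exhausted
    record = next(paf, None)
    if record is not None:
        ...  # do something with your lonely record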

Working with strands

There is an enum for representing the strand field - Strand. It has a couple of advantages over just using a str, but the main one is readability. Let's count the number of records that mapped to the reverse strand.

from pafpy import PafFile, Strand

path = "path/to/sample.paf"

num_reverse = 0
with PafFile(path) as paf:
    for record in paf:
        if record.strand is Strand.Reverse:
            num_reverse += 1

You can convert strands to and from str quite easily.

from pafpy import Strand

assert Strand("+") is Strand.Forward
assert str(Strand.Reverse) == "-"
assert str(Strand.Unmapped) == "*"

PAF records

The object you will likely spend the most time with is PafRecord. Refer to the API docs for documentation on all the functions and member variables this class contains.

Let's look at a few use cases. Construction of PafRecords is quite flexible as we wanted to make it very easy to write unit tests and construct arbitrary records without having to create an entire PAF file to do so.
There are two ways to construct a PafRecord:

  1. The default constructor: where you specify each member variable manually.
  2. The PafRecord.from_str() factory constructor: where you create a PafRecord directly from a str.

from pafpy import PafRecord, Strand, Tag

# default constructor
record1 = PafRecord(
        qname="query_name",
        qlen=1239,
        qstart=65,
        qend=1239,
        strand=Strand.Forward,
        tname="target_name",
        tlen=4378340,
        tstart=2555250,
        tend=2556472,
        mlen=1139,
        blen=1228,
        mapq=60,
        tags={"NM": Tag.from_str("NM:i:8")},
)

# from_str factory constructor
line = "query_name\t1239\t65\t1239\t+\ttarget_name\t4378340\t2555250\t2556472\t1139\t1228\t60\tNM:i:8"
record2 = PafRecord.from_str(line)

assert record1 == record2
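
A PafRecord can also be converted back into its tab-delimited str form (as the install check earlier hinted), which makes writing records out again straightforward. The output filename below is just for illustration:

# str() gives the PAF-formatted representation of the record
print(str(record1))

# print() calls str() for us, so records can be written back to a file directly
with open("out.paf", "w") as out_fileobj:
    print(record1, file=out_fileobj)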

SAM-like optional fields/tags

Each additional column after the 12th column in a PAF file is a SAM-like tag. The Tag class tries to make working with tags simple. You can extract tags from a PafRecord using PafRecord.get_tag(), or you may like to construct one yourself.
Let's look at some of these options.

from pafpy import PafRecord, Tag

line = "query_name\t1239\t65\t1239\t+\ttarget_name\t4378340\t2555250\t2556472\t1139\t1228\t60\tNM:i:8"
record = PafRecord.from_str(line)

tag = record.get_tag("NM")
assert tag.tag == "NM"
assert tag.type == "i"
assert tag.value == 8

# we can construct tags from scratch with a str
tag = Tag.from_str("tp:A:P")
assert tag.value == "P"

# or with the default constructor
tag = Tag(tag="de", type="f", value=0.2)

One thing to notice is that the value is converted to the correct type, as specified in the tag specs. However, if you define a Tag with the default constructor, you can bypass this conversion and validation. The Tag.from_str() method validates the tag string to ensure it strictly matches the specifications, so we recommend using it when constructing tags. There may be cases, however, where you don't want to adhere to the specifications; in those cases, use the default constructor.

from pafpy import Tag, InvalidTagFormat

# type 'i' signifies an integer
tag = Tag(tag="NM", type="i", value="foo")
assert tag.value == "foo"

# if we try and do this with from_str we get an error
tag = None
err_msg = ""
try:
    tag = Tag.from_str("NM:i:foo")
except InvalidTagFormat as err:
    err_msg = str(err)
assert err_msg == "VALUE of tag NM:i:foo is not the expected TYPE"

Contributing

Contributions are very welcome. Please ensure all of your contributions are made via a pull request from a fork.

Setup

The recommended development environment is managed with poetry. If you prefer to use something else, that is fine, but be aware it could lead to different environment behaviour. The poetry.lock file under version control ensures you are set up with the same environment as everyone else contributing to the library. Most standard development tasks are managed through a Makefile, which assumes you are using poetry.

After cloning your fork locally and entering the project directory, you can set up the poetry environment with

make install

Formatting

All code is formatted with black and isort.

make fmt

Linting

flake8 handles linting. Please ensure there are no warnings before pushing any work.

make lint

Testing

To test both the code and documentation, run

make test

Code

All unit tests are contained in the tests directory. If you add any code, please ensure it is tested. Tests are handled by pytest. To test just the code, without also testing the documentation, run

make test-code

Docs

The document testing is orchestrated by scripts/mdpydoctest. This script will extract all markdown python code blocks from docstrings in python files and write a test file with a unit test for each snippet. It then runs pytest on this file to ensure all code examples in the docstrings are correct. To test the docs, run

make test-docs

Coverage

Please keep the project's code coverage as high as possible. To check the code coverage, run

make coverage

This should show the coverage on the terminal and also open an HTML report in your web browser.

Documentation

The code is documented using markdown docstrings. The convention this project follows is akin to that used by the Rust programming language. The beginning of the docstring should explain what the function does and whether it returns anything. This is followed, where relevant, by an ## Example section. All examples should be valid and self-contained: they should be able to be copied and pasted into a python shell and executed successfully (assuming the user has pafpy installed). These code snippets must be in a code block annotated as py or python. See the code for examples. If the code being documented can raise an exception, the type(s) of errors should also be documented in an ## Errors section.
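
As a purely illustrative sketch (the function below is made up and not part of pafpy), a docstring following this convention might look like:

# NOTE: a made-up function, used only to illustrate the docstring convention
def gc_fraction(sequence: str) -> float:
    """Returns the fraction of bases in `sequence` that are G or C.

    ## Example
    ```py
    assert gc_fraction("ACGT") == 0.5
    ```

    ## Errors
    A `ZeroDivisionError` is raised if `sequence` is empty.
    """
    gc_count = sum(1 for base in sequence.upper() if base in "GC")
    return gc_count / len(sequence)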

The documentation can be served locally in a browser so that you can view changes in real time by running

make serve-docs

and then navigating to the URL printed in the terminal (most likely http://localhost:8080).

The docs can also be built by running

make docs

Then open docs/index.html to view the documentation that will be deployed on pushing to master.

Committing

There is a convenience rule in the Makefile that can be run prior to committing; it runs most of the above tasks for you.

make precommit

API documentation

pafpy.paffile

This module contains objects for working with PAF files.

The main class of interest here is PafFile. It provides an interface to open/close a PAF file and to iterate over the alignment records within the file.

To use PafFile within your code, import it like so

from pafpy import PafFile

pafpy.pafrecord

This module contains objects for working with single alignment records within a PAF file.

The main class of interest here is PafRecord. It provides a set of member variables relating to each field within the record, as well as some convenience methods for common tasks.

To use PafRecord within your code, import it like so

from pafpy import PafRecord

pafpy.strand

A module containing objects relating to the strand field within a PAF file.

The main class of interest here is Strand. To use it within your code, import it like so

from pafpy import Strand

pafpy.tag

A module for wrapping SAM-like optional fields (tags) generally used in PAF files.

The full specifications for these optional fields can be found in the SAM optional fields (tags) specification.

The main class of interest in this module is Tag. It can be imported into your project like so

from pafpy import Tag

pafpy.utils

This module contains internal utility functions that are unlikely to be of use to users of the library.

from pafpy.utils import first_n_bytes, is_compressed