pbcore.io

The pbcore.io package provides a number of lightweight interfaces to PacBio data files and other standard bioinformatics file formats. Preferred usage is to import classes directly from the pbcore.io package, e.g.:

>>> from pbcore.io import CmpH5Reader

The classes within pbcore.io adhere to a few conventions, in order to provide a uniform API:

  • Each data file type is thought of as a container of a Record type; all Reader classes support streaming access, and CmpH5Reader and BasH5Reader additionally provide random-access to alignments/reads.

  • The constructor argument needed to instantiate Reader and Writer objects can be either a filename (which can be suffixed by ”.gz” for all but the h5 file types) or an open file handle. The reader/writer classes will do what you would expect.

  • The reader/writer classes all support the context manager idiom. Meaning, if you write:

    >>> with CmpH5Reader("aligned_reads.cmp.h5") as r:
    ...   print r[0].read()
    

    the CmpH5Reader object will be automatically closed after the block within the “with” statement is executed.

BAM/cmp.h5 compatibility: quick start

If you have an application that uses the CmpH5Reader and you want to start using BAM files, your best bet is to use the following generic factory functions:

pbcore.io.openIndexedAlignmentFile(fname, referenceFastaFname=None, sharedIndex=None)

Factory function to get a handle to a reader for an alignment file (cmp.h5 or BAM), requiring index capability (built-in for cmp.h5; requires bam.pbi index for BAM

The reference FASTA, if provided, must have a FASTA index (fasta.fai).

pbcore.io.openAlignmentFile(fname, referenceFastaFname=None, sharedIndex=None)

Factory function to get a handle to a reader for an alignment file (cmp.h5 or BAM), not requiring index capability

(A sharedIndex can still be passed for opening a cmp.h5, for which the index is compulsory.)

Note

Since BAM files contain a subset of the information that was present in cmp.h5 files, you will need to provide these functions an indexed FASTA file for your reference. For full compatibility, you need the openIndexedAlignmentFile function, which requires the existence of a bam.pbi file (PacBio BAM index companion file).

bas.h5 / bax.h5 Formats (PacBio basecalls file)

The bas.h5/ bax.h5 file formats are container formats for PacBio reads, built on top of the HDF5 standard. Originally there was just one bas.h5, but eventually “multistreaming” came along and we had to split the file into three bax.h5 parts and one bas.h5 file containing pointers to the parts. Use BasH5Reader to read any kind of bas.h5 file, and BaxH5Reader to read a bax.h5.

Note

In contrast to GFF, for example, the bas.h5 read coordinate system is 0-based and start-inclusive/end-exclusive, i.e. the same convention as Python and the C++ STL.

class pbcore.io.BasH5Reader(*args)

The BasH5Reader provides access to the basecall and pulse metric data encoded in PacBio bas.h5 files. To access data using a BasH5Reader, the standard idiom is:

  1. Index into the BasH5Reader using the ZMW hole number to get a Zmw object:

    >>> b
    <BasH5Reader: m110818_075520_42141_c100129202555500000315043109121112_s1_p0>
    >>> zmw8 = b[8]
    >>> zmw8
    <Zmw: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8>
    
  2. Extract ZmwRead objects from the Zmw object by:

    • Using the .subreads property to extract the subreads, which are the subintervals of the raw read corresponding to the SMRTbell insert:

      >>> subreads = zmw8.subreads
      >>> print subreads
      [<ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/3381_3881>,
      <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/3924_4398>,
      <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/4445_4873>,
      <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/4920_5354>,
      <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/5413_5495>]
      
    • For CCS bas files, using the .ccsRead property to extract the CCS (consensus) read, which is a consensus sequence precomputed from the subreads. Older bas files, from when CCS was computed on the instrument, may contain both CCS- and sub- reads.

      >>> zmw8.ccsRead
      <CCSZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/ccs>
      
    • Use the .read() method to get the full raw read, or .read(start, end) to extract a custom subinterval.

      >>> zmw8.read()
      <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/3381_5495>
      >>> zmw8.read(3390, 3400)
      <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/3390_3400>
      
  3. With a ZmwRead object in hand, extract the desired basecalls and pulse metrics:

    >>> subreads[0].readName
    "m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/3381_3881"
    >>> subreads[0].basecalls()
    "AGCCCCGTCGAGAACATACAGGTGGCCAATTTCACAGCCTCTTGCCTGGGCGATCCCGAACATCGCACCGGA..."
    >>> subreads[0].InsertionQV()
    array([12, 12, 10,  2,  7, 14, 13, 18, 15, 16, 16, 15, 10, 12,  3, 14, ...])
    

Note that not every ZMW on a chip produces usable sequencing data. The BasH5Reader has a property sequencingZmws is a list of the hole numbers where usable sequence was recorded. Iteration over the BasH5Reader object allows you to iterate over the Zmw objects providing usable sequence.

class CCSZmwRead(baxH5, holeNumber, readStart, readEnd)

Class providing access to the CCS (circular consensus sequencing) data calculated for a ZMW.

readName
class BasH5Reader.Zmw(baxH5, holeNumber)

A Zmw represents all data from a ZMW (zero-mode waveguide) hole within a bas.h5 movie file. Accessor methods provide convenient access to the read (or subreads), and to the region table entries for this hole.

adapterRegions

Get adapter regions as intervals, performing clipping to the HQ region

adapterRegionsNoQC

Get adapter regions as intervals, without clipping to the HQ region. Don’t use this unless you know what you’re doing.

adapters

Get the adapter hits as a list of ZmwRead objects. Restricts focus, and clips to, the HQ region. This method can be used by production code.

adaptersNoQC

Get the adapters, including data beyond the bounds of the HQ region.

Warning

It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.

baxH5
ccsRead
holeNumber
hqRegion

Return the HQ region interval.

The HQ region is an interval of basecalls where the basecaller has inferred that a single sequencing reaction is taking place. Secondary analysis should only use subreads within the HQ region. Methods in this class, with the exception of the “NoQC” methods, return data appropriately clipped/filtered to the HQ region.

hqRegionSnr

Return the SNRs, as a vector by channel.

index
insertRegions

Get insert regions as intervals, clipped to the HQ region

insertRegionsNoQC

Get insert regions as intervals, without clipping to the HQ region. Don’t use this unless you know what you’re doing.

listZmwMetrics()

List the available ZMW metrics for this bax.h5 file.

numPasses

Return the number of passes (forward + back) across the SMRTbell insert, used to forming the CCS consensus.

productivity

Return the ‘productivity’ of this ZMW, which is the estimated number of polymerase reactions taking place within it. For example, a doubly-loaded ZMW would have productivity 2.

read(readStart=None, readEnd=None)

Given no arguments, returns the entire (HQ-clipped) polymerase read. With readStart, readEnd arguments, returns the specified extent of the polymerase read.

readNoQC(readStart=None, readEnd=None)

Given no arguments, returns the entire polymerase read, not HQ-clipped. With readStart, readEnd arguments, returns the specified extent of the polymerase read.

Warning

It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.

readScore

Return the “read score”, a prediction of the accuracy (between 0 and 1) of the basecalls from this ZMW, from the ReadScore dataset in the file

regionTable
subreads

Get the subreads as a list of ZmwRead objects. Restricts focus, and clips to, the HQ region. This method can be used by production code.

subreadsNoQC

Get the subreads, including data beyond the bounds of the HQ region.

Warning

It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.

zmwMetric(name)

Return the value of metric ‘name’ from the ZMW metrics.

zmwName
class BasH5Reader.ZmwRead(baxH5, holeNumber, readStart, readEnd)

A ZmwRead represents the data features (basecalls as well as pulse features) recorded from the ZMW, delimited by readStart and readEnd.

DeletionQV()
DeletionTag()
IPD()
InsertionQV()
MergeQV()
PreBaseFrames()
PulseWidth()
QualityValue()
SubstitutionQV()
SubstitutionTag()
WidthInFrames()
basecalls()
baxH5
holeNumber
offsetBegin
offsetEnd
qv(qvName)
readEnd
readName
readStart
zmw
BasH5Reader.allSequencingZmws
BasH5Reader.ccsReads()
BasH5Reader.chemistryBarcodeTriple
BasH5Reader.close()
BasH5Reader.hasConsensusBasecalls
BasH5Reader.hasRawBasecalls
BasH5Reader.movieName
BasH5Reader.parts
BasH5Reader.reads()
BasH5Reader.sequencingChemistry
BasH5Reader.sequencingZmws
BasH5Reader.subreads()
class pbcore.io.BasH5IO.Zmw(baxH5, holeNumber)

A Zmw represents all data from a ZMW (zero-mode waveguide) hole within a bas.h5 movie file. Accessor methods provide convenient access to the read (or subreads), and to the region table entries for this hole.

adapterRegions

Get adapter regions as intervals, performing clipping to the HQ region

adapterRegionsNoQC

Get adapter regions as intervals, without clipping to the HQ region. Don’t use this unless you know what you’re doing.

adapters

Get the adapter hits as a list of ZmwRead objects. Restricts focus, and clips to, the HQ region. This method can be used by production code.

adaptersNoQC

Get the adapters, including data beyond the bounds of the HQ region.

Warning

It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.

baxH5
ccsRead
holeNumber
hqRegion

Return the HQ region interval.

The HQ region is an interval of basecalls where the basecaller has inferred that a single sequencing reaction is taking place. Secondary analysis should only use subreads within the HQ region. Methods in this class, with the exception of the “NoQC” methods, return data appropriately clipped/filtered to the HQ region.

hqRegionSnr

Return the SNRs, as a vector by channel.

index
insertRegions

Get insert regions as intervals, clipped to the HQ region

insertRegionsNoQC

Get insert regions as intervals, without clipping to the HQ region. Don’t use this unless you know what you’re doing.

listZmwMetrics()

List the available ZMW metrics for this bax.h5 file.

numPasses

Return the number of passes (forward + back) across the SMRTbell insert, used to forming the CCS consensus.

productivity

Return the ‘productivity’ of this ZMW, which is the estimated number of polymerase reactions taking place within it. For example, a doubly-loaded ZMW would have productivity 2.

read(readStart=None, readEnd=None)

Given no arguments, returns the entire (HQ-clipped) polymerase read. With readStart, readEnd arguments, returns the specified extent of the polymerase read.

readNoQC(readStart=None, readEnd=None)

Given no arguments, returns the entire polymerase read, not HQ-clipped. With readStart, readEnd arguments, returns the specified extent of the polymerase read.

Warning

It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.

readScore

Return the “read score”, a prediction of the accuracy (between 0 and 1) of the basecalls from this ZMW, from the ReadScore dataset in the file

regionTable
subreads

Get the subreads as a list of ZmwRead objects. Restricts focus, and clips to, the HQ region. This method can be used by production code.

subreadsNoQC

Get the subreads, including data beyond the bounds of the HQ region.

Warning

It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.

zmwMetric(name)

Return the value of metric ‘name’ from the ZMW metrics.

zmwName
class pbcore.io.BasH5IO.ZmwRead(baxH5, holeNumber, readStart, readEnd)

A ZmwRead represents the data features (basecalls as well as pulse features) recorded from the ZMW, delimited by readStart and readEnd.

DeletionQV()
DeletionTag()
IPD()
InsertionQV()
MergeQV()
PreBaseFrames()
PulseWidth()
QualityValue()
SubstitutionQV()
SubstitutionTag()
WidthInFrames()
basecalls()
baxH5
holeNumber
offsetBegin
offsetEnd
qv(qvName)
readEnd
readName
readStart
zmw

BAM format

The BAM format is a standard format described aligned and unaligned reads. PacBio is transitioning from the cmp.h5 format to the BAM format. For basic functionality, one should use BamReader; for full compatibility with the CmpH5Reader API (including alignment index functionality) one should use IndexedBamReader, which requires the auxiliary PacBio BAM index file (bam.pbi file).

class pbcore.io.BamAlignment(bamReader, pysamAlignedRead, rowNumber=None)
DeletionQV(aligned=True, orientation='native')
DeletionTag(aligned=True, orientation='native')
HoleNumber
IPD(aligned=True, orientation='native')
InsertionQV(aligned=True, orientation='native')
MapQV
MergeQV(aligned=True, orientation='native')
PulseWidth(aligned=True, orientation='native')
SubstitutionQV(aligned=True, orientation='native')
barcode
barcodeName
clippedTo(bamAln, *args, **kwargs)

Return a new BamAlignment that refers to a subalignment of this alignment, as induced by clipping to reference coordinates refStart to refEnd.

Warning

This function takes time linear in the length of the alignment.

identity
isCCS
isForwardStrand
isMapped
isReverseStrand
isUnmapped
mapQV
movieName
numPasses
pulseFeature(featureName, aligned=True, orientation='native')

Retrieve the pulse feature as indicated. - aligned : whether gaps should be inserted to reflect the alignment - orientation: “native” or “genomic”

Note that this function assumes the the feature is stored in native orientation in the file, so it is not appropriate to use this method to fetch the read or the qual, which are oriented genomically in the file.

qEnd
qId
qLen
qName
qStart
queryEnd
queryName
queryStart
read(aligned=True, orientation='native')
readGroupInfo
readPositions(aligned=True, orientation='native')

Returns an array of read positions.

If aligned is True, the array has the same length as the alignment and readPositions[i] = read position of the i’th column in the oriented alignment.

If aligned is False, the array has the same length as the mapped reference segment and readPositions[i] = read position of the i’th base in the oriented reference segment.

readType
reader
reference(bamAln, *args, **kwargs)
referenceId
referenceInfo
referenceName
referencePositions(bamAln, *args, **kwargs)

Returns an array of reference positions.

If aligned is True, the array has the same length as the alignment and referencePositions[i] = reference position of the i’th column in the oriented alignment.

If aligned is False, the array has the same length as the read and referencePositions[i] = reference position of the i’th base in the oriented read.

sequencingChemistry
tId
transcript(orientation='native', style='gusfield')

A text representation of the alignment moves (see Gusfield). This can be useful in pretty-printing an alignment.

unrolledCigar(bamAln, *args, **kwargs)

Run-length decode the CIGAR encoding, and orient. Clipping ops are removed.

zScore
class pbcore.io.BamReader(fname, referenceFastaFname=None)

Reader for a BAM with a bam.bai (SAMtools) index, but not a bam.pbi (PacBio) index. Supports basic BAM operations.

index
readsInRange(winId, winStart, winEnd, justIndices=False)
class pbcore.io.IndexedBamReader(fname, referenceFastaFname=None, sharedIndex=None)

A IndexedBamReader is a BAM reader class that uses the bam.pbi (PacBio BAM index) file to enable random access by “row number” and to provide access to precomputed semantic information about the BAM records

atRowNumber(rn)
index
readsInRange(winId, winStart, winEnd, justIndices=False)

cmp.h5 format (legacy PacBio alignment file)

The cmp.h5 file format is an alignment format built on top of the HDF5 standard. It is a simple container format for PacBio alignment records.

Note

In contrast to GFF, for example, all cmp.h5 coordinate systems (refererence, read) are 0-based and start-inclusive/end-exclusive, i.e. the same convention as Python and the C++ STL.

class pbcore.io.CmpH5Reader(filenameOrH5File, sharedIndex=None)

The CmpH5Reader class is a lightweight and efficient API for accessing PacBio cmp.h5 alignment files. Alignment records can be obtained via random access (via Python indexing/slicing), iteration, or range queries (via readsInRange).

>>> import pbcore.data                # For an example data file
>>> from pbcore.io import CmpH5Reader
>>> filename = pbcore.data.getCmpH5()
>>> c = CmpH5Reader(filename)
>>> c[0]
CmpH5 alignment: -    1          0        290
>>> c[0:2]  
[CmpH5 alignment: -    1          0        290,
 CmpH5 alignment: +    1          0        365]
>>> sum(aln.readLength for aln in c)
26103
ReadGroupID
alignmentIndex

Return the alignment index data structure, which is the central data structure in the cmp.h5 file, as a numpy recarray.

The dtype of the recarray is:

dtype([('AlnID', int),
       ('AlnGroupID', int),
       ('MovieID', int),
       ('RefGroupID', int),
       ('tStart', int),
       ('tEnd', int),
       ('RCRefStrand', int),
       ('HoleNumber', int),
       ('SetNumber', int),
       ('StrobeNumber', int),
       ('MoleculeID', int),
       ('rStart', int),
       ('rEnd', int),
       ('MapQV', int),
       ('nM', int),
       ('nMM', int),
       ('nIns', int),
       ('nDel', int),
       ('Offset_begin', int),
       ('Offset_end', int),
       ('nBackRead', int),
       ('nReadOverlap', int)])

Access to the alignment index is provided to allow users to perform vectorized computations over all alignments in the file.

>>> c.alignmentIndex.MapQV[0:10]
array([254, 254,   0, 254, 254, 254, 254, 254, 254, 254], dtype=uint32)

Alignment index fields are also exposed as fields of the CmpH5Reader object, allowing a convenient shorthand.

>>> c.MapQV[0:10]
array([254, 254,   0, 254, 254, 254, 254, 254, 254, 254], dtype=uint32)

The alignment index row for a given alignment can also be accessed directly as a field of a CmpH5Alignment object

>>> c[26].MapQV
254
barcode

Returns a dict mapping of barcode name to integer barcode. Behavior undefined if file is not barcoded.

barcodeName

Returns a dict mapping of barcode integer id to name. Behavior undefined if file is not barcoded.

barcodes

Returns an array of barcode integer ids, of the same length as the alignment array.

Behavior undefined if file is not barcoded.

close()
hasPulseFeature(featureName)

Are the datasets for pulse feature featureName loaded in this file? Specifically, is it loaded for all movies within this cmp.h5?

>>> c.hasPulseFeature("InsertionQV")
True
>>> c.hasPulseFeature("MergeQV")
False
holeNumber
index
isBarcoded
isEmpty
isSorted
mapQV
movieInfo(movieId)

Deprecated since version 0.9.2: Use readGroupInfo, which is compatible with BAM usage

Access information about a movie whose reads are represented in the file.

The returned value is a record from the movieInfoTable

movieInfoTable

Deprecated since version 0.9.2: Use readGroupTable, which is compatible with BAM usage

Return a numpy recarray summarizing source movies for the reads in this file.

The dtype of this recarray is:

dtype([('ID', 'int'),
       ('Name', 'string'),
       ('FrameRate', 'float'),
       ('TimeScale', 'float')])

TimeScale is the factor to multiply time values (IPD, PulseWidth) by in order to get times in seconds. The FrameRate field should not be used directly as it will be NaN for pre-1.3 cmp.h5 files.

movieNames
pulseFeaturesAvailable()

What pulse features are available in this cmp.h5 file?

>>> c.pulseFeaturesAvailable()
[u'QualityValue', u'IPD', u'PulseWidth', u'InsertionQV', u'DeletionQV']
qId
readGroupInfo(rgId)

Access information about a movie whose reads are represented in the file.

The returned value is a record from the readGroupTable

readGroupTable
readType

Either “standard” or “CCS”, indicating the type of reads that were aligned to the reference.

>>> c.readType
'standard'
readsInRange(refKey, refStart, refEnd, justIndices=False)

Get a list of reads overlapping (i.e., intersecting—not necessarily spanning) a given reference window.

If justIndices is False, the list returned will contain CmpH5Alignment objects.

If justIndices is True, the list returned will contain row numbers in the alignment index table. Slicing the CmpH5Reader object with these row numbers can be used to get the corresponding CmpH5Alignment objects.

The contig key can be either the RefID, or the short name (FASTA header up to first space).

>>> c.readsInRange(1, 0, 1000) 
[CmpH5 alignment: -    1          0        290,
 CmpH5 alignment: +    1          0        365]

>>> rowNumbers = c.readsInRange(1, 0, 1000, justIndices=True)
>>> rowNumbers
array([0, 1], dtype=uint32)
referenceInfo(key)

Access information about a reference that was aligned against. Key can be reference ID (integer), name (“ref000001”), full name (e.g. “lambda_NEB3011”), truncated full name (full name up to the first whitespace, following the samtools convention) or MD5 sum hex string (e.g. “a1319ff90e994c8190a4fe6569d0822a”).

The returned value is a record from the :ref:referenceInfoTable .

>>> ri = c.referenceInfo("ref000001")
>>> ri.FullName
'lambda_NEB3011'
>>> ri.MD5
'a1319ff90e994c8190a4fe6569d0822a'
referenceInfoTable

Return a numpy recarray summarizing the references that were aligned against.

The dtype of this recarray is:

dtype([('RefInfoID', int),
       ('ID', int),
       ('Name', string),
       ('FullName', string),
       ('Length', int),
       ('MD5', string),
       ('StartRow', int),
       ('EndRow', int) ])

(the last two columns are omitted for unsorted cmp.h5 files).

sequencingChemistry
softwareVersion(programName)

Return the version of program programName that processed this file.

version

The CmpH5 format version string.

>>> c.version
'1.2.0.SF'
versionAtLeast(minimalVersion)

Compare the file version to minimalVersion.

>>> c.versionAtLeast("1.3.0")
False
class pbcore.io.CmpH5Alignment(cmph5, rowNumber)

A lightweight class representing a single alignment record in a CmpH5 file, providing access to all columns of a single row of the alignment index, and on-demand access to the corresponding sequence and pulse features.

CmpH5Alignment objects are obtained by slicing a CmpH5Reader object:

>>> c[26]
CmpH5 alignment: +    1       7441       7699

>>> print c[26]
CmpH5 alignment: +    1       7441       7699

Read:        m110818_075520_42141_c100129202555500000315043109121112_s2_p0/1009/44_322
Reference:   lambda_NEB3011

Read length: 278
Concordance: 0.871

  12345678901234567890123456789001223456789012345678890112345678990112344567890011
  AACTGGTCACGGTCGTGGCACTGGTGAAG-CT-GCATACTGATGCACTT-CAC-GCCACGCG-GG-ATG-AACCTG-T-G
  |||||||  ||||  ||||||||| |||| || ||||||||| |||||  ||| |||||||| || ||| |||||| | |
  AACTGGT--CGGT--TGGCACTGG-GAAGCCTTGCATACTGA-GCACT-CCACGGCCACGCGGGGAATGAAACCTGGTGG


  23456789012345678900123456678901234456789012345678901234566789011234556789012345
  GCATTTGTGCTGCCGGGA-ACGGCG-TTTCGTGT-CTCTGCCGGTGTGGCAGCCGAA-ATGAC-AGAG-CGCGGCCTGGC
  |||||||||||||||||| |||||| |||||| | |||||||||||||||||||||| ||||| |||| |||||||||||
  GCATTTGTGCTGCCGGGAAACGGCGTTTTCGT-TCCTCTGCCGGTGTGGCAGCCGAAAATGACAAGAGCCGCGGCCTGGC


  67899012345677890123456789901123456789901233456789012345678901234556778901223455
  CAG-AATGCAAT-AACGGGAGGCGC-TG-TGGCTGAT-TTCG-ATAACCTGTTCGATGCTGCCAT-TG-CCCGC-GCC-G
  ||| |||||||| |||||||||||| || |||||||| |||| |||||||||||||||||||||| || ||||| ||| |
  CAGAAATGCAATAAACGGGAGGCGCTTGCTGGCTGATTTTCGAATAACCTGTTCGATGCTGCCATTTGACCCGCGGCCGG


  6678901234567889012345667890123456789012345678
  -ATGAAACGATAC-GCGGGTAC-ATGGGAACGTCAGCCACCATTAC
   |||||||||||| |||||||| |||||||||||||||||||||||
  AATGAAACGATACGGCGGGTACAATGGGAACGTCAGCCACCATTAC

The orientation argument to some data access methods determines how reverse-strand alignments are returned to the user. .cmp.h5 files natively encode reverse strand reads in read-order, uncomplemented, with the reference reverse-complemented. Most analysis applications will want to use the data in this order, which we term the NATIVE orientation.

Some applications that involve collating or displaying the reads aligned to the reference genome want the reference to be presented in its genomic order, and the read to be reverse-complemented. We term this GENOMIC orientation.

Methods prefixed with aligned return data (bases or features) that include gaps, which are encoded according to the data type. Methods not prefixed with aligned excise the gaps.

Sequence and pulse features are not cached.

DeletionQV(aligned=True, orientation='native')
DeletionTag(aligned=True, orientation='native')
IPD(aligned=True, orientation='native')
InsertionQV(aligned=True, orientation='native')
MergeQV(aligned=True, orientation='native')
PulseWidth(aligned=True, orientation='native')
QualityValue(aligned=True, orientation='native')
ReadGroupID
SubstitutionQV(aligned=True, orientation='native')
aEnd
aStart
accuracy

Return the identity of this alignment, calculated as (#matchs / read length)

Deprecated since version 0.9.5: Use identity

alignmentArray(orientation='native')

Direct access to the raw, encoded aligment array, which is a packed representation of the aligned read and reference.

barcode

The barcode ID (integer key) for this alignment’s read Behavior undefined if file is not barcoded.

barcodeName

The barcode name (string) for this alignment’s read Behavior undefined if file is not barcoded.

clippedTo(refStart, refEnd)

Return a new CmpH5Alignment that refers to a subalignment of this alignment, as induced by clipping to reference coordinates refStart to refEnd.

Warning

This function takes time linear in the length of the alignment.

cmpH5
holeNumber
identity

Return the identity of this alignment, calculated as (#matchs / read length)

>>> c[26].identity
0.87050359712230219
isForwardStrand
isReverseStrand
mapQV
movieInfo

Deprecated since version 0.9.2: Use readGroupInfo, which is compatible with BAM usage

Returns a record (extracted from the cmph5’s movieInfoTable) containing information about the movie that the read was extracted from. This record should be accessed using dot-notation, according to the column names documented in movieInfoTable.

movieName
numPasses

(CCS only) The number subreads that were used to produce this CCS read.

pulseFeature(featureName, aligned=True, orientation='native')

Access a pulse feature by name.

qId
read(aligned=True, orientation='native')

Return the read portion of the alignment as a string.

If aligned is true, the aligned representation is returned, including gaps; otherwise the unaligned read basecalls are returned.

If orientation is “native”, the returned read bases are presented in the order they were read by the sequencing machine. If orientation is “genomic”, the returned read bases are presented in such a way as to collate with the forward strand of the reference—which requires reverse complementation of reverse-strand reads.

readGroupInfo

Returns the corresponding record from the readGroupTable.

readPositions(orientation='native')

Returns an array of read positions such that readPositions[i] = read position of the i’th column in the alignment. Insertions are grouped with the following read base, in the specified orientation.

Length of output array = length of alignment

readType
reader
reference(aligned=True, orientation='native')

Return the read portion of the alignment as a string.

If aligned is true, the aligned representation of the reference is returned, including gaps; otherwise the unaligned reference bases are returned.

If orientation is “native”, the reference is presented in the order it is stored in the cmp.h5 file—for reverse-strand reads, the reference is reverse-complemented. If orientation is “genomic”, the forward strand reference is returned.

referenceId
referenceInfo
referenceName
referencePositions(orientation='native')

Returns an array of reference positions such that referencePositions[i] = reference position of the i’th column in the alignment. Insertions are grouped with the following reference base, in the specified orientation.

Length of output array = length of alignment

rowNumber
sequencingChemistry
similarity

Replicates the pctsimilarity field from blasr, calculated as #matches/mean(aligned_length, read_length)

transcript(orientation='native', style='gusfield')

A text representation of the alignment moves (see Gusfield). This can be useful in pretty-printing an alignment.

zScore

(PacBio internal files only)

The z-score of the alignment, using a null model of a random sequence alignment.

FASTA Format

FASTA is a standard format for sequence data. We recommmend using the FastaTable class, which provides random access to indexed FASTA files (using the conventional SAMtools “fai” index).

pbcore.io.FastaTable

alias of IndexedFastaReader

class pbcore.io.FastaRecord(header, sequence)

A FastaRecord object models a named sequence in a FASTA file.

comment

The comment associated with the sequence in the FASTA file, equal to the contents of the FASTA header following the first whitespace

classmethod fromString(s)

Interprets a string as a FASTA record. Does not make any assumptions about wrapping of the sequence string.

header

The header of the sequence in the FASTA file, equal to the entire first line of the FASTA record following the ‘>’ character.

Warning

You should almost certainly be using “id”, not “header”.

id

The id of the sequence in the FASTA file, equal to the FASTA header up to the first whitespace.

length

Get the length of the FASTA sequence

name

DEPRECATED: The name of the sequence in the FASTA file, equal to the entire FASTA header following the ‘>’ character

reverseComplement(preserveHeader=False)

Return a new FastaRecord with the reverse-complemented DNA sequence. Optionally, supply a name

sequence

The sequence for the record as present in the FASTA file. (Newlines are removed but otherwise no sequence normalization is performed).

class pbcore.io.FastaReader(f)

Streaming reader for FASTA files, useable as a one-shot iterator over FastaRecord objects. Agnostic about line wrapping.

Example:

>>> from pbcore.io import FastaReader
>>> from pbcore import data
>>> filename = data.getTinyFasta()
>>> r = FastaReader(filename)
>>> for record in r:
...     print record.header, len(record.sequence)
ref000001|EGFR_Exon_2 183
ref000002|EGFR_Exon_3 203
ref000003|EGFR_Exon_4 215
ref000004|EGFR_Exon_5 157
>>> r.close()
class pbcore.io.FastaWriter(f)

A FASTA file writer class

Example:

>>> from pbcore.io import FastaWriter
>>> with FastaWriter("output.fasta.gz") as writer:
...     writer.writeRecord("dog", "GATTACA")
...     writer.writeRecord("cat", "CATTACA")

(Notice that underlying file will be automatically closed after exit from the with block.)

writeRecord(*args)

Write a FASTA record to the file. If given one argument, it is interpreted as a FastaRecord. Given two arguments, they are interpreted as the name and the sequence.

FASTQ Format

FASTQ is a standard format for sequence data with associated quality scores.

class pbcore.io.FastqRecord(header, sequence, quality=None, qualityString=None)

A FastqRecord object models a named sequence and its quality values in a FASTQ file. For reference consult Wikipedia’s FASTQ entry. We adopt the Sanger encoding convention, allowing the encoding of QV values in [0, 93] using ASCII 33 to 126. We only support FASTQ files in the four-line convention (unwrapped). Wrapped FASTQ files are generally considered a bad idea as the @, + delimiters can also appear in the quality string, thus parsing cannot be done safely.

comment

The comment associated with the sequence in the FASTQ file, equal to the contents of the FASTQ header following the first whitespace

classmethod fromString(s)

Interprets a string as a FASTQ record. Only supports four-line format, as wrapped FASTQs can’t easily be safely parsed.

header

The header of the sequence in the FASTQ file

id

The id of the sequence in the FASTQ file, equal to the FASTQ header up to the first whitespace.

length

The length of the sequence

name

DEPRECATED: The name of the sequence in the FASTQ file

quality

The quality values, as an array of integers

qualityString

The quality values as an ASCII-encoded string

reverseComplement(preserveHeader=False)

Return a new FastaRecord with the reverse-complemented DNA sequence. Optionally, supply a name

sequence

The sequence for the record as present in the FASTQ file.

class pbcore.io.FastqReader(f)

Reader for FASTQ files, useable as a one-shot iterator over FastqRecord objects. FASTQ files must follow the four-line convention.

class pbcore.io.FastqWriter(f)

A FASTQ file writer class

Example:

>>> from pbcore.io import FastqWriter
>>> with FastqWriter("output.fq.gz") as writer:
...     writer.writeRecord("dog", "GATTACA", [35]*7)
...     writer.writeRecord("cat", "CATTACA", [35]*7)

(Notice that underlying file will be automatically closed after exit from the with block.)

writeRecord(*args)

Write a FASTQ record to the file. If given one argument, it is interpreted as a FastqRecord. Given three arguments, they are interpreted as the name, sequence, and quality.

GFF Format (Version 3)

The GFF format is an open and flexible standard for representing genomic features.

class pbcore.io.Gff3Record(seqid, start, end, type, score='.', strand='.', phase='.', source='.', attributes=())

Class for GFF record, providing uniform access to standard GFF fields and attributes.

>>> from pbcore.io import Gff3Record
>>> record = Gff3Record("chr1", 10, 11, "insertion",
...                     attributes=[("foo", "1"), ("bar", "2")])
>>> record.start
10
>>> record.foo
'1'
>>> record.baz = 3
>>> del record.baz

Attribute access using record.fieldName notation raises ValueError if an attribute named fieldName doesn’t exist. Use:

>>> record.get(fieldName)

to fetch a field or attribute with None default or:

>>> record.get(fieldName, defaultValue)

to fetch the field or attribute with a custom default.

copy()

Return a shallow copy

classmethod fromString(s)

Parse a string as a GFF record. Trailing whitespace is ignored.

class pbcore.io.GffReader(f)

A GFF file reader class

class pbcore.io.GffWriter(f)

A GFF file writer class