pbcore.io.align package

Submodules

pbcore.io.align.BamAlignment module

class pbcore.io.align.BamAlignment.BamAlignment(bamReader, pysamAlignedRead, rowNumber=None)

Bases: pbcore.io.align._AlignmentMixin.AlignmentRecordMixin

DeletionQV(aligned=True, orientation='native')
DeletionTag(aligned=True, orientation='native')
HoleNumber
IPD(aligned=True, orientation='native')
InsertionQV(aligned=True, orientation='native')
MapQV
MergeQV(aligned=True, orientation='native')
PulseWidth(aligned=True, orientation='native')
SubstitutionQV(aligned=True, orientation='native')
barcode
barcodeName
clippedTo(bamAln, *args, **kwargs)

Return a new BamAlignment that refers to a subalignment of this alignment, as induced by clipping to reference coordinates refStart to refEnd.

Warning

This function takes time linear in the length of the alignment.

identity
isCCS
isForwardStrand
isMapped
isReverseStrand
isUnmapped
mapQV
movieName
numPasses
pulseFeature(featureName, aligned=True, orientation='native')

Retrieve the pulse feature as indicated. - aligned : whether gaps should be inserted to reflect the alignment - orientation: “native” or “genomic”

Note that this function assumes the the feature is stored in native orientation in the file, so it is not appropriate to use this method to fetch the read or the qual, which are oriented genomically in the file.

qEnd
qId
qLen
qName
qStart
queryEnd
queryName
queryStart
read(aligned=True, orientation='native')
readGroupInfo
readPositions(aligned=True, orientation='native')

Returns an array of read positions.

If aligned is True, the array has the same length as the alignment and readPositions[i] = read position of the i’th column in the oriented alignment.

If aligned is False, the array has the same length as the mapped reference segment and readPositions[i] = read position of the i’th base in the oriented reference segment.

readType
reader
reference(bamAln, *args, **kwargs)
referenceId
referenceInfo
referenceName
referencePositions(bamAln, *args, **kwargs)

Returns an array of reference positions.

If aligned is True, the array has the same length as the alignment and referencePositions[i] = reference position of the i’th column in the oriented alignment.

If aligned is False, the array has the same length as the read and referencePositions[i] = reference position of the i’th base in the oriented read.

sequencingChemistry
tId
transcript(orientation='native', style='gusfield')

A text representation of the alignment moves (see Gusfield). This can be useful in pretty-printing an alignment.

unrolledCigar(bamAln, *args, **kwargs)

Run-length decode the CIGAR encoding, and orient. Clipping ops are removed.

zScore

pbcore.io.align.BamIO module

class pbcore.io.align.BamIO.BamReader(fname, referenceFastaFname=None)

Bases: pbcore.io.align.BamIO._BamReaderBase, pbcore.io.align._AlignmentMixin.AlignmentReaderMixin

Reader for a BAM with a bam.bai (SAMtools) index, but not a bam.pbi (PacBio) index. Supports basic BAM operations.

index
readsInRange(winId, winStart, winEnd, justIndices=False)
class pbcore.io.align.BamIO.IndexedBamReader(fname, referenceFastaFname=None, sharedIndex=None)

Bases: pbcore.io.align.BamIO._BamReaderBase, pbcore.io.align._AlignmentMixin.IndexedAlignmentReaderMixin

A IndexedBamReader is a BAM reader class that uses the bam.pbi (PacBio BAM index) file to enable random access by “row number” and to provide access to precomputed semantic information about the BAM records

atRowNumber(rn)
index
readsInRange(winId, winStart, winEnd, justIndices=False)

pbcore.io.align.BlasrIO module

class pbcore.io.align.BlasrIO.M4Record

Bases: object

Record for alignment summary record output from BLASR -m 4 option

classmethod fromString(s)
class pbcore.io.align.BlasrIO.M4Reader(f)

Bases: pbcore.io.base.ReaderBase

Reader for -m 4 formatted alignment summary information from BLASR

class pbcore.io.align.BlasrIO.M5Record

Bases: object

Record for alignment summary record output from BLASR -m 5 option

classmethod fromString(s)
class pbcore.io.align.BlasrIO.M5Reader(f)

Bases: pbcore.io.base.ReaderBase

Reader for -m 5 formatted alignment summary information from BLASR

pbcore.io.align.CmpH5IO module

class pbcore.io.align.CmpH5IO.CmpH5Reader(filenameOrH5File, sharedIndex=None)

Bases: pbcore.io.base.ReaderBase, pbcore.io.align._AlignmentMixin.IndexedAlignmentReaderMixin

The CmpH5Reader class is a lightweight and efficient API for accessing PacBio cmp.h5 alignment files. Alignment records can be obtained via random access (via Python indexing/slicing), iteration, or range queries (via readsInRange).

>>> import pbcore.data                # For an example data file
>>> from pbcore.io import CmpH5Reader
>>> filename = pbcore.data.getCmpH5()
>>> c = CmpH5Reader(filename)
>>> c[0]
CmpH5 alignment: -    1          0        290
>>> c[0:2]  
[CmpH5 alignment: -    1          0        290,
 CmpH5 alignment: +    1          0        365]
>>> sum(aln.readLength for aln in c)
26103
ReadGroupID
alignmentIndex

Return the alignment index data structure, which is the central data structure in the cmp.h5 file, as a numpy recarray.

The dtype of the recarray is:

dtype([('AlnID', int),
       ('AlnGroupID', int),
       ('MovieID', int),
       ('RefGroupID', int),
       ('tStart', int),
       ('tEnd', int),
       ('RCRefStrand', int),
       ('HoleNumber', int),
       ('SetNumber', int),
       ('StrobeNumber', int),
       ('MoleculeID', int),
       ('rStart', int),
       ('rEnd', int),
       ('MapQV', int),
       ('nM', int),
       ('nMM', int),
       ('nIns', int),
       ('nDel', int),
       ('Offset_begin', int),
       ('Offset_end', int),
       ('nBackRead', int),
       ('nReadOverlap', int)])

Access to the alignment index is provided to allow users to perform vectorized computations over all alignments in the file.

>>> c.alignmentIndex.MapQV[0:10]
array([254, 254,   0, 254, 254, 254, 254, 254, 254, 254], dtype=uint32)

Alignment index fields are also exposed as fields of the CmpH5Reader object, allowing a convenient shorthand.

>>> c.MapQV[0:10]
array([254, 254,   0, 254, 254, 254, 254, 254, 254, 254], dtype=uint32)

The alignment index row for a given alignment can also be accessed directly as a field of a CmpH5Alignment object

>>> c[26].MapQV
254
barcode

Returns a dict mapping of barcode name to integer barcode. Behavior undefined if file is not barcoded.

barcodeName

Returns a dict mapping of barcode integer id to name. Behavior undefined if file is not barcoded.

barcodes

Returns an array of barcode integer ids, of the same length as the alignment array.

Behavior undefined if file is not barcoded.

close()
hasPulseFeature(featureName)

Are the datasets for pulse feature featureName loaded in this file? Specifically, is it loaded for all movies within this cmp.h5?

>>> c.hasPulseFeature("InsertionQV")
True
>>> c.hasPulseFeature("MergeQV")
False
holeNumber
index
isBarcoded
isEmpty
isSorted
mapQV
movieInfo(movieId)

Deprecated since version 0.9.2: Use readGroupInfo, which is compatible with BAM usage

Access information about a movie whose reads are represented in the file.

The returned value is a record from the movieInfoTable

movieInfoTable

Deprecated since version 0.9.2: Use readGroupTable, which is compatible with BAM usage

Return a numpy recarray summarizing source movies for the reads in this file.

The dtype of this recarray is:

dtype([('ID', 'int'),
       ('Name', 'string'),
       ('FrameRate', 'float'),
       ('TimeScale', 'float')])

TimeScale is the factor to multiply time values (IPD, PulseWidth) by in order to get times in seconds. The FrameRate field should not be used directly as it will be NaN for pre-1.3 cmp.h5 files.

movieNames
pulseFeaturesAvailable()

What pulse features are available in this cmp.h5 file?

>>> c.pulseFeaturesAvailable()
[u'QualityValue', u'IPD', u'PulseWidth', u'InsertionQV', u'DeletionQV']
qId
readGroupInfo(rgId)

Access information about a movie whose reads are represented in the file.

The returned value is a record from the readGroupTable

readGroupTable
readType

Either “standard” or “CCS”, indicating the type of reads that were aligned to the reference.

>>> c.readType
'standard'
readsInRange(refKey, refStart, refEnd, justIndices=False)

Get a list of reads overlapping (i.e., intersecting—not necessarily spanning) a given reference window.

If justIndices is False, the list returned will contain CmpH5Alignment objects.

If justIndices is True, the list returned will contain row numbers in the alignment index table. Slicing the CmpH5Reader object with these row numbers can be used to get the corresponding CmpH5Alignment objects.

The contig key can be either the RefID, or the short name (FASTA header up to first space).

>>> c.readsInRange(1, 0, 1000) 
[CmpH5 alignment: -    1          0        290,
 CmpH5 alignment: +    1          0        365]

>>> rowNumbers = c.readsInRange(1, 0, 1000, justIndices=True)
>>> rowNumbers
array([0, 1], dtype=uint32)
referenceInfo(key)

Access information about a reference that was aligned against. Key can be reference ID (integer), name (“ref000001”), full name (e.g. “lambda_NEB3011”), truncated full name (full name up to the first whitespace, following the samtools convention) or MD5 sum hex string (e.g. “a1319ff90e994c8190a4fe6569d0822a”).

The returned value is a record from the :ref:referenceInfoTable .

>>> ri = c.referenceInfo("ref000001")
>>> ri.FullName
'lambda_NEB3011'
>>> ri.MD5
'a1319ff90e994c8190a4fe6569d0822a'
referenceInfoTable

Return a numpy recarray summarizing the references that were aligned against.

The dtype of this recarray is:

dtype([('RefInfoID', int),
       ('ID', int),
       ('Name', string),
       ('FullName', string),
       ('Length', int),
       ('MD5', string),
       ('StartRow', int),
       ('EndRow', int) ])

(the last two columns are omitted for unsorted cmp.h5 files).

sequencingChemistry
softwareVersion(programName)

Return the version of program programName that processed this file.

version

The CmpH5 format version string.

>>> c.version
'1.2.0.SF'
versionAtLeast(minimalVersion)

Compare the file version to minimalVersion.

>>> c.versionAtLeast("1.3.0")
False
class pbcore.io.align.CmpH5IO.CmpH5Alignment(cmph5, rowNumber)

Bases: pbcore.io.align._AlignmentMixin.AlignmentRecordMixin

A lightweight class representing a single alignment record in a CmpH5 file, providing access to all columns of a single row of the alignment index, and on-demand access to the corresponding sequence and pulse features.

CmpH5Alignment objects are obtained by slicing a CmpH5Reader object:

>>> c[26]
CmpH5 alignment: +    1       7441       7699

>>> print c[26]
CmpH5 alignment: +    1       7441       7699

Read:        m110818_075520_42141_c100129202555500000315043109121112_s2_p0/1009/44_322
Reference:   lambda_NEB3011

Read length: 278
Concordance: 0.871

  12345678901234567890123456789001223456789012345678890112345678990112344567890011
  AACTGGTCACGGTCGTGGCACTGGTGAAG-CT-GCATACTGATGCACTT-CAC-GCCACGCG-GG-ATG-AACCTG-T-G
  |||||||  ||||  ||||||||| |||| || ||||||||| |||||  ||| |||||||| || ||| |||||| | |
  AACTGGT--CGGT--TGGCACTGG-GAAGCCTTGCATACTGA-GCACT-CCACGGCCACGCGGGGAATGAAACCTGGTGG


  23456789012345678900123456678901234456789012345678901234566789011234556789012345
  GCATTTGTGCTGCCGGGA-ACGGCG-TTTCGTGT-CTCTGCCGGTGTGGCAGCCGAA-ATGAC-AGAG-CGCGGCCTGGC
  |||||||||||||||||| |||||| |||||| | |||||||||||||||||||||| ||||| |||| |||||||||||
  GCATTTGTGCTGCCGGGAAACGGCGTTTTCGT-TCCTCTGCCGGTGTGGCAGCCGAAAATGACAAGAGCCGCGGCCTGGC


  67899012345677890123456789901123456789901233456789012345678901234556778901223455
  CAG-AATGCAAT-AACGGGAGGCGC-TG-TGGCTGAT-TTCG-ATAACCTGTTCGATGCTGCCAT-TG-CCCGC-GCC-G
  ||| |||||||| |||||||||||| || |||||||| |||| |||||||||||||||||||||| || ||||| ||| |
  CAGAAATGCAATAAACGGGAGGCGCTTGCTGGCTGATTTTCGAATAACCTGTTCGATGCTGCCATTTGACCCGCGGCCGG


  6678901234567889012345667890123456789012345678
  -ATGAAACGATAC-GCGGGTAC-ATGGGAACGTCAGCCACCATTAC
   |||||||||||| |||||||| |||||||||||||||||||||||
  AATGAAACGATACGGCGGGTACAATGGGAACGTCAGCCACCATTAC

The orientation argument to some data access methods determines how reverse-strand alignments are returned to the user. .cmp.h5 files natively encode reverse strand reads in read-order, uncomplemented, with the reference reverse-complemented. Most analysis applications will want to use the data in this order, which we term the NATIVE orientation.

Some applications that involve collating or displaying the reads aligned to the reference genome want the reference to be presented in its genomic order, and the read to be reverse-complemented. We term this GENOMIC orientation.

Methods prefixed with aligned return data (bases or features) that include gaps, which are encoded according to the data type. Methods not prefixed with aligned excise the gaps.

Sequence and pulse features are not cached.

DeletionQV(aligned=True, orientation='native')
DeletionTag(aligned=True, orientation='native')
IPD(aligned=True, orientation='native')
InsertionQV(aligned=True, orientation='native')
MergeQV(aligned=True, orientation='native')
PulseWidth(aligned=True, orientation='native')
QualityValue(aligned=True, orientation='native')
ReadGroupID
SubstitutionQV(aligned=True, orientation='native')
aEnd
aStart
accuracy

Return the identity of this alignment, calculated as (#matchs / read length)

Deprecated since version 0.9.5: Use identity

alignmentArray(orientation='native')

Direct access to the raw, encoded aligment array, which is a packed representation of the aligned read and reference.

barcode

The barcode ID (integer key) for this alignment’s read Behavior undefined if file is not barcoded.

barcodeName

The barcode name (string) for this alignment’s read Behavior undefined if file is not barcoded.

clippedTo(refStart, refEnd)

Return a new CmpH5Alignment that refers to a subalignment of this alignment, as induced by clipping to reference coordinates refStart to refEnd.

Warning

This function takes time linear in the length of the alignment.

cmpH5
holeNumber
identity

Return the identity of this alignment, calculated as (#matchs / read length)

>>> c[26].identity
0.87050359712230219
isForwardStrand
isReverseStrand
mapQV
movieInfo

Deprecated since version 0.9.2: Use readGroupInfo, which is compatible with BAM usage

Returns a record (extracted from the cmph5’s movieInfoTable) containing information about the movie that the read was extracted from. This record should be accessed using dot-notation, according to the column names documented in movieInfoTable.

movieName
numPasses

(CCS only) The number subreads that were used to produce this CCS read.

pulseFeature(featureName, aligned=True, orientation='native')

Access a pulse feature by name.

qId
read(aligned=True, orientation='native')

Return the read portion of the alignment as a string.

If aligned is true, the aligned representation is returned, including gaps; otherwise the unaligned read basecalls are returned.

If orientation is “native”, the returned read bases are presented in the order they were read by the sequencing machine. If orientation is “genomic”, the returned read bases are presented in such a way as to collate with the forward strand of the reference—which requires reverse complementation of reverse-strand reads.

readGroupInfo

Returns the corresponding record from the readGroupTable.

readPositions(orientation='native')

Returns an array of read positions such that readPositions[i] = read position of the i’th column in the alignment. Insertions are grouped with the following read base, in the specified orientation.

Length of output array = length of alignment

readType
reader
reference(aligned=True, orientation='native')

Return the read portion of the alignment as a string.

If aligned is true, the aligned representation of the reference is returned, including gaps; otherwise the unaligned reference bases are returned.

If orientation is “native”, the reference is presented in the order it is stored in the cmp.h5 file—for reverse-strand reads, the reference is reverse-complemented. If orientation is “genomic”, the forward strand reference is returned.

referenceId
referenceInfo
referenceName
referencePositions(orientation='native')

Returns an array of reference positions such that referencePositions[i] = reference position of the i’th column in the alignment. Insertions are grouped with the following reference base, in the specified orientation.

Length of output array = length of alignment

rowNumber
sequencingChemistry
similarity

Replicates the pctsimilarity field from blasr, calculated as #matches/mean(aligned_length, read_length)

transcript(orientation='native', style='gusfield')

A text representation of the alignment moves (see Gusfield). This can be useful in pretty-printing an alignment.

zScore

(PacBio internal files only)

The z-score of the alignment, using a null model of a random sequence alignment.

exception pbcore.io.align.CmpH5IO.EmptyCmpH5Error

Bases: exceptions.Exception

An exception raised when CmpH5Reader tries to read from a cmp.h5 with no alignments.

pbcore.io.align.PacBioBamIndex module

class pbcore.io.align.PacBioBamIndex.PacBioBamIndex(pbiFilename)

Bases: object

The PacBio BAM index is a companion file allowing modest semantic queries on PacBio BAM files without iterating over the entire file. By convention, the PacBio BAM index has extension “bam.pbi”.

columnNames
rangeQuery(winId, winStart, winEnd)
version

Module contents