pbcore.io.align package¶
Submodules¶
pbcore.io.align.BamAlignment module¶
-
class
pbcore.io.align.BamAlignment.
BamAlignment
(bamReader, pysamAlignedRead, rowNumber=None)¶ Bases:
pbcore.io.align._AlignmentMixin.AlignmentRecordMixin
-
DeletionQV
(aligned=True, orientation='native')¶
-
DeletionTag
(aligned=True, orientation='native')¶
-
HoleNumber
¶
-
IPD
(aligned=True, orientation='native')¶
-
InsertionQV
(aligned=True, orientation='native')¶
-
MapQV
¶
-
MergeQV
(aligned=True, orientation='native')¶
-
PulseWidth
(aligned=True, orientation='native')¶
-
SubstitutionQV
(aligned=True, orientation='native')¶
-
barcode
¶
-
barcodeName
¶
-
clippedTo
(bamAln, *args, **kwargs)¶ Return a new BamAlignment that refers to a subalignment of this alignment, as induced by clipping to reference coordinates refStart to refEnd.
Warning
This function takes time linear in the length of the alignment.
-
identity
¶
-
isCCS
¶
-
isForwardStrand
¶
-
isMapped
¶
-
isReverseStrand
¶
-
isUnmapped
¶
-
mapQV
¶
-
movieName
¶
-
numPasses
¶
-
pulseFeature
(featureName, aligned=True, orientation='native')¶ Retrieve the pulse feature as indicated. - aligned : whether gaps should be inserted to reflect the alignment - orientation: “native” or “genomic”
Note that this function assumes the the feature is stored in native orientation in the file, so it is not appropriate to use this method to fetch the read or the qual, which are oriented genomically in the file.
-
qEnd
¶
-
qId
¶
-
qLen
¶
-
qName
¶
-
qStart
¶
-
queryEnd
¶
-
queryName
¶
-
queryStart
¶
-
read
(aligned=True, orientation='native')¶
-
readGroupInfo
¶
-
readPositions
(aligned=True, orientation='native')¶ Returns an array of read positions.
If aligned is True, the array has the same length as the alignment and readPositions[i] = read position of the i’th column in the oriented alignment.
If aligned is False, the array has the same length as the mapped reference segment and readPositions[i] = read position of the i’th base in the oriented reference segment.
-
readType
¶
-
reader
¶
-
reference
(bamAln, *args, **kwargs)¶
-
referenceId
¶
-
referenceInfo
¶
-
referenceName
¶
-
referencePositions
(bamAln, *args, **kwargs)¶ Returns an array of reference positions.
If aligned is True, the array has the same length as the alignment and referencePositions[i] = reference position of the i’th column in the oriented alignment.
If aligned is False, the array has the same length as the read and referencePositions[i] = reference position of the i’th base in the oriented read.
-
sequencingChemistry
¶
-
tId
¶
-
transcript
(orientation='native', style='gusfield')¶ A text representation of the alignment moves (see Gusfield). This can be useful in pretty-printing an alignment.
-
unrolledCigar
(bamAln, *args, **kwargs)¶ Run-length decode the CIGAR encoding, and orient. Clipping ops are removed.
-
zScore
¶
-
pbcore.io.align.BamIO module¶
-
class
pbcore.io.align.BamIO.
BamReader
(fname, referenceFastaFname=None)¶ Bases:
pbcore.io.align.BamIO._BamReaderBase
,pbcore.io.align._AlignmentMixin.AlignmentReaderMixin
Reader for a BAM with a bam.bai (SAMtools) index, but not a bam.pbi (PacBio) index. Supports basic BAM operations.
-
index
¶
-
readsInRange
(winId, winStart, winEnd, justIndices=False)¶
-
-
class
pbcore.io.align.BamIO.
IndexedBamReader
(fname, referenceFastaFname=None, sharedIndex=None)¶ Bases:
pbcore.io.align.BamIO._BamReaderBase
,pbcore.io.align._AlignmentMixin.IndexedAlignmentReaderMixin
A IndexedBamReader is a BAM reader class that uses the
bam.pbi
(PacBio BAM index) file to enable random access by “row number” and to provide access to precomputed semantic information about the BAM records-
atRowNumber
(rn)¶
-
index
¶
-
readsInRange
(winId, winStart, winEnd, justIndices=False)¶
-
pbcore.io.align.BlasrIO module¶
-
class
pbcore.io.align.BlasrIO.
M4Record
¶ Bases:
object
Record for alignment summary record output from BLASR -m 4 option
-
classmethod
fromString
(s)¶
-
classmethod
-
class
pbcore.io.align.BlasrIO.
M4Reader
(f)¶ Bases:
pbcore.io.base.ReaderBase
Reader for -m 4 formatted alignment summary information from BLASR
-
class
pbcore.io.align.BlasrIO.
M5Record
¶ Bases:
object
Record for alignment summary record output from BLASR -m 5 option
-
classmethod
fromString
(s)¶
-
classmethod
-
class
pbcore.io.align.BlasrIO.
M5Reader
(f)¶ Bases:
pbcore.io.base.ReaderBase
Reader for -m 5 formatted alignment summary information from BLASR
pbcore.io.align.CmpH5IO module¶
-
class
pbcore.io.align.CmpH5IO.
CmpH5Reader
(filenameOrH5File, sharedIndex=None)¶ Bases:
pbcore.io.base.ReaderBase
,pbcore.io.align._AlignmentMixin.IndexedAlignmentReaderMixin
The CmpH5Reader class is a lightweight and efficient API for accessing PacBio
cmp.h5
alignment files. Alignment records can be obtained via random access (via Python indexing/slicing), iteration, or range queries (via readsInRange).>>> import pbcore.data # For an example data file >>> from pbcore.io import CmpH5Reader >>> filename = pbcore.data.getCmpH5() >>> c = CmpH5Reader(filename) >>> c[0] CmpH5 alignment: - 1 0 290 >>> c[0:2] [CmpH5 alignment: - 1 0 290, CmpH5 alignment: + 1 0 365] >>> sum(aln.readLength for aln in c) 26103
-
ReadGroupID
¶
-
alignmentIndex
¶ Return the alignment index data structure, which is the central data structure in the cmp.h5 file, as a numpy recarray.
The dtype of the recarray is:
dtype([('AlnID', int), ('AlnGroupID', int), ('MovieID', int), ('RefGroupID', int), ('tStart', int), ('tEnd', int), ('RCRefStrand', int), ('HoleNumber', int), ('SetNumber', int), ('StrobeNumber', int), ('MoleculeID', int), ('rStart', int), ('rEnd', int), ('MapQV', int), ('nM', int), ('nMM', int), ('nIns', int), ('nDel', int), ('Offset_begin', int), ('Offset_end', int), ('nBackRead', int), ('nReadOverlap', int)])
Access to the alignment index is provided to allow users to perform vectorized computations over all alignments in the file.
>>> c.alignmentIndex.MapQV[0:10] array([254, 254, 0, 254, 254, 254, 254, 254, 254, 254], dtype=uint32)
Alignment index fields are also exposed as fields of the CmpH5Reader object, allowing a convenient shorthand.
>>> c.MapQV[0:10] array([254, 254, 0, 254, 254, 254, 254, 254, 254, 254], dtype=uint32)
The alignment index row for a given alignment can also be accessed directly as a field of a CmpH5Alignment object
>>> c[26].MapQV 254
-
barcode
¶ Returns a dict mapping of barcode name to integer barcode. Behavior undefined if file is not barcoded.
-
barcodeName
¶ Returns a dict mapping of barcode integer id to name. Behavior undefined if file is not barcoded.
-
barcodes
¶ Returns an array of barcode integer ids, of the same length as the alignment array.
Behavior undefined if file is not barcoded.
-
close
()¶
-
hasPulseFeature
(featureName)¶ Are the datasets for pulse feature featureName loaded in this file? Specifically, is it loaded for all movies within this cmp.h5?
>>> c.hasPulseFeature("InsertionQV") True >>> c.hasPulseFeature("MergeQV") False
-
holeNumber
¶
-
index
¶
-
isBarcoded
¶
-
isEmpty
¶
-
isSorted
¶
-
mapQV
¶
-
movieInfo
(movieId)¶ Deprecated since version 0.9.2: Use
readGroupInfo
, which is compatible with BAM usageAccess information about a movie whose reads are represented in the file.
The returned value is a record from the
movieInfoTable
-
movieInfoTable
¶ Deprecated since version 0.9.2: Use
readGroupTable
, which is compatible with BAM usageReturn a numpy recarray summarizing source movies for the reads in this file.
The dtype of this recarray is:
dtype([('ID', 'int'), ('Name', 'string'), ('FrameRate', 'float'), ('TimeScale', 'float')])
TimeScale is the factor to multiply time values (IPD, PulseWidth) by in order to get times in seconds. The FrameRate field should not be used directly as it will be NaN for pre-1.3 cmp.h5 files.
-
movieNames
¶
-
pulseFeaturesAvailable
()¶ What pulse features are available in this cmp.h5 file?
>>> c.pulseFeaturesAvailable() [u'QualityValue', u'IPD', u'PulseWidth', u'InsertionQV', u'DeletionQV']
-
qId
¶
-
readGroupInfo
(rgId)¶ Access information about a movie whose reads are represented in the file.
The returned value is a record from the
readGroupTable
-
readGroupTable
¶
-
readType
¶ Either “standard” or “CCS”, indicating the type of reads that were aligned to the reference.
>>> c.readType 'standard'
-
readsInRange
(refKey, refStart, refEnd, justIndices=False)¶ Get a list of reads overlapping (i.e., intersecting—not necessarily spanning) a given reference window.
If justIndices is
False
, the list returned will contain CmpH5Alignment objects.If justIndices is
True
, the list returned will contain row numbers in the alignment index table. Slicing the CmpH5Reader object with these row numbers can be used to get the corresponding CmpH5Alignment objects.The contig key can be either the
RefID
, or the short name (FASTA header up to first space).>>> c.readsInRange(1, 0, 1000) [CmpH5 alignment: - 1 0 290, CmpH5 alignment: + 1 0 365] >>> rowNumbers = c.readsInRange(1, 0, 1000, justIndices=True) >>> rowNumbers array([0, 1], dtype=uint32)
-
referenceInfo
(key)¶ Access information about a reference that was aligned against. Key can be reference ID (integer), name (“ref000001”), full name (e.g. “lambda_NEB3011”), truncated full name (full name up to the first whitespace, following the samtools convention) or MD5 sum hex string (e.g. “a1319ff90e994c8190a4fe6569d0822a”).
The returned value is a record from the :ref:referenceInfoTable .
>>> ri = c.referenceInfo("ref000001") >>> ri.FullName 'lambda_NEB3011' >>> ri.MD5 'a1319ff90e994c8190a4fe6569d0822a'
-
referenceInfoTable
¶ Return a numpy recarray summarizing the references that were aligned against.
The dtype of this recarray is:
dtype([('RefInfoID', int), ('ID', int), ('Name', string), ('FullName', string), ('Length', int), ('MD5', string), ('StartRow', int), ('EndRow', int) ])
(the last two columns are omitted for unsorted cmp.h5 files).
-
sequencingChemistry
¶
-
softwareVersion
(programName)¶ Return the version of program programName that processed this file.
-
version
¶ The CmpH5 format version string.
>>> c.version '1.2.0.SF'
-
versionAtLeast
(minimalVersion)¶ Compare the file version to minimalVersion.
>>> c.versionAtLeast("1.3.0") False
-
-
class
pbcore.io.align.CmpH5IO.
CmpH5Alignment
(cmph5, rowNumber)¶ Bases:
pbcore.io.align._AlignmentMixin.AlignmentRecordMixin
A lightweight class representing a single alignment record in a CmpH5 file, providing access to all columns of a single row of the alignment index, and on-demand access to the corresponding sequence and pulse features.
CmpH5Alignment objects are obtained by slicing a CmpH5Reader object:
>>> c[26] CmpH5 alignment: + 1 7441 7699 >>> print c[26] CmpH5 alignment: + 1 7441 7699 Read: m110818_075520_42141_c100129202555500000315043109121112_s2_p0/1009/44_322 Reference: lambda_NEB3011 Read length: 278 Concordance: 0.871 12345678901234567890123456789001223456789012345678890112345678990112344567890011 AACTGGTCACGGTCGTGGCACTGGTGAAG-CT-GCATACTGATGCACTT-CAC-GCCACGCG-GG-ATG-AACCTG-T-G ||||||| |||| ||||||||| |||| || ||||||||| ||||| ||| |||||||| || ||| |||||| | | AACTGGT--CGGT--TGGCACTGG-GAAGCCTTGCATACTGA-GCACT-CCACGGCCACGCGGGGAATGAAACCTGGTGG 23456789012345678900123456678901234456789012345678901234566789011234556789012345 GCATTTGTGCTGCCGGGA-ACGGCG-TTTCGTGT-CTCTGCCGGTGTGGCAGCCGAA-ATGAC-AGAG-CGCGGCCTGGC |||||||||||||||||| |||||| |||||| | |||||||||||||||||||||| ||||| |||| ||||||||||| GCATTTGTGCTGCCGGGAAACGGCGTTTTCGT-TCCTCTGCCGGTGTGGCAGCCGAAAATGACAAGAGCCGCGGCCTGGC 67899012345677890123456789901123456789901233456789012345678901234556778901223455 CAG-AATGCAAT-AACGGGAGGCGC-TG-TGGCTGAT-TTCG-ATAACCTGTTCGATGCTGCCAT-TG-CCCGC-GCC-G ||| |||||||| |||||||||||| || |||||||| |||| |||||||||||||||||||||| || ||||| ||| | CAGAAATGCAATAAACGGGAGGCGCTTGCTGGCTGATTTTCGAATAACCTGTTCGATGCTGCCATTTGACCCGCGGCCGG 6678901234567889012345667890123456789012345678 -ATGAAACGATAC-GCGGGTAC-ATGGGAACGTCAGCCACCATTAC |||||||||||| |||||||| ||||||||||||||||||||||| AATGAAACGATACGGCGGGTACAATGGGAACGTCAGCCACCATTAC
The orientation argument to some data access methods determines how reverse-strand alignments are returned to the user. .cmp.h5 files natively encode reverse strand reads in read-order, uncomplemented, with the reference reverse-complemented. Most analysis applications will want to use the data in this order, which we term the NATIVE orientation.
Some applications that involve collating or displaying the reads aligned to the reference genome want the reference to be presented in its genomic order, and the read to be reverse-complemented. We term this GENOMIC orientation.
Methods prefixed with aligned return data (bases or features) that include gaps, which are encoded according to the data type. Methods not prefixed with aligned excise the gaps.
Sequence and pulse features are not cached.
-
DeletionQV
(aligned=True, orientation='native')¶
-
DeletionTag
(aligned=True, orientation='native')¶
-
IPD
(aligned=True, orientation='native')¶
-
InsertionQV
(aligned=True, orientation='native')¶
-
MergeQV
(aligned=True, orientation='native')¶
-
PulseWidth
(aligned=True, orientation='native')¶
-
QualityValue
(aligned=True, orientation='native')¶
-
ReadGroupID
¶
-
SubstitutionQV
(aligned=True, orientation='native')¶
-
aEnd
¶
-
aStart
¶
-
accuracy
¶ Return the identity of this alignment, calculated as (#matchs / read length)
Deprecated since version 0.9.5: Use
identity
-
alignmentArray
(orientation='native')¶ Direct access to the raw, encoded aligment array, which is a packed representation of the aligned read and reference.
-
barcode
¶ The barcode ID (integer key) for this alignment’s read Behavior undefined if file is not barcoded.
-
barcodeName
¶ The barcode name (string) for this alignment’s read Behavior undefined if file is not barcoded.
-
clippedTo
(refStart, refEnd)¶ Return a new CmpH5Alignment that refers to a subalignment of this alignment, as induced by clipping to reference coordinates refStart to refEnd.
Warning
This function takes time linear in the length of the alignment.
-
cmpH5
¶
-
holeNumber
¶
-
identity
¶ Return the identity of this alignment, calculated as (#matchs / read length)
>>> c[26].identity 0.87050359712230219
-
isForwardStrand
¶
-
isReverseStrand
¶
-
mapQV
¶
-
movieInfo
¶ Deprecated since version 0.9.2: Use
readGroupInfo
, which is compatible with BAM usageReturns a record (extracted from the cmph5’s movieInfoTable) containing information about the movie that the read was extracted from. This record should be accessed using dot-notation, according to the column names documented in movieInfoTable.
-
movieName
¶
-
numPasses
¶ (CCS only) The number subreads that were used to produce this CCS read.
-
pulseFeature
(featureName, aligned=True, orientation='native')¶ Access a pulse feature by name.
-
qId
¶
-
read
(aligned=True, orientation='native')¶ Return the read portion of the alignment as a string.
If aligned is true, the aligned representation is returned, including gaps; otherwise the unaligned read basecalls are returned.
If orientation is “native”, the returned read bases are presented in the order they were read by the sequencing machine. If orientation is “genomic”, the returned read bases are presented in such a way as to collate with the forward strand of the reference—which requires reverse complementation of reverse-strand reads.
-
readGroupInfo
¶ Returns the corresponding record from the readGroupTable.
-
readPositions
(orientation='native')¶ Returns an array of read positions such that readPositions[i] = read position of the i’th column in the alignment. Insertions are grouped with the following read base, in the specified orientation.
Length of output array = length of alignment
-
readType
¶
-
reader
¶
-
reference
(aligned=True, orientation='native')¶ Return the read portion of the alignment as a string.
If aligned is true, the aligned representation of the reference is returned, including gaps; otherwise the unaligned reference bases are returned.
If orientation is “native”, the reference is presented in the order it is stored in the cmp.h5 file—for reverse-strand reads, the reference is reverse-complemented. If orientation is “genomic”, the forward strand reference is returned.
-
referenceId
¶
-
referenceInfo
¶
-
referenceName
¶
-
referencePositions
(orientation='native')¶ Returns an array of reference positions such that referencePositions[i] = reference position of the i’th column in the alignment. Insertions are grouped with the following reference base, in the specified orientation.
Length of output array = length of alignment
-
rowNumber
¶
-
sequencingChemistry
¶
-
similarity
¶ Replicates the pctsimilarity field from blasr, calculated as #matches/mean(aligned_length, read_length)
-
transcript
(orientation='native', style='gusfield')¶ A text representation of the alignment moves (see Gusfield). This can be useful in pretty-printing an alignment.
-
zScore
¶ (PacBio internal files only)
The z-score of the alignment, using a null model of a random sequence alignment.
-
-
exception
pbcore.io.align.CmpH5IO.
EmptyCmpH5Error
¶ Bases:
exceptions.Exception
An exception raised when CmpH5Reader tries to read from a cmp.h5 with no alignments.
pbcore.io.align.PacBioBamIndex module¶
-
class
pbcore.io.align.PacBioBamIndex.
PacBioBamIndex
(pbiFilename)¶ Bases:
object
The PacBio BAM index is a companion file allowing modest semantic queries on PacBio BAM files without iterating over the entire file. By convention, the PacBio BAM index has extension “bam.pbi”.
-
columnNames
¶
-
rangeQuery
(winId, winStart, winEnd)¶
-
version
¶
-