pbcore.io¶
The pbcore.io
package provides a number of lightweight interfaces
to PacBio data files and other standard bioinformatics file formats.
Preferred usage is to import classes directly from the pbcore.io
package, e.g.:
>>> from pbcore.io import CmpH5Reader
The classes within pbcore.io
adhere to a few conventions, in order
to provide a uniform API:
Each data file type is thought of as a container of a Record type; all Reader classes support streaming access, and CmpH5Reader and BasH5Reader additionally provide random-access to alignments/reads.
The constructor argument needed to instantiate Reader and Writer objects can be either a filename (which can be suffixed by ”.gz” for all but the h5 file types) or an open file handle. The reader/writer classes will do what you would expect.
The reader/writer classes all support the context manager idiom. Meaning, if you write:
>>> with CmpH5Reader("aligned_reads.cmp.h5") as r: ... print r[0].read()the CmpH5Reader object will be automatically closed after the block within the “with” statement is executed.
BAM/cmp.h5 compatibility: quick start¶
If you have an application that uses the CmpH5Reader and you want to start using BAM files, your best bet is to use the following generic factory functions:
-
pbcore.io.
openIndexedAlignmentFile
(fname, referenceFastaFname=None, sharedIndex=None)¶ Factory function to get a handle to a reader for an alignment file (cmp.h5 or BAM), requiring index capability (built-in for cmp.h5; requires bam.pbi index for BAM
The reference FASTA, if provided, must have a FASTA index (fasta.fai).
-
pbcore.io.
openAlignmentFile
(fname, referenceFastaFname=None, sharedIndex=None)¶ Factory function to get a handle to a reader for an alignment file (cmp.h5 or BAM), not requiring index capability
(A sharedIndex can still be passed for opening a cmp.h5, for which the index is compulsory.)
Note
Since BAM files contain a subset of the information that was present in cmp.h5 files, you will need to provide these functions an indexed FASTA file for your reference. For full compatibility, you need the openIndexedAlignmentFile function, which requires the existence of a bam.pbi file (PacBio BAM index companion file).
bas.h5 / bax.h5 Formats (PacBio basecalls file)¶
The bas.h5/ bax.h5 file formats are container formats for PacBio
reads, built on top of the HDF5 standard. Originally there was just
one bas.h5, but eventually “multistreaming” came along and we had to
split the file into three bax.h5 parts and one bas.h5 file
containing pointers to the parts. Use BasH5Reader
to read any
kind of bas.h5 file, and BaxH5Reader
to read a bax.h5.
Note
In contrast to GFF, for example, the bas.h5 read coordinate system is 0-based and start-inclusive/end-exclusive, i.e. the same convention as Python and the C++ STL.
-
class
pbcore.io.
BasH5Reader
(*args)¶ The BasH5Reader provides access to the basecall and pulse metric data encoded in PacBio bas.h5 files. To access data using a BasH5Reader, the standard idiom is:
Index into the BasH5Reader using the ZMW hole number to get a Zmw object:
>>> b <BasH5Reader: m110818_075520_42141_c100129202555500000315043109121112_s1_p0> >>> zmw8 = b[8] >>> zmw8 <Zmw: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8>
Extract ZmwRead objects from the Zmw object by:
Using the .subreads property to extract the subreads, which are the subintervals of the raw read corresponding to the SMRTbell insert:
>>> subreads = zmw8.subreads >>> print subreads [<ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/3381_3881>, <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/3924_4398>, <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/4445_4873>, <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/4920_5354>, <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/5413_5495>]
For CCS bas files, using the .ccsRead property to extract the CCS (consensus) read, which is a consensus sequence precomputed from the subreads. Older bas files, from when CCS was computed on the instrument, may contain both CCS- and sub- reads.
>>> zmw8.ccsRead <CCSZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/ccs>
Use the .read() method to get the full raw read, or .read(start, end) to extract a custom subinterval.
>>> zmw8.read() <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/3381_5495> >>> zmw8.read(3390, 3400) <ZmwRead: m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/3390_3400>
With a ZmwRead object in hand, extract the desired basecalls and pulse metrics:
>>> subreads[0].readName "m110818_075520_42141_c100129202555500000315043109121112_s1_p0/8/3381_3881" >>> subreads[0].basecalls() "AGCCCCGTCGAGAACATACAGGTGGCCAATTTCACAGCCTCTTGCCTGGGCGATCCCGAACATCGCACCGGA..." >>> subreads[0].InsertionQV() array([12, 12, 10, 2, 7, 14, 13, 18, 15, 16, 16, 15, 10, 12, 3, 14, ...])
Note that not every ZMW on a chip produces usable sequencing data. The BasH5Reader has a property sequencingZmws is a list of the hole numbers where usable sequence was recorded. Iteration over the BasH5Reader object allows you to iterate over the Zmw objects providing usable sequence.
-
class
CCSZmwRead
(baxH5, holeNumber, readStart, readEnd)¶ Class providing access to the CCS (circular consensus sequencing) data calculated for a ZMW.
-
readName
¶
-
-
class
BasH5Reader.
Zmw
(baxH5, holeNumber)¶ A Zmw represents all data from a ZMW (zero-mode waveguide) hole within a bas.h5 movie file. Accessor methods provide convenient access to the read (or subreads), and to the region table entries for this hole.
-
adapterRegions
¶ Get adapter regions as intervals, performing clipping to the HQ region
-
adapterRegionsNoQC
¶ Get adapter regions as intervals, without clipping to the HQ region. Don’t use this unless you know what you’re doing.
-
adapters
¶ Get the adapter hits as a list of ZmwRead objects. Restricts focus, and clips to, the HQ region. This method can be used by production code.
-
adaptersNoQC
¶ Get the adapters, including data beyond the bounds of the HQ region.
Warning
It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.
-
baxH5
¶
-
ccsRead
¶
-
holeNumber
¶
-
hqRegion
¶ Return the HQ region interval.
The HQ region is an interval of basecalls where the basecaller has inferred that a single sequencing reaction is taking place. Secondary analysis should only use subreads within the HQ region. Methods in this class, with the exception of the “NoQC” methods, return data appropriately clipped/filtered to the HQ region.
-
hqRegionSnr
¶ Return the SNRs, as a vector by channel.
-
index
¶
-
insertRegions
¶ Get insert regions as intervals, clipped to the HQ region
-
insertRegionsNoQC
¶ Get insert regions as intervals, without clipping to the HQ region. Don’t use this unless you know what you’re doing.
-
listZmwMetrics
()¶ List the available ZMW metrics for this bax.h5 file.
-
numPasses
¶ Return the number of passes (forward + back) across the SMRTbell insert, used to forming the CCS consensus.
-
productivity
¶ Return the ‘productivity’ of this ZMW, which is the estimated number of polymerase reactions taking place within it. For example, a doubly-loaded ZMW would have productivity 2.
-
read
(readStart=None, readEnd=None)¶ Given no arguments, returns the entire (HQ-clipped) polymerase read. With readStart, readEnd arguments, returns the specified extent of the polymerase read.
-
readNoQC
(readStart=None, readEnd=None)¶ Given no arguments, returns the entire polymerase read, not HQ-clipped. With readStart, readEnd arguments, returns the specified extent of the polymerase read.
Warning
It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.
-
readScore
¶ Return the “read score”, a prediction of the accuracy (between 0 and 1) of the basecalls from this ZMW, from the ReadScore dataset in the file
-
regionTable
¶
-
subreads
¶ Get the subreads as a list of ZmwRead objects. Restricts focus, and clips to, the HQ region. This method can be used by production code.
-
subreadsNoQC
¶ Get the subreads, including data beyond the bounds of the HQ region.
Warning
It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.
-
zmwMetric
(name)¶ Return the value of metric ‘name’ from the ZMW metrics.
-
zmwName
¶
-
-
class
BasH5Reader.
ZmwRead
(baxH5, holeNumber, readStart, readEnd)¶ A ZmwRead represents the data features (basecalls as well as pulse features) recorded from the ZMW, delimited by readStart and readEnd.
-
DeletionQV
()¶
-
DeletionTag
()¶
-
IPD
()¶
-
InsertionQV
()¶
-
MergeQV
()¶
-
PreBaseFrames
()¶
-
PulseWidth
()¶
-
QualityValue
()¶
-
SubstitutionQV
()¶
-
SubstitutionTag
()¶
-
WidthInFrames
()¶
-
basecalls
()¶
-
baxH5
¶
-
holeNumber
¶
-
offsetBegin
¶
-
offsetEnd
¶
-
qv
(qvName)¶
-
readEnd
¶
-
readName
¶
-
readStart
¶
-
zmw
¶
-
-
BasH5Reader.
allSequencingZmws
¶
-
BasH5Reader.
ccsReads
()¶
-
BasH5Reader.
chemistryBarcodeTriple
¶
-
BasH5Reader.
close
()¶
-
BasH5Reader.
hasConsensusBasecalls
¶
-
BasH5Reader.
hasRawBasecalls
¶
-
BasH5Reader.
movieName
¶
-
BasH5Reader.
parts
¶
-
BasH5Reader.
reads
()¶
-
BasH5Reader.
sequencingChemistry
¶
-
BasH5Reader.
sequencingZmws
¶
-
BasH5Reader.
subreads
()¶
-
class
pbcore.io.BasH5IO.
Zmw
(baxH5, holeNumber)¶ A Zmw represents all data from a ZMW (zero-mode waveguide) hole within a bas.h5 movie file. Accessor methods provide convenient access to the read (or subreads), and to the region table entries for this hole.
-
adapterRegions
¶ Get adapter regions as intervals, performing clipping to the HQ region
-
adapterRegionsNoQC
¶ Get adapter regions as intervals, without clipping to the HQ region. Don’t use this unless you know what you’re doing.
-
adapters
¶ Get the adapter hits as a list of ZmwRead objects. Restricts focus, and clips to, the HQ region. This method can be used by production code.
-
adaptersNoQC
¶ Get the adapters, including data beyond the bounds of the HQ region.
Warning
It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.
-
baxH5
¶
-
ccsRead
¶
-
holeNumber
¶
-
hqRegion
¶ Return the HQ region interval.
The HQ region is an interval of basecalls where the basecaller has inferred that a single sequencing reaction is taking place. Secondary analysis should only use subreads within the HQ region. Methods in this class, with the exception of the “NoQC” methods, return data appropriately clipped/filtered to the HQ region.
-
hqRegionSnr
¶ Return the SNRs, as a vector by channel.
-
index
¶
-
insertRegions
¶ Get insert regions as intervals, clipped to the HQ region
-
insertRegionsNoQC
¶ Get insert regions as intervals, without clipping to the HQ region. Don’t use this unless you know what you’re doing.
-
listZmwMetrics
()¶ List the available ZMW metrics for this bax.h5 file.
-
numPasses
¶ Return the number of passes (forward + back) across the SMRTbell insert, used to forming the CCS consensus.
-
productivity
¶ Return the ‘productivity’ of this ZMW, which is the estimated number of polymerase reactions taking place within it. For example, a doubly-loaded ZMW would have productivity 2.
-
read
(readStart=None, readEnd=None)¶ Given no arguments, returns the entire (HQ-clipped) polymerase read. With readStart, readEnd arguments, returns the specified extent of the polymerase read.
-
readNoQC
(readStart=None, readEnd=None)¶ Given no arguments, returns the entire polymerase read, not HQ-clipped. With readStart, readEnd arguments, returns the specified extent of the polymerase read.
Warning
It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.
-
readScore
¶ Return the “read score”, a prediction of the accuracy (between 0 and 1) of the basecalls from this ZMW, from the ReadScore dataset in the file
-
regionTable
¶
-
subreads
¶ Get the subreads as a list of ZmwRead objects. Restricts focus, and clips to, the HQ region. This method can be used by production code.
-
subreadsNoQC
¶ Get the subreads, including data beyond the bounds of the HQ region.
Warning
It is not recommended that production code use this method as we make no guarantees about what happens outside of the HQ region.
-
zmwMetric
(name)¶ Return the value of metric ‘name’ from the ZMW metrics.
-
zmwName
¶
-
-
class
pbcore.io.BasH5IO.
ZmwRead
(baxH5, holeNumber, readStart, readEnd)¶ A ZmwRead represents the data features (basecalls as well as pulse features) recorded from the ZMW, delimited by readStart and readEnd.
-
DeletionQV
()¶
-
DeletionTag
()¶
-
IPD
()¶
-
InsertionQV
()¶
-
MergeQV
()¶
-
PreBaseFrames
()¶
-
PulseWidth
()¶
-
QualityValue
()¶
-
SubstitutionQV
()¶
-
SubstitutionTag
()¶
-
WidthInFrames
()¶
-
basecalls
()¶
-
baxH5
¶
-
holeNumber
¶
-
offsetBegin
¶
-
offsetEnd
¶
-
qv
(qvName)¶
-
readEnd
¶
-
readName
¶
-
readStart
¶
-
zmw
¶
-
BAM format¶
The BAM format is a standard format described aligned and unaligned
reads. PacBio is transitioning from the cmp.h5 format to the BAM
format. For basic functionality, one should use BamReader
;
for full compatibility with the CmpH5Reader
API (including
alignment index functionality) one should use
IndexedBamReader
, which requires the auxiliary PacBio BAM
index file (bam.pbi
file).
-
class
pbcore.io.
BamAlignment
(bamReader, pysamAlignedRead, rowNumber=None)¶ -
DeletionQV
(aligned=True, orientation='native')¶
-
DeletionTag
(aligned=True, orientation='native')¶
-
HoleNumber
¶
-
IPD
(aligned=True, orientation='native')¶
-
InsertionQV
(aligned=True, orientation='native')¶
-
MapQV
¶
-
MergeQV
(aligned=True, orientation='native')¶
-
PulseWidth
(aligned=True, orientation='native')¶
-
SubstitutionQV
(aligned=True, orientation='native')¶
-
barcode
¶
-
barcodeName
¶
-
clippedTo
(bamAln, *args, **kwargs)¶ Return a new BamAlignment that refers to a subalignment of this alignment, as induced by clipping to reference coordinates refStart to refEnd.
Warning
This function takes time linear in the length of the alignment.
-
identity
¶
-
isCCS
¶
-
isForwardStrand
¶
-
isMapped
¶
-
isReverseStrand
¶
-
isUnmapped
¶
-
mapQV
¶
-
movieName
¶
-
numPasses
¶
-
pulseFeature
(featureName, aligned=True, orientation='native')¶ Retrieve the pulse feature as indicated. - aligned : whether gaps should be inserted to reflect the alignment - orientation: “native” or “genomic”
Note that this function assumes the the feature is stored in native orientation in the file, so it is not appropriate to use this method to fetch the read or the qual, which are oriented genomically in the file.
-
qEnd
¶
-
qId
¶
-
qLen
¶
-
qName
¶
-
qStart
¶
-
queryEnd
¶
-
queryName
¶
-
queryStart
¶
-
read
(aligned=True, orientation='native')¶
-
readGroupInfo
¶
-
readPositions
(aligned=True, orientation='native')¶ Returns an array of read positions.
If aligned is True, the array has the same length as the alignment and readPositions[i] = read position of the i’th column in the oriented alignment.
If aligned is False, the array has the same length as the mapped reference segment and readPositions[i] = read position of the i’th base in the oriented reference segment.
-
readType
¶
-
reader
¶
-
reference
(bamAln, *args, **kwargs)¶
-
referenceId
¶
-
referenceInfo
¶
-
referenceName
¶
-
referencePositions
(bamAln, *args, **kwargs)¶ Returns an array of reference positions.
If aligned is True, the array has the same length as the alignment and referencePositions[i] = reference position of the i’th column in the oriented alignment.
If aligned is False, the array has the same length as the read and referencePositions[i] = reference position of the i’th base in the oriented read.
-
sequencingChemistry
¶
-
tId
¶
-
transcript
(orientation='native', style='gusfield')¶ A text representation of the alignment moves (see Gusfield). This can be useful in pretty-printing an alignment.
-
unrolledCigar
(bamAln, *args, **kwargs)¶ Run-length decode the CIGAR encoding, and orient. Clipping ops are removed.
-
zScore
¶
-
-
class
pbcore.io.
BamReader
(fname, referenceFastaFname=None)¶ Reader for a BAM with a bam.bai (SAMtools) index, but not a bam.pbi (PacBio) index. Supports basic BAM operations.
-
index
¶
-
readsInRange
(winId, winStart, winEnd, justIndices=False)¶
-
-
class
pbcore.io.
IndexedBamReader
(fname, referenceFastaFname=None, sharedIndex=None)¶ A IndexedBamReader is a BAM reader class that uses the
bam.pbi
(PacBio BAM index) file to enable random access by “row number” and to provide access to precomputed semantic information about the BAM records-
atRowNumber
(rn)¶
-
index
¶
-
readsInRange
(winId, winStart, winEnd, justIndices=False)¶
-
cmp.h5 format (legacy PacBio alignment file)¶
The cmp.h5 file format is an alignment format built on top of the HDF5 standard. It is a simple container format for PacBio alignment records.
Note
In contrast to GFF, for example, all cmp.h5 coordinate systems (refererence, read) are 0-based and start-inclusive/end-exclusive, i.e. the same convention as Python and the C++ STL.
-
class
pbcore.io.
CmpH5Reader
(filenameOrH5File, sharedIndex=None)¶ The CmpH5Reader class is a lightweight and efficient API for accessing PacBio
cmp.h5
alignment files. Alignment records can be obtained via random access (via Python indexing/slicing), iteration, or range queries (via readsInRange).>>> import pbcore.data # For an example data file >>> from pbcore.io import CmpH5Reader >>> filename = pbcore.data.getCmpH5() >>> c = CmpH5Reader(filename) >>> c[0] CmpH5 alignment: - 1 0 290 >>> c[0:2] [CmpH5 alignment: - 1 0 290, CmpH5 alignment: + 1 0 365] >>> sum(aln.readLength for aln in c) 26103
-
ReadGroupID
¶
-
alignmentIndex
¶ Return the alignment index data structure, which is the central data structure in the cmp.h5 file, as a numpy recarray.
The dtype of the recarray is:
dtype([('AlnID', int), ('AlnGroupID', int), ('MovieID', int), ('RefGroupID', int), ('tStart', int), ('tEnd', int), ('RCRefStrand', int), ('HoleNumber', int), ('SetNumber', int), ('StrobeNumber', int), ('MoleculeID', int), ('rStart', int), ('rEnd', int), ('MapQV', int), ('nM', int), ('nMM', int), ('nIns', int), ('nDel', int), ('Offset_begin', int), ('Offset_end', int), ('nBackRead', int), ('nReadOverlap', int)])
Access to the alignment index is provided to allow users to perform vectorized computations over all alignments in the file.
>>> c.alignmentIndex.MapQV[0:10] array([254, 254, 0, 254, 254, 254, 254, 254, 254, 254], dtype=uint32)
Alignment index fields are also exposed as fields of the CmpH5Reader object, allowing a convenient shorthand.
>>> c.MapQV[0:10] array([254, 254, 0, 254, 254, 254, 254, 254, 254, 254], dtype=uint32)
The alignment index row for a given alignment can also be accessed directly as a field of a CmpH5Alignment object
>>> c[26].MapQV 254
-
barcode
¶ Returns a dict mapping of barcode name to integer barcode. Behavior undefined if file is not barcoded.
-
barcodeName
¶ Returns a dict mapping of barcode integer id to name. Behavior undefined if file is not barcoded.
-
barcodes
¶ Returns an array of barcode integer ids, of the same length as the alignment array.
Behavior undefined if file is not barcoded.
-
close
()¶
-
hasPulseFeature
(featureName)¶ Are the datasets for pulse feature featureName loaded in this file? Specifically, is it loaded for all movies within this cmp.h5?
>>> c.hasPulseFeature("InsertionQV") True >>> c.hasPulseFeature("MergeQV") False
-
holeNumber
¶
-
index
¶
-
isBarcoded
¶
-
isEmpty
¶
-
isSorted
¶
-
mapQV
¶
-
movieInfo
(movieId)¶ Deprecated since version 0.9.2: Use
readGroupInfo
, which is compatible with BAM usageAccess information about a movie whose reads are represented in the file.
The returned value is a record from the
movieInfoTable
-
movieInfoTable
¶ Deprecated since version 0.9.2: Use
readGroupTable
, which is compatible with BAM usageReturn a numpy recarray summarizing source movies for the reads in this file.
The dtype of this recarray is:
dtype([('ID', 'int'), ('Name', 'string'), ('FrameRate', 'float'), ('TimeScale', 'float')])
TimeScale is the factor to multiply time values (IPD, PulseWidth) by in order to get times in seconds. The FrameRate field should not be used directly as it will be NaN for pre-1.3 cmp.h5 files.
-
movieNames
¶
-
pulseFeaturesAvailable
()¶ What pulse features are available in this cmp.h5 file?
>>> c.pulseFeaturesAvailable() [u'QualityValue', u'IPD', u'PulseWidth', u'InsertionQV', u'DeletionQV']
-
qId
¶
-
readGroupInfo
(rgId)¶ Access information about a movie whose reads are represented in the file.
The returned value is a record from the
readGroupTable
-
readGroupTable
¶
-
readType
¶ Either “standard” or “CCS”, indicating the type of reads that were aligned to the reference.
>>> c.readType 'standard'
-
readsInRange
(refKey, refStart, refEnd, justIndices=False)¶ Get a list of reads overlapping (i.e., intersecting—not necessarily spanning) a given reference window.
If justIndices is
False
, the list returned will contain CmpH5Alignment objects.If justIndices is
True
, the list returned will contain row numbers in the alignment index table. Slicing the CmpH5Reader object with these row numbers can be used to get the corresponding CmpH5Alignment objects.The contig key can be either the
RefID
, or the short name (FASTA header up to first space).>>> c.readsInRange(1, 0, 1000) [CmpH5 alignment: - 1 0 290, CmpH5 alignment: + 1 0 365] >>> rowNumbers = c.readsInRange(1, 0, 1000, justIndices=True) >>> rowNumbers array([0, 1], dtype=uint32)
-
referenceInfo
(key)¶ Access information about a reference that was aligned against. Key can be reference ID (integer), name (“ref000001”), full name (e.g. “lambda_NEB3011”), truncated full name (full name up to the first whitespace, following the samtools convention) or MD5 sum hex string (e.g. “a1319ff90e994c8190a4fe6569d0822a”).
The returned value is a record from the :ref:referenceInfoTable .
>>> ri = c.referenceInfo("ref000001") >>> ri.FullName 'lambda_NEB3011' >>> ri.MD5 'a1319ff90e994c8190a4fe6569d0822a'
-
referenceInfoTable
¶ Return a numpy recarray summarizing the references that were aligned against.
The dtype of this recarray is:
dtype([('RefInfoID', int), ('ID', int), ('Name', string), ('FullName', string), ('Length', int), ('MD5', string), ('StartRow', int), ('EndRow', int) ])
(the last two columns are omitted for unsorted cmp.h5 files).
-
sequencingChemistry
¶
-
softwareVersion
(programName)¶ Return the version of program programName that processed this file.
-
version
¶ The CmpH5 format version string.
>>> c.version '1.2.0.SF'
-
versionAtLeast
(minimalVersion)¶ Compare the file version to minimalVersion.
>>> c.versionAtLeast("1.3.0") False
-
-
class
pbcore.io.
CmpH5Alignment
(cmph5, rowNumber)¶ A lightweight class representing a single alignment record in a CmpH5 file, providing access to all columns of a single row of the alignment index, and on-demand access to the corresponding sequence and pulse features.
CmpH5Alignment objects are obtained by slicing a CmpH5Reader object:
>>> c[26] CmpH5 alignment: + 1 7441 7699 >>> print c[26] CmpH5 alignment: + 1 7441 7699 Read: m110818_075520_42141_c100129202555500000315043109121112_s2_p0/1009/44_322 Reference: lambda_NEB3011 Read length: 278 Concordance: 0.871 12345678901234567890123456789001223456789012345678890112345678990112344567890011 AACTGGTCACGGTCGTGGCACTGGTGAAG-CT-GCATACTGATGCACTT-CAC-GCCACGCG-GG-ATG-AACCTG-T-G ||||||| |||| ||||||||| |||| || ||||||||| ||||| ||| |||||||| || ||| |||||| | | AACTGGT--CGGT--TGGCACTGG-GAAGCCTTGCATACTGA-GCACT-CCACGGCCACGCGGGGAATGAAACCTGGTGG 23456789012345678900123456678901234456789012345678901234566789011234556789012345 GCATTTGTGCTGCCGGGA-ACGGCG-TTTCGTGT-CTCTGCCGGTGTGGCAGCCGAA-ATGAC-AGAG-CGCGGCCTGGC |||||||||||||||||| |||||| |||||| | |||||||||||||||||||||| ||||| |||| ||||||||||| GCATTTGTGCTGCCGGGAAACGGCGTTTTCGT-TCCTCTGCCGGTGTGGCAGCCGAAAATGACAAGAGCCGCGGCCTGGC 67899012345677890123456789901123456789901233456789012345678901234556778901223455 CAG-AATGCAAT-AACGGGAGGCGC-TG-TGGCTGAT-TTCG-ATAACCTGTTCGATGCTGCCAT-TG-CCCGC-GCC-G ||| |||||||| |||||||||||| || |||||||| |||| |||||||||||||||||||||| || ||||| ||| | CAGAAATGCAATAAACGGGAGGCGCTTGCTGGCTGATTTTCGAATAACCTGTTCGATGCTGCCATTTGACCCGCGGCCGG 6678901234567889012345667890123456789012345678 -ATGAAACGATAC-GCGGGTAC-ATGGGAACGTCAGCCACCATTAC |||||||||||| |||||||| ||||||||||||||||||||||| AATGAAACGATACGGCGGGTACAATGGGAACGTCAGCCACCATTAC
The orientation argument to some data access methods determines how reverse-strand alignments are returned to the user. .cmp.h5 files natively encode reverse strand reads in read-order, uncomplemented, with the reference reverse-complemented. Most analysis applications will want to use the data in this order, which we term the NATIVE orientation.
Some applications that involve collating or displaying the reads aligned to the reference genome want the reference to be presented in its genomic order, and the read to be reverse-complemented. We term this GENOMIC orientation.
Methods prefixed with aligned return data (bases or features) that include gaps, which are encoded according to the data type. Methods not prefixed with aligned excise the gaps.
Sequence and pulse features are not cached.
-
DeletionQV
(aligned=True, orientation='native')¶
-
DeletionTag
(aligned=True, orientation='native')¶
-
IPD
(aligned=True, orientation='native')¶
-
InsertionQV
(aligned=True, orientation='native')¶
-
MergeQV
(aligned=True, orientation='native')¶
-
PulseWidth
(aligned=True, orientation='native')¶
-
QualityValue
(aligned=True, orientation='native')¶
-
ReadGroupID
¶
-
SubstitutionQV
(aligned=True, orientation='native')¶
-
aEnd
¶
-
aStart
¶
-
accuracy
¶ Return the identity of this alignment, calculated as (#matchs / read length)
Deprecated since version 0.9.5: Use
identity
-
alignmentArray
(orientation='native')¶ Direct access to the raw, encoded aligment array, which is a packed representation of the aligned read and reference.
-
barcode
¶ The barcode ID (integer key) for this alignment’s read Behavior undefined if file is not barcoded.
-
barcodeName
¶ The barcode name (string) for this alignment’s read Behavior undefined if file is not barcoded.
-
clippedTo
(refStart, refEnd)¶ Return a new CmpH5Alignment that refers to a subalignment of this alignment, as induced by clipping to reference coordinates refStart to refEnd.
Warning
This function takes time linear in the length of the alignment.
-
cmpH5
¶
-
holeNumber
¶
-
identity
¶ Return the identity of this alignment, calculated as (#matchs / read length)
>>> c[26].identity 0.87050359712230219
-
isForwardStrand
¶
-
isReverseStrand
¶
-
mapQV
¶
-
movieInfo
¶ Deprecated since version 0.9.2: Use
readGroupInfo
, which is compatible with BAM usageReturns a record (extracted from the cmph5’s movieInfoTable) containing information about the movie that the read was extracted from. This record should be accessed using dot-notation, according to the column names documented in movieInfoTable.
-
movieName
¶
-
numPasses
¶ (CCS only) The number subreads that were used to produce this CCS read.
-
pulseFeature
(featureName, aligned=True, orientation='native')¶ Access a pulse feature by name.
-
qId
¶
-
read
(aligned=True, orientation='native')¶ Return the read portion of the alignment as a string.
If aligned is true, the aligned representation is returned, including gaps; otherwise the unaligned read basecalls are returned.
If orientation is “native”, the returned read bases are presented in the order they were read by the sequencing machine. If orientation is “genomic”, the returned read bases are presented in such a way as to collate with the forward strand of the reference—which requires reverse complementation of reverse-strand reads.
-
readGroupInfo
¶ Returns the corresponding record from the readGroupTable.
-
readPositions
(orientation='native')¶ Returns an array of read positions such that readPositions[i] = read position of the i’th column in the alignment. Insertions are grouped with the following read base, in the specified orientation.
Length of output array = length of alignment
-
readType
¶
-
reader
¶
-
reference
(aligned=True, orientation='native')¶ Return the read portion of the alignment as a string.
If aligned is true, the aligned representation of the reference is returned, including gaps; otherwise the unaligned reference bases are returned.
If orientation is “native”, the reference is presented in the order it is stored in the cmp.h5 file—for reverse-strand reads, the reference is reverse-complemented. If orientation is “genomic”, the forward strand reference is returned.
-
referenceId
¶
-
referenceInfo
¶
-
referenceName
¶
-
referencePositions
(orientation='native')¶ Returns an array of reference positions such that referencePositions[i] = reference position of the i’th column in the alignment. Insertions are grouped with the following reference base, in the specified orientation.
Length of output array = length of alignment
-
rowNumber
¶
-
sequencingChemistry
¶
-
similarity
¶ Replicates the pctsimilarity field from blasr, calculated as #matches/mean(aligned_length, read_length)
-
transcript
(orientation='native', style='gusfield')¶ A text representation of the alignment moves (see Gusfield). This can be useful in pretty-printing an alignment.
-
zScore
¶ (PacBio internal files only)
The z-score of the alignment, using a null model of a random sequence alignment.
-
FASTA Format¶
FASTA is a standard format for sequence data. We recommmend using the FastaTable class, which provides random access to indexed FASTA files (using the conventional SAMtools “fai” index).
-
pbcore.io.
FastaTable
¶ alias of
IndexedFastaReader
-
class
pbcore.io.
FastaRecord
(header, sequence)¶ A FastaRecord object models a named sequence in a FASTA file.
-
comment
¶ The comment associated with the sequence in the FASTA file, equal to the contents of the FASTA header following the first whitespace
-
classmethod
fromString
(s)¶ Interprets a string as a FASTA record. Does not make any assumptions about wrapping of the sequence string.
-
header
¶ The header of the sequence in the FASTA file, equal to the entire first line of the FASTA record following the ‘>’ character.
Warning
You should almost certainly be using “id”, not “header”.
-
id
¶ The id of the sequence in the FASTA file, equal to the FASTA header up to the first whitespace.
-
length
¶ Get the length of the FASTA sequence
-
name
¶ DEPRECATED: The name of the sequence in the FASTA file, equal to the entire FASTA header following the ‘>’ character
-
reverseComplement
(preserveHeader=False)¶ Return a new FastaRecord with the reverse-complemented DNA sequence. Optionally, supply a name
-
sequence
¶ The sequence for the record as present in the FASTA file. (Newlines are removed but otherwise no sequence normalization is performed).
-
-
class
pbcore.io.
FastaReader
(f)¶ Streaming reader for FASTA files, useable as a one-shot iterator over FastaRecord objects. Agnostic about line wrapping.
Example:
>>> from pbcore.io import FastaReader >>> from pbcore import data >>> filename = data.getTinyFasta() >>> r = FastaReader(filename) >>> for record in r: ... print record.header, len(record.sequence) ref000001|EGFR_Exon_2 183 ref000002|EGFR_Exon_3 203 ref000003|EGFR_Exon_4 215 ref000004|EGFR_Exon_5 157 >>> r.close()
-
class
pbcore.io.
FastaWriter
(f)¶ A FASTA file writer class
Example:
>>> from pbcore.io import FastaWriter >>> with FastaWriter("output.fasta.gz") as writer: ... writer.writeRecord("dog", "GATTACA") ... writer.writeRecord("cat", "CATTACA")
(Notice that underlying file will be automatically closed after exit from the with block.)
-
writeRecord
(*args)¶ Write a FASTA record to the file. If given one argument, it is interpreted as a
FastaRecord
. Given two arguments, they are interpreted as the name and the sequence.
-
FASTQ Format¶
FASTQ is a standard format for sequence data with associated quality scores.
-
class
pbcore.io.
FastqRecord
(header, sequence, quality=None, qualityString=None)¶ A
FastqRecord
object models a named sequence and its quality values in a FASTQ file. For reference consult Wikipedia’s FASTQ entry. We adopt the Sanger encoding convention, allowing the encoding of QV values in [0, 93] using ASCII 33 to 126. We only support FASTQ files in the four-line convention (unwrapped). Wrapped FASTQ files are generally considered a bad idea as the @, + delimiters can also appear in the quality string, thus parsing cannot be done safely.-
comment
¶ The comment associated with the sequence in the FASTQ file, equal to the contents of the FASTQ header following the first whitespace
-
classmethod
fromString
(s)¶ Interprets a string as a FASTQ record. Only supports four-line format, as wrapped FASTQs can’t easily be safely parsed.
-
header
¶ The header of the sequence in the FASTQ file
-
id
¶ The id of the sequence in the FASTQ file, equal to the FASTQ header up to the first whitespace.
-
length
¶ The length of the sequence
-
name
¶ DEPRECATED: The name of the sequence in the FASTQ file
-
quality
¶ The quality values, as an array of integers
-
qualityString
¶ The quality values as an ASCII-encoded string
-
reverseComplement
(preserveHeader=False)¶ Return a new FastaRecord with the reverse-complemented DNA sequence. Optionally, supply a name
-
sequence
¶ The sequence for the record as present in the FASTQ file.
-
-
class
pbcore.io.
FastqReader
(f)¶ Reader for FASTQ files, useable as a one-shot iterator over FastqRecord objects. FASTQ files must follow the four-line convention.
-
class
pbcore.io.
FastqWriter
(f)¶ A FASTQ file writer class
Example:
>>> from pbcore.io import FastqWriter >>> with FastqWriter("output.fq.gz") as writer: ... writer.writeRecord("dog", "GATTACA", [35]*7) ... writer.writeRecord("cat", "CATTACA", [35]*7)
(Notice that underlying file will be automatically closed after exit from the with block.)
-
writeRecord
(*args)¶ Write a FASTQ record to the file. If given one argument, it is interpreted as a
FastqRecord
. Given three arguments, they are interpreted as the name, sequence, and quality.
-
GFF Format (Version 3)¶
The GFF format is an open and flexible standard for representing genomic features.
-
class
pbcore.io.
Gff3Record
(seqid, start, end, type, score='.', strand='.', phase='.', source='.', attributes=())¶ Class for GFF record, providing uniform access to standard GFF fields and attributes.
>>> from pbcore.io import Gff3Record >>> record = Gff3Record("chr1", 10, 11, "insertion", ... attributes=[("foo", "1"), ("bar", "2")]) >>> record.start 10 >>> record.foo '1' >>> record.baz = 3 >>> del record.baz
Attribute access using record.fieldName notation raises
ValueError
if an attribute named fieldName doesn’t exist. Use:>>> record.get(fieldName)
to fetch a field or attribute with None default or:
>>> record.get(fieldName, defaultValue)
to fetch the field or attribute with a custom default.
-
copy
()¶ Return a shallow copy
-
classmethod
fromString
(s)¶ Parse a string as a GFF record. Trailing whitespace is ignored.
-
-
class
pbcore.io.
GffReader
(f)¶ A GFF file reader class
-
class
pbcore.io.
GffWriter
(f)¶ A GFF file writer class