API Documentation
Interval
- class Interval(chromosome, strand, start, end, reference_genome, anchor=None, anchor_offset=0)[source]
A genomic interval.
An Interval represents a contiguous stranded genomic interval, identified by a chromosome, strand, start position, end position, and reference genome.
- __init__(chromosome, strand, start, end, reference_genome, anchor=None, anchor_offset=0)[source]
Initialize an interval from a DNA0 position.
- Parameters:
chromosome (
str
) – The chromosome name (‘chr1’, …).strand (
chr
) – The strand (“+” or “-“).start (
int
) – The start position of the interval (inclusive, 0-based).end (
int
) – The end position of the interval (exclusive, 0-based).reference_genome (
str
|Genome
) – The reference genome in which the interval is defined, for example “hg19”. If a genome object is given, its reference genome will be used.anchor (
int
|str
|Interval
| None, optional) – The anchor is only relevant when extracting DNA from a VariantGenome and specifies how length-changing variants are applied. Conceptually, the anchor defines a fixed point when the sequence changes length. See (TODO: link) for the details. The anchor can be any position in the interval. If anchor is None, then length-changing variants will change the length of the interval. Anchor can be a string: 5p or 3p sets the anchor to the respective end of the interval, respecting the sense strand. start or end set the anchor to the start or end of the interval on the forward strand. When anchor is center then the anchor is set to the middle position of the interval (if the interval has odd length, the position is rounded towards the 5p end of the interval, respecting the sense strand). If anchor is an integer or an empty interval, the anchor is set to that position.anchor_offset (
int
, optional) – The anchor offset is a shift that’s applied to the anchor only on the variant genome. The purpose is to align a position on the reference genome with a position inside an insertion. The anchor itself would only allow to align with the beginning or end of the insertion.
- property chromosome
The chromosome name.
A shorthand property chrom is also available.
- Returns:
The name of the chromosome this interval is defined on (e.g. “chr1”).
- Return type:
- property strand
The strand (“+” or “-“).
- Returns:
The strand this interval is defined on.
- Return type:
chr
- property start
The integer position of the start (inclusive, 0-based).
- Returns:
The start position of this interval within the reference genome.
- Return type:
- property end
The integer position of the end (exclusive, 0-based).
- Returns:
The end position of the this interval within the reference genome.
- Return type:
- property end5
The 5p end of this interval, which is a length-0 interval.
Not to be confused with the end attribute, which
- Returns:
The length-0 interval at the 5p end of the interval, i.e. the gap preceding the most upstream base in the interval.
- Return type:
- property end3
The 3p end of this interval, which is a length-0 interval.
- Returns:
The length-0 interval at the 3p end of the interval, i.e. the gap succeeding the most downstream base in the interval.
- Return type:
- property reference_genome
The reference genome (assembly name in UCSC format).
A shorthand property refg is also available.
- Returns:
The name of the reference genome this interval is defined in.
- Return type:
- property anchor
The anchor position.
- property anchor_offset
The anchor offset.
- Returns:
The anchor offset associated with this interval, if any.
- Return type:
- expand(upstream, dnstream=None)[source]
The interval expanded upstream/downstream.
Can also be called with a single argument, in which case upstream and dnstream take on the same value.
- Parameters:
- Returns:
An interval containing upstream + downstream + 1 positions.
- Return type:
- subtract(other: Interval) Sequence[Interval] [source]
A list of intervals, representing this interval with the its intersection of another interval removed.
- Parameters:
other – interval to subtract its intersection from this interval.
- Returns:
The intervals after subtraction. If the intervals were disjoint, returns [self].
- Return type:
Sequence[Interval]
- is_positive_strand()[source]
Test if this interval is on the positive strand.
- Returns:
True if strand is equal to ‘+’
- Return type:
- as_positive_strand()[source]
The same interval on the positive strand.
- Returns:
The same interval, but on the positive (‘+’) strand.
- Return type:
- as_negative_strand()[source]
The same interval on the negative strand.
- Returns:
The same interval, but on the negative (‘-’) strand.
- Return type:
- as_opposite_strand()[source]
The same interval on the opposite strand.
- Returns:
The same interval, but on the opposite strand.
- Return type:
- within(other)[source]
Test if this interval lies within the extents of other.
Use of
interval in other
expressions is also available.
- static from_dna0_coord(chromosome, strand, position, reference_genome)[source]
Create an interval spanning a single DNA0 position.
Equivalent to
Interval(chromosome, strand, position, position+1, reference_genome)
.- Parameters:
chromosome (
str
) – The chromosome name (e.g. “chr1”).strand (
int
) – The strand (“+” or “-“).position (
int
) – The position the interval should span (0-based).reference_genome (
str
|Genome
) – The reference genome in which the interval is defined, for example “hg19”. If a genome object is given, its reference genome will be used.
- Returns:
The resulting length-1 interval spanning position.
- Return type:
- static from_dna0(chromosome, strand, start, end, reference_genome)[source]
Create an interval from DNA0 positions.
Equivalent to calling
Interval
.- Parameters:
chromosome (
str
) – The chromosome name (e.g. “chr1”).strand (
int
) – The strand (“+” or “-“).start (
int
) – The start position of the interval (inclusive, 0-based).end (
int
) – The end position of the interval (exclusive, 0-based).reference_genome (
str
|Genome
) – The reference genome in which the interval is defined, for example “hg19”. If a genome object is given, its reference genome will be used.
- Returns:
The resulting interval.
- Return type:
- static from_dna1(chromosome, strand, start, end, reference_genome)[source]
Create an interval from DNA1 positions.
Positions are automatically converted to DNA0 as (start-1, end).
- Parameters:
chromosome (
str
) – The chromosome name (e.g. “chr1”).strand (
int
) – The strand (“+” or “-“).start (
int
) – The start position of the interval (inclusive, 1-based).end (
int
) – The end position of the interval (inclusive, 1-based).reference_genome (
str
|Genome
) – The reference genome in which the interval is defined, for example “hg19”. If a genome object is given, its reference genome will be used.
- Returns:
The resulting interval.
- Return type:
- static from_rna1(chromosome, start, end, reference_genome)[source]
Create an interval from RNA1 positions.
The start/end positions are automatically converted to DNA0 as min(abs(start),abs(end))-1 and max(abs(start), abs(end))) respectively.
- Parameters:
chromosome (
str
) – The chromosome name (e.g. “chr1”).strand (
int
) – The strand (“+” or “-“).start (
int
) – The start position of the interval (signed, inclusive, 1-based).end (
int
) – The end position of the interval (signed, inclusive, 1-based).reference_genome (
str
|Genome
) – The reference genome in which the interval is defined, for example “hg19”. If a genome object is given, its reference genome will be used.
- Returns:
The resulting interval.
- Return type:
- static spanning(interval1, interval2)[source]
Create the smallest interval spanning the given pair of existing intervals, or raise an exception if given intervals not on the same chromosome, strand, and reference genome. The given intervals are not required to be overlapping, and may both be of zero length.
Example:
>>> a = Interval("chr1", "+", 100, 150, "hg19") >>> b = Interval("chr1", "+", 200, 250, "hg19") >>> Interval.spanning(a, b) Interval("chr1", "+", 100, 250, "hg19")
Notes
This method is not anchor aware, since that requires a list of variants for context. See
spanning()
for that use case.
- property midpoint
Identify the midpoint of an interval.
Example:
>>> Interval("chr1", "+", 100, 150, "hg19").midpoint Interval("chr1", "+", 125, 125, "hg19") >>> Interval("chr1", "+", 100, 151, "hg19").midpoint Interval("chr1", "+", 125, 126, "hg19")
- Returns:
The interval spanning the midpoint of the interval.
- Return type:
- distance(other, method='midpoint')[source]
The distance this interval and another.
Example:
>>> interval_a = Interval("chr1", "+", 100, 150, "hg19") >>> interval_b = Interval("chr1", "+", 200, 300, "hg19") >>> interval_c = Interval("chr1", "+", 400, 500, "hg19") >>> interval_a.distance(interval_b) 125.0 >>> interval_a.distance([interval_b, interval_c]) [125.0, 325.0]
- Parameters:
- Returns:
Distance to the interval. If other is of type Interval, return distance as a float; otherwise, a list of distances.
- Return type:
- as_dna0()[source]
Returns (start, end) positions using DNA0 convention.
- Returns:
The positions
(start, end)
using DNA0 convention.- Return type:
- as_dna1()[source]
Returns (start, end) positions using DNA1 convention.
- Returns:
The positions
(start, end)
using DNA1 convention.- Return type:
- as_rna1()[source]
Returns (start, end) positions using RNA1 convention.
- Returns:
The positions
(start, end)
using RNA1 convention.- Return type:
- as_ucsc()[source]
Returns this interval as a UCSC browser coordinate.
- Returns:
A string in UCSC browser convention, e.g.
"chr5:500201-5002010"
.- Return type:
Variant
- class Variant(chromosome, start, ref, alt, reference_genome)[source]
A variant.
Bases:
Interval
- __init__(chromosome, start, ref, alt, reference_genome)[source]
Initialize a variant from a DNA0 position.
- property interval
The interval spanned by this variant in the reference genome.
Note that
Variant
also inherits fromInterval
.- Returns:
The interval spanned by this variant’s ref sequence in the reference genome.
- Return type:
- property position
The integer position of the start (inclusive, 0-based).
This property is a synonym for start and is provided for historical reasons. For consistency with other interval-based objects, prefer the start property.
- Returns:
The start position of this interval within the reference genome.
- Return type:
- property ref
The reference allele as a DNA string.
- Returns:
The sequence of the reference allele that is altered by this variant.
- Return type:
- property alt
The alternate allele as a DNA string.
- Returns:
The sequence of the alternate allele that this variant represents.
- Return type:
- as_variant_string()[source]
Returns a string representation of the variant.
See also
- Returns:
The variant in the format
"chromosome:start:ref:alt"
, where start is in DNA1 coordinates.- Return type:
- __str__()
Returns a string representation of the variant.
See also
- Returns:
The variant in the format
"chromosome:start:ref:alt"
, where start is in DNA1 coordinates.- Return type:
- static from_string(variant, reference_genome)[source]
Initialize a variant object from a variant string.
- Parameters:
See also
- static spanning(x, y, variants=None)[source]
Extends
spanning()
such that it is aware of variants and anchors.Example:
>>> from genome_kit import Genome, Variant, Interval >>> hg19 = Genome("hg19") >>> v = Variant("chr1", 100, "", 10 * "N", hg19) >>> a = Interval("chr1", "+", 100, 150, hg19, 100) >>> b = Interval("chr1", "+", 200, 250, hg19) >>> Interval.spanning(a, b) Interval("chr1", "+", 100, 250, "hg19") >>> Variant.spanning(a, b, [v]) Interval("chr1", "+", 100, 260, "hg19", 100)
Genome
- class Genome(config)[source]
Initialize a reference genome and its corresponding resources (DNA, annotations, etc).
A genome object represents a reference genome and resources available for it, such as DNA and annotations:
>>> genome = Genome("gencode.v19") >>> exon = genome.exons[0] # 1st annotated exon. >>> dna = genome.dna(exon) # Extract DNA.
The initializer argument specifies a versioned resource:
# Access resources that are hg19 Genome("hg19") # GENCODE v29 implies hg38 Genome("gencode.v29")
The currently recognized strings depend on the data made available by data_manager. Some common ones are:
reference_genomes = [ "hg19", "hg19.p13.plusMT", "hg38", "hg38.p12", "hg38.p13", "hg38.p14", "mm10.p6", "mm39", "rn6", "macFas5", "susScr11", ] annotations = [ "gencode.v19", # implies hg19.p13.plusMT "gencode.v29", # implies hg38.p12 "gencode.v29.basic", # implies hg38.p12 "gencode.v29lift37", # implies hg19 "gencode.v29lift37.basic", # implies hg19 "gencode.v41", # implies hg38.p13 "gencode.v41.basic", # implies hg38.p13 "gencode.v41lift37", # implies hg19 "gencode.v41lift37.basic", # implies hg19 "ucsc_refseq.2017-06-25", # implies hg19 "ncbi_refseq.v105.20190906", # implies hg19.p13.plusMT "ncbi_refseq.v109", # implies hg38.p12 "ncbi_refseq.v110", # implies hg38.p14 "gencode.vM19", # implies mm10.p6 "gencode.vM19.basic", # implies mm10.p6 "gencode.vM30", # implies mm39 "gencode.vM30.basic", # implies mm39 "gencode.vM31", # implies mm39 "gencode.vM31.basic", # implies mm39 "ncbi_refseq.m39.v109", # implies mm39 "ensembl.Rnor_6.0.88", # implies rn6 "ncbi_refseq.Macfas_5.0.v101", # implies macFas5 "ensembl.Macfas_5.0.95", # implies macFas5 "ensembl.Sscrofa11.1.98", # implies susScr11 ]
If a resource is requested after object creation, but the version needed cannot be disambiguated from config, an exception is raised.
- Parameters:
config (
str
) – Identifies a versioned resource that this genome object is expected to serve.
- interval(chromosome, strand, start, end)[source]
Shorthand method for creating an Interval with the same refg as the genome object called on.
- variant(variant_string)[source]
Shorthand method for creating a Variant with the same refg as the genome object called on.
- dna(interval, allow_outside_chromosome=True)[source]
Extract DNA from a reference genome.
If interval is on the negative strand, the reverse complement sequence is returned. If interval is outside the range of the chromosome, an
IndexError
is raised.- Parameters:
interval (
Interval
) – The query interval.allow_outside_chromosome (
bool
) – If False, does not allow the interval to be outside the range of the chromosome. Attempting to pass an interval outside the chromosome range will raise anIndexError
. If not specified, defaults to True. Any base outside the chromosome range will be returned as an ‘N’.
- Returns:
The DNA sequence.
- Return type:
- variant_dna(interval, variants, allow_outside_chromosome=True)[source]
Extract DNA from a genome with specific variants applied to it.
Follows the same rules as
dna()
. Intended for applying sets of variants that may overlap interval.
- property annotation
Access to annotations.
A shorthand property anno is also available.
- Returns:
An object that provides access to annotated genes, transcripts, and exons.
- Return type:
- property genes
Access to annotated genes.
This is shorthand for annotation.genes.
- Returns:
An object that supports looping or queries over annotated genes.
- Return type:
- property transcripts
Access to annotated transcripts.
This is shorthand for annotation.transcripts.
A shorthand property trans is also available.
- Returns:
An object that supports looping or queries over annotated transcripts.
- Return type:
- property exons
Access to annotated exons.
This is shorthand for annotation.exons.
- Returns:
An object that supports looping or queries over annotated exons.
- Return type:
- property introns
Access to annotated introns.
This is shorthand for annotation.introns.
- Returns:
An object that supports looping or queries over annotated introns.
- Return type:
- property cdss
Access to annotated coding sequences (CDSs).
This is shorthand for annotation.cdss.
- Returns:
An object that supports looping or queries over annotated coding sequences.
- Return type:
- property utr5s
Access to annotated UTR5 sequences.
This is shorthand for annotation.utr5s.
- Returns:
An object that supports looping or queries over annotated coding sequences.
- Return type:
Utr5Table
- property utr3s
Access to annotated UTR3 sequences.
This is shorthand for annotation.utr3s.
- Returns:
An object that supports looping or queries over annotated coding sequences.
- Return type:
Utr3Table
- find_motif(interval, motif, match_position=0, find_overlapping_motifs=False)[source]
Find a genomic motif in an interval on a
Genome
.- Parameters:
interval (
Interval
) – A genomic interval to search for the motif.motif (
str
) – A genomic motif to search for.match_position (
int
|str
) – Very often you want to find a position after a motif or in the middle of a motif. For example, when you are searching for aAG
core acceptor site motif you want to match the 3p end of theAG
motif to return the position of the putative splice site. To do this you can specify a match position as string with the values"3p"
or"5p"
, or you can pass an integer that indicates a position in the motif with0 <= match_position <= len(motif)
.find_overlapping_motifs (
bool
) – Whether to return matches for overlapping motifs. The default is to only find non-overlapping motifs.
- Returns:
A list of all matches (length-0 intervals) found in the query interval.
- Return type:
Example
>>> genome37 = Genome('hg19') >>> interval = genome37.interval('chr1', '+', 40000, 40080) >>> motif = 'AA' >>> genome37.find_motif(interval, motif) [Interval("chr1", "+", 40031, 40031, "hg19"), Interval("chr1", "+", 40061, 40061, "hg19")]
- property reference_genome
The name of the reference genome.
- Returns:
The name of the reference genome that any resources map to, e.g. “hg19” or “hg38”.
- Return type:
- property chromosomes
A list of valid chromosome names for this genome.
- chromosome_size(chromosome)[source]
Returns the size of the given chromosome, in bp.
- Returns:
The size of the given chromosome, as number of basepairs.
- Return type:
- property config
The configuration strings of this genome.
- Returns:
The config used to initialize this object.
- Return type:
- property data_dir
The data_dir from which resources are opened.
- Returns:
The path to where resource files are being opened.
- Return type:
- appris_transcripts(gene=None)[source]
Returns a list of transcript objects ordered by our custom APPRIS index. If a gene is provided, returns a list of transcripts for that gene ordered by decreasing priority.
- Parameters:
gene (
Gene
|str
, optional) – Restrict to a specific gene. If using str, use the geneID.- Returns:
A list of transcripts that have APPRIS scores available. If a gene is provided as input, a list of transcripts for that gene in the order from most to least principle; so the first transcript should be used as the canonical.
- Return type:
list
ofTranscript
- Raises:
ApprisNotAvailableError: – APPRIS data is not available for this genome’s annotations.
- appris_principality(transcript, as_string=False)[source]
Returns the APPRIS principality index (isoform specific number) for this transcript. An APPRIS principality index is an integer from 0..6 where 0 is considered most prinicpal and 6 considered least principal. In the original APPRIS database, the 7 principality levels are reported using strings, in the following order:
appris_index_names = [ "PRINCIPAL:1", # index 0 "PRINCIPAL:2", # index 1 "PRINCIPAL:3", # index 2 "PRINCIPAL:4", # index 3 "PRINCIPAL:5", # index 4 "ALTERNATIVE:1", # index 5 "ALTERNATIVE:2", # index 6 ]
See the APPRIS database documentation for details.
The APPRIS version defaults to 2017_05.v22 most annotations; see _build_appris.REMOTE_FILES for the rest.
- Parameters:
transcript (
Transcript
|str
) – A transcript to query.as_string (
bool
, optional) – Toggle True if the APPRIS principality string (e.g. “PRINCIPAL:1”) is desired.
- Returns:
The APPRIS principality index [0..6] of this transcript, or the string representation if as_string=True
- Return type:
- Raises:
ApprisNotAvailableError: – APPRIS data is not available for this genome’s annotations.
- __weakref__
list of weak references to the object
VariantGenome
- class VariantGenome(reference_genome, variants)[source]
A reference genome with a number of variants applied to it.
A variant genome behaves like a
Genome
object, object, except any requests for DNA sequence (or quantities that depend on DNA sequence) will be returned with the variants already applied. If the variants change the length of the sequence, then the interval’s anchor and anchor_offset determine how the query interval will be aligned to the variant genome’s coordinate system.TODO: see explanation of anchors.
- __init__(reference_genome, variants)[source]
Initialize a variant genome.
- Parameters:
reference_genome (
Genome
) – The reference genome to which variants are applied.variants (
Variant
|list
of (Variant
)) – Either a single variant or a list of variants. Aligned-normalization of variants is not yet implemented, so the user is currently responsible for not passing variants that overlap. The variants must be provided as aVariant
object. String variants are no longer allowed, please seeVariant
for more information. Since we assume that variants never overlap one another this means that all variant positions are with respect to the reference genome.
- dna(interval, allow_outside_chromosome=True)[source]
Extract variant DNA for an anchored interval or coordinate.
- find_motif(interval, motif, match_position=0, find_overlapping_motifs=False)[source]
Find a genomic motif in an interval on a
VariantGenome
.This method finds all occurrences of a genomic motif on a VariantGenome. It can handle all kinds of variants including indels and complex variants. When a motif is found within an insertion, it will set the
anchor_offset
attribute on the returnedInterval
.- Parameters:
interval (
Interval
) – A genomic interval to search for the motif.motif (
str
) – A genomic motif to search for.match_position (
int
|str
) – Very often you want to find a position after a motif or in the middle of a motif. For example, when you are searching for aAG
core acceptor site motif you want to match the 3p end of theAG
motif to return the position of the putative splice site. To do this you can specify a match position as string with the values"3p"
or"5p"
, or you can pass an integer that indicates a position in the motif with0 <= match_position <= len(motif)
.find_overlapping_motifs (
bool
) – Whether to return matches for overlapping motifs. The default is to only find non-overlapping motifs.
- Returns:
A list of all matches (length-0 intervals) found in the query interval.
- Return type:
py:class`list` of
Interval
Example
>>> from genome_kit import Genome, Variant >>> genome = Genome('hg19') >>> variant = Variant.from_string("chr1:40033:A:G", genome) >>> variant_genome = VariantGenome(genome, variant) >>> interval = genome.interval('chr1', '+', 40000, 40080) >>> motif = 'TAG' >>> variant_genome.find_motif(interval, motif) [Interval("chr1", "+", 40030, 40030, "hg19", 40030)]
Tracks
GenomeDNA
- class GenomeDNA[source]
Access to the DNA sequence of a reference genome.
This object should be accessed through the dna property of
Genome
:>>> from genome_kit import Genome, Interval >>> interval = Interval("chr5", "+", 50000, 50010, "hg19") >>> genome = Genome("hg19") >>> genome.dna(interval) 'TAAACCACAT'
- __call__(interval, allow_outside_chromosome=True)[source]
Extract DNA from a reference genome.
If interval is on the negative strand, the reverse complement sequence is returned. If interval is outside the range of the chromosome, an
IndexError
is raised.- Parameters:
interval (
Interval
) – The query interval.allow_outside_chromosome (
bool
) – If False, does not allow the interval to be outside the range of the chromosome. Attempting to pass an interval outside the chromosome range will raise anIndexError
. If not specified, defaults to True. Any base outside the chromosome range will be returned as an ‘N’.
- Returns:
The DNA sequence.
- Return type:
- property reference_genome
The name of the reference genome.
- Returns:
The name of the reference genome from which DNA will be extracted, e.g. “hg38”.
- Return type:
- property filename
The path to the file from which DNA is extracted.
- Returns:
The path to the file, e.g. “/data/hg19.2bit”
- Return type:
GenomeTrack
- class GenomeTrack(infile)[source]
Access to a genomic data track.
A
GenomeTrack
provides a read-only view of a .gtrack file. A track can be easily opened and queried:>>> interval = Interval('chr17', '+', 41246881, 41246886, 'hg19') >>> track = GenomeTrack("hg19.phastcons_mammal.f4.gtrack") >>> track(interval) array([[ 1. ], [ 0.857], [ 0.571], [ 0.357], [ 0.286]], dtype=float16) >>> track(interval.as_opposite_strand()) array([[ 0.286], [ 0.357], [ 0.571], [ 0.857], [ 1. ]], dtype=float16)
You can also open a track with a specific scope:
>>> with GenomeTrack("foo.gtrack") as track: ... values = track(interval)
Tracks can be encoded in formats called etypes. The available etypes trade representation power for storage size and query speed:
etypes = [ # DECODABLE TYPES RANGE ENCODED TYPE "m0", # float16/32, uint8, bool {1} 0-bit (mask) "u1", # float16/32, uint8, bool {0..1} 1-bit "u2", # float16/32, uint8 {0..3} 2-bit "u3", # float16/32, uint8 {0..7} 3-bit "u4", # float16/32, uint8 {0..15} 4-bit "u5", # float16/32, uint8 {0..31} 5-bit "u6", # float16/32, uint8 {0..63} 6-bit "u8", # float16/32, uint8 {0..255} 8-bit "i8", # float16/32, int8 {-128..127} 8-bit "f2", # float16/32 [-65504,65504] 2-bit dictionary "f3", # float16/32 [-65504,65504] 3-bit dictionary "f4", # float16/32 [-65504,65504] 4-bit dictionary "f5", # float16/32 [-65504,65504] 5-bit dictionary "f6", # float16/32 [-65504,65504] 6-bit dictionary "f8", # float16/32 [-65504,65504] 8-bit dictionary "f16", # float16/32 [-65504,65504] 16-bit float "f32", # float16/32 [-1.2e-38, 3.4e38] 32-bit float ]
The default dictionary for f2..8 types is the range [0,1] computed as
np.linspace(0.0, 1.0, 2**num_bits, dtype=np.float16)
.Decoded types are called dtypes, and are indeed represented by numpy dtypes:
dtypes = [ np.bool_, # {False, True} np.uint8, # {0..255} np.int8, # {-128..127} np.float16, # [-65504, 65504] IEEE 754 half-precision np.float32, # [-1.2e-38, 3.4e38] ]
New tracks can be can created using
GenomeTrackBuilder
.- __call__(interval, dtype=None, out=None)[source]
Extract data from a genomic track.
Data is always returned in sense-strand order, meaning it is always sensitive to the strand of the interval, even on an unstranded track.
You can optionally specify the dtype that a track should be decoded to.
- Parameters:
interval (
Interval
) – The stranded query interval.dtype (
type
) – Optional. The numpy dtype of the resulting array. Ignored ifout
is provided.out (
ndarray
) – Optional. A numpy array to hold the result. If provided, it overridesdtype
and it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned.
- Returns:
The decoded track data with
len(interval)
rows anddim
columns.- Return type:
- property dim
The dimensionality of the track.
- Returns:
The number of columns in each returned array.
- Return type:
- property resolution
The resolution of the track.
- Returns:
The number of genomic positions spanned by each datum.
- Return type:
- property stranded
The strandedness of the track.
- Returns:
Whether the negative strand stores data unique from the positive strand.
- Return type:
- property etype
The encoding type of the track.
- Returns:
A string identifying the encoding type.
- Return type:
- property dtype
The default dtype for the track.
- Returns:
The numpy dtype that this track will decode to by default.
- Return type:
- property reference_genome
The name of the reference genome this track is mapped to.
- Returns:
The name of the reference genome, e.g. “hg38”.
- Return type:
- property filename
The path to .gtrack file from which data is extracted.
- Returns:
The path to the file, e.g. “/data/phastcons.gtrack”
- Return type:
- histogram(region, separate_dims=False)[source]
Generate a histogram of the values in the track.
Returns a dictionary where each key is a float appearing at least once in the track, and each value is a count of the number of times it occurred. NaN values are omitted.
If region is a
Genome
, computes a genome-wide histogram. If region is an interval or list of intervals, computes the histogram over that region only.If separate_dims is true, returns a list of
dim
histograms, one for each track dimension.- Parameters:
- Returns:
A dictionary
{ value : count }
where the value appearing in the track count times. Zero- Return type:
- close()[source]
Close the file handle.
Note that if you use a with statement the file handle is automatically closed:
# Open the file with GenomeTrack('foo.gtrack') as track: ... # <-- File is now closed
GenomeTrackBuilder
- class GenomeTrackBuilder(outfile, etype, strandedness, reference_genome, dim=1, resolution=1)[source]
Builds a .gtrack file.
Creating a .gtrack file with
GenomeTrackBuilder
generally involves four steps:Open the file for writing.
Set further configuration options.
Set data on intervals, either as numpy arrays or from disk.
Call
finalize()
.
For example, assuming we had a list of genomic intervals and corresponding data arrays we wanted to fill them with:
# 1. Open a .gtrack 8-bit encoding of floats in range [0, 1] track = GenomeTrackBuilder("foo.gtrack", "f8", "single_stranded", Genome("hg19")) # 2. Configure the track further track.set_default_value(0) # Fill gaps in the data with 0 # 3. Set data for each interval for interval, data in entries: track.set_data(interval, data) # 4. Finish writing the file track.finalize()
Tracks can have any dimension (the dim argument) or resolution (the resolution argument). Resolution allows track data to be specified and stored at coarser than 1bp resolution.
For example, a track with resolution=5 can load a WIG file with step/span of 5bp:
fixedStep chrom=chr1 start=1 step=5 span=5 0.57 0.23 ...
The resulting track is equivalent to the following:
0 1 2 3 4 5 6 7 8 9 ... <-- position 0.57 0.57 0.57 0.57 0.57 0.23 0.23 0.23 0.23 0.23 ... <-- decoded value
- __init__(outfile, etype, strandedness, reference_genome, dim=1, resolution=1)[source]
Open a .gtrack for writing.
- Parameters:
outfile (
str
) – The .gtrack file to create.etype (
str
) – The encoding format. SeeGenomeTrack
for details.strandedness (
str
) –Determines the order by which the data is applied.
"single_stranded"
: both strands share the same data. The data is applied in Interval coordinate (reference strand) order."strand_unaware"
: ignores the Interval strand, data is applied in Interval coordinate (reference strand) order."strand_aware"
: data is applied from 5” end to 3” end (sense strand order).
Negative strand data from bedgraph or wig files must be provided from a separate file. When calling
set_data_from_bedgraph()
orset_data_from_wig()
, set"strand_unaware"
.reference_genome (
Genome
) – The reference genome for the track data to build.dim (
int
) – Optional. The number of dimensions in the resulting track.resolution (
int
) – Optional. The resolution of the track data, in genomic positions.
- set_default_value(value)[source]
Set the fill value for gaps in the track.
The default value does not need to be encodable. For example, a default value of -1.0 for a dictionary that does not contain -1.0 is fine.
- Parameters:
value (
float
) – The value to fill gaps with.
- set_sparsity(min_run=48, min_delta=0.0)[source]
Enable automatic sparsification of data intervals.
Reduces file size of tracks that are dominated by default_value, with non-default values appearing at sparse intervals. Highly recommended for variableStep WIG files.
The default behaviour of a track is to encode all data passed to
set_data()
. After callingset_sparsity()
, the encoder will analyze the data arrays passed in and avoid encoding runs of default_value.Specifically, if there exists a run of at least min_run values such that each
abs(data[i] - default_value) <= min_delta
, then that entire run will be treated as default_value and not explicitly stored in the file.Note that, in order to exclude a run from the track, a 12-byte entry must be addd to the track index, so setting min_run to very small values can actually increase file size and will definitely increase query time.
- set_clamping()[source]
Clamp all values to the encodable range.
By default, a value outside the encodable range will trigger an error, to notify the user that his/her data may not be encoded properly by the current format.
Calling this method shows intent that values should be clamped to the encodable range. Currently allowed only for f-type encodings.
- set_dict(dict)[source]
Set the dictionary for etypes f2..8.
The dictionary must be a float16/32 numpy array. For etype f<n>, there must be 2**n entries in the array. Numerical values must be in non-decreasing order. The last entry in dict can be
np.nan
if you wish to be able to encode NaN at specific positions.- Parameters:
dict (
np.ndarray
) – The array of values comprising the dictionary.
- set_transform(a, b, c, d)[source]
Transform data from range [a, b] to range [c, d].
This specifies that any time data is about to be encoded, it will undergo an affine transformation from numeric range [a, b] to range [c, d].
- set_restriction(restriction)[source]
Restrict data to a specific interval.
The restriction interval acts as a allowed region, so that all data outside that region is either ignored or cropped away by
set_data()
.The restriction interval’s strand is ignored, i.e. the restriction always allows data on either strand.
This method is useful for making small version of full-sized data pipeline, for the sake of creating unit tests or iterating faster during development.
If the restriction interval is not aligned with the track resolution, some then the restriction interval will be expanded until it is aligned.
- Parameters:
restriction (
Interval
) – The interval on which to keep data.
- set_data(interval, data)[source]
Set the track data on a specific interval.
If the interval overlaps a previously-specified interval, an error is raised.
The dtype of the data array must be compatible with the etype of the track being built.
Normally, data must be an NxM array where N is the length of interval and M is dim.
If track resolution R is greater than 1, then data must be (N/R)xM where N is the length of interval. The start and end of interval must be aligned to R. In other words, data is supplied (and stored) at a coarsened resolution.
- Parameters:
interval (
Interval
) – The interval to fill with data.data (
np.ndarray
) – The data array, ordered according to the strandedness argument passed to the builder’s constructor.
- set_data_from_wig(pos_strand_file, neg_strand_file=None)[source]
Load all data from a .wig or .wig.gz file.
Loads all data from a fixedStep or variableStep WIG file.
Set
strandedness="strand_unaware"
in__init__()
.The number of data columns in the WIG must match the track dimension. The span of the data must match the track resolution, with the exception of the very last datum on a chromosome.
Adjacent WIG intervals will automatically be merged into a single GTRACK interval, with different data encoded for each spanned position. Consider calling
set_sparsity()
.
- set_data_from_bedgraph(pos_strand_file, neg_strand_file=None)[source]
Load all data from a .bedgraph or .bedgraph.gz file.
Loads all data from a BEDGRAPH file. A direct analogue of
set_data_from_wig()
.Set
strandedness="strand_unaware"
in__init__()
.
- set_data_from_bed(bedfile, categories=None)[source]
Load all data from a .bed or .bed.gz file.
Loads each interval from a BED file and incorporates it into the track.
If categories is not specified, then the value contained in the score column of the BED file will be stored in the corresponding interval.
If categories is specified, then the name column of the BED file determines the value stored on each interval. The interval’s name will be used to find an index in categories, and that index will be stored.
The strandedness of the track must match the BED file (i.e. an unstranded track expects the strand column to contain
'.'
). The track dimension must be 1. The start and end of each interval must match the track resolution, with the exception of the very last position of a chromosome.Adjacent BED intervals will automatically be merged into a single contiguous GTRACK interval. In a GTRACK file, the data value of an interval is repeated for the size of the interval, unlike in a BED file where it’s specified only once per interval. This can result in large GTRACK files, especially when at 1bp resolution (the default). Consider using
set_default_value()
andset_sparsity()
to manage file size.
- flush()[source]
Flush the currently accumulated data to disk.
This function can free up memory while building a track, but should only be called immediately after an entire chromosome has had its data set. Once this is called all chromosomes that currently have any data on them become fixed.
- finalize()[source]
Finish writing the file and close it.
This function must be called last, to complete creation of the GTRACK file.
- property dim
The dimensionality of the track.
- Returns:
The number of columns in each returned array.
- Return type:
- property resolution
The resolution of the track.
- Returns:
The number of genomic positions spanned by each datum.
- Return type:
- property stranded
The strandedness of the track.
- Returns:
Whether the negative strand stores data unique from the positive strand.
- Return type:
- property etype
The encoding type of the track.
- Returns:
A string identifying the encoding type.
- Return type:
- property dtype
The default dtype for the track.
- Returns:
The numpy dtype that this track will decode to by default.
- Return type:
- property reference_genome
The name of the reference genome the track will be mapped to.
- Returns:
The name of the reference genome, e.g. “hg38”.
- Return type:
- property filename
The path to .gtrack file being written to.
- Returns:
The path to the file, e.g. “/data/phastcons.gtrack”
- Return type:
- property data_size
The number of positions spanned by data in the GTRACK file.
If sparsity is disabled, the data size is the number of positions spanned by data via
set_data()
.Enabling sparsity may decrease the data size.
Using resolution > 1bp will also decrease the data size, since data is encoded into a coarse coordinate system, not at full genomic resolution.
The number of bits that the data occupies in the GTRACK file is approximately data_size * dim * bits_per_datum, but the exact computation is non-trivial, so portion of GTRACK file dedicated to data may be higher or lower than this rough estimate.
- Returns:
The number of data spanned by encoded data.
- Return type:
- property index_size
The number of separate intervals in the GTRACK index.
If sparsity is disabled, the index size is the number of times
set_data()
was called.Enabling sparsity may increase the index size.
- Returns:
The number of unique intervals in the index.
- Return type:
Annotations
GenomeAnnotation
- class GenomeAnnotation[source]
Annotations for a reference genome.
This object should not be directly created. Instead, create a
Genome
object:genome = Genome("gencode.v19") for gene in genome.annotation.genes: ...
The
Genome
object also exposes shortcut attributes, for convenience:for gene in genome.genes: ...
See
Genome
for a list of annotations available, e.g.gencode.v29
,ucsc_refseq.2017-06-25
, etc.Note
UCSC RefSeq can contain multiple versions of the same gene/transcript. These tend to be in ambiguously mapped regions. In GenomeKit, they’re retained as separate entries in the gene and transcript tables, with distinct intervals but otherwise identical.
For example, in GENCODE v19 the SMN1/2 genes are located on chromosome 5 at 70220767-70249769 and 69345349-69374349 respectively. In UCSC RefSeq, SMN1 is duplicated at both 70220767-70248838 and 69345349-69373418, so there are two gene entries named SMN1, each with their own version of transcripts NM_000344, NM_001297715, NM_022874. Similarly, SMN2 is duplicated at both 70220767-70248842 and 69345349-69373422.
- property genes
An indexed table of annotated genes.
- Returns:
An object that supports looping or queries over annotated genes.
- Return type:
- property transcripts
An indexed table of annotated transcripts.
A shorthand property trans is also available.
- Returns:
An object that supports looping or queries over annotated transcripts.
- Return type:
- property exons
An indexed table of annotated exons.
- Returns:
An object that supports looping or queries over annotated exons.
- Return type:
- property introns
An indexed table of annotated introns.
- Returns:
An object that supports looping or queries over annotated introns.
- Return type:
- property cdss
An indexed table of annotated coding sequences (CDSs) within exons.
This is shorthand for annotation.cdss.
- Returns:
An object that supports looping or queries over annotated coding sequences.
- Return type:
- property utr5s
An indexed table of UTR5s within exons.
This is shorthand for annotation.utr5s.
- Returns:
An object that supports looping or queries over annotated coding sequences.
- Return type:
UtrTable
- property utr3s
An indexed table of UTR3s within exons.
This is shorthand for annotation.utr3s.
- Returns:
An object that supports looping or queries over annotated coding sequences.
- Return type:
UtrTable
- property filename
The path to the file from which annotations are retrieved.
- Returns:
The path to the file, e.g. “/data/gencode.v19.annotation.dganno”
- Return type:
- static binary_version()[source]
The binary file format version that this build supports.
- Returns:
The version number.
- Return type:
- static build_gencode(infile, outfile, reference_genome)[source]
Build a binary version of GENCODE/Ensembl annotations from a GFF3 file.
Currently supports conversion from gff3[.gz].
In addition to the versioned dganno file, a cfg file will also be created.
- static build_ncbi_refseq(infile, outfile, reference_genome)[source]
Build a binary version of NCBI annotations from a GFF3 file.
Currently supports conversion from gff3[.gz].
In addition to the versioned dganno file, a cfg file will also be created.
- static build_ucsc_refseq(ucsc_db_dir, outfile, reference_genome)[source]
Build a binary version of UCSC RefSeq annotations.
The ucsc_db_dir should be a directory that already contains the following files:
refGene.txt[.gz]
refLink.txt[.gz]
These files will be parsed and used to build a dganno file.
In addition to the versioned dganno file, a cfg file will also be created.
- Parameters:
- Returns:
A list of built files created from the UCSC RefSeq files.
- Return type:
Gene
- class Gene[source]
An annotated gene.
Bases:
Interval
- property interval
The interval spanned by this gene.
Note that
Gene
also inherits fromInterval
.- Returns:
The interval spanned by this gene.
- Return type:
- property id
The ID of this gene.
Returns either an Ensembl ID (for GENCODE) or a Entrez ID (for RefSeq) depending on which annotation is being represented.
- Returns:
The ID of this transcript, e.g. “ENSG00000115275.11” or “7841”
- Return type:
- property type
The type of this gene.
This represents the gene_type field in most GFF3 annotations.
For GENCODE/Ensembl/NCBI RefSeq, the result can be any standard GENCODE biotype
For UCSC RefSeq, the result is “protein_coding” if any transcript is protein coding, and otherwise “non_coding”.
- Returns:
The type of this gene, e.g. “protein_coding” or “non_coding”
- Return type:
- property level
The evidence level of this gene.
This represents the level field in most GFF3 annotations, either 1, 2, or 3.
Available for gencode and ensembl only.
- property transcripts
The transcripts for this gene.
A shorthand property trans is also available.
- Returns:
A copy of the list of annotated transcripts for this gene.
- Return type:
list
ofTranscript
GeneTable
- class GeneTable[source]
- find_5p_aligned(interval)[source]
Returns all genes that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[gene for gene in genes if gene.end5 == interval.end5]
The results will also be sorted by 5’.
- find_3p_aligned(interval)[source]
Returns all genes that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[gene for gene in genes if gene.end3 == interval.end3]
The results will also be sorted by 3’.
- find_5p_within(interval)[source]
Returns all genes that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[gene for gene in genes if gene.end5.expand(0, 1) in interval]
The results will also be sorted by 5’.
- find_3p_within(interval)[source]
Returns all genes that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[gene for gene in genes if gene.end3.expand(1, 0) in interval]
The results will also be sorted by 3’.
- find_within(interval)[source]
Returns all genes that fall entirely within interval.
This function is a faster version of the brute-force filter:
[gene for gene in genes if gene in interval]
The results will also be sorted by 3’.
- find_overlapping(interval)[source]
Returns all genes that overlap interval.
This function is a faster version of the brute-force filter:
[gene for gene in genes if gene.overlaps(interval)]
The results will also be sorted by 5’.
- find_exact(interval)[source]
Returns all genes that span exactly interval.
This function is a faster version of the brute-force filter:
[gene for gene in genes if gene.interval == interval]
The results will also be sorted by 5’ and then 3’.
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
- __getitem__(index)[source]
Lookup a gene by index or ID string.
If index is a string the gene with matching
id
is returned (by linear search):gene = genes['ENSG00000115275'] # MOGS gene
If the ID string has a version suffix (
'ENSG00000115275.11'
) then the match must be exact.One can also iterate over all genes:
for gene in genes: ...
The results will be ordered as in the source file.
Transcript
- class Transcript[source]
An annotated transcript.
Bases:
Interval
- property interval
The interval spanned by this transcript.
Note that
Transcript
also inherits fromInterval
.- Returns:
The interval spanned by this transcript.
- Return type:
- property id
The ID of this transcript.
Returns either an Ensembl ID or a RefSeq ID depending on which annotation is being represented.
- Returns:
The ID of this transcript, e.g. “ENST00000233616.8” or “NM_006302”
- Return type:
- property tsl
The support level of this transcript.
tsl1: all splice junctions supported by at least one non-suspect mRNA
tsl2: best supporting mRNA is flagged as suspect or support from multiple ESTs
tsl3: only support is from a single EST
tsl4: best supporting EST is flagged as suspect
tsl5: no single transcript supports the model structure
See Ensemble transcript support level.
Available for gencode only.
- property type
The type of this transcript.
This represents the transcript_type field in most GFF3 annotations, which may differ from the parent gene’s type.
For GENCODE/Ensembl/NCBI RefSeq, the result can be any standard GENCODE biotype
For UCSC RefSeq, the result is either “protein_coding” or “non_coding”.
- Returns:
The type of this transcript, e.g. “protein_coding” or “non_coding”
- Return type:
- property level
The evidence level of this transcript.
This represents the level field in most GFF3 annotations, either 1, 2, or 3.
Available for gencode and ensembl only.
- property ccds_id
The CCDSID of this transcript’s coding sequence, if applicable.
Available for gencode only.
- property protein_id
The protein ID of this transcript, if applicable.
Available for gencode, ensembl, ucsc_refseq, and ncbi_refseq.
- property product
A description of the transcript’s product.
Available for ucsc_refseq and ncbi_refseq only.
- property gene
The parent gene of this transcript.
- Returns:
The parent gene for this transcript.
- Return type:
- property exons
The exons for this transcript.
- property introns
The introns for this transcript.
- property cdss
The coding sequences (CDSs) for this transcript.
- property utr5s
The UTR5 segments for this transcript.
- Returns:
A copy of the list of annotated UTR5 segments for this transcript.
- Return type:
list
ofUtr
TranscriptTable
- class TranscriptTable[source]
- find_5p_aligned(interval)[source]
Returns all transcripts that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[tran for tran in transcripts if tran.end5 == interval.end5]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All transcripts that have 5’ end aligned to the 5’ end of interval.
- Return type:
list
ofTranscript
- find_3p_aligned(interval)[source]
Returns all transcripts that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[tran for tran in transcripts if tran.end3 == interval.end3]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All transcripts that have 3’ end aligned to the 3’ end of interval.
- Return type:
list
ofTranscript
- find_5p_within(interval)[source]
Returns all transcripts that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[tran for tran in transcripts if tran.end5.expand(0, 1) in interval]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All transcripts that have 5’-most position falling entirely within interval.
- Return type:
list
ofTranscript
- find_3p_within(interval)[source]
Returns all transcripts that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[tran for tran in transcripts if tran.end3.expand(1, 0) in interval]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All transcripts that have 3’-most position falling entirely within interval.
- Return type:
list
ofTranscript
- find_within(interval)[source]
Returns all transcripts that fall entirely within interval.
This function is a faster version of the brute-force filter:
[tran for tran in transcripts if tran in interval]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval- Returns:
All transcripts that fall entirely within interval.
- Return type:
list
ofTranscript
- find_overlapping(interval)[source]
Returns all transcripts that overlap interval.
This function is a faster version of the brute-force filter:
[tran for tran in transcripts if tran.overlaps(interval)]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All transcripts that overlap interval.
- Return type:
list
ofTranscript
- find_exact(interval)[source]
Returns all transcripts that span exactly interval.
This function is a faster version of the brute-force filter:
[tran for tran in transcripts if tran.interval == interval]
The results will also be sorted by 5’ and then 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All transcripts that span exactly interval.
- Return type:
list
ofTranscript
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
- __getitem__(index)[source]
Lookup a transcript by index or ID string.
If index is a string the transcript with matching
id
is returned (by linear search):tran = transcripts['ENST00000233616'] # MOGS transcript
If the ID string has a version suffix (
'ENST00000233616.8'
) then the match must be exact.One can also iterate over all transcripts:
for tran in transcripts: ...
The results will be ordered as in the source file.
Exon
- class Exon[source]
An annotated exon.
Bases:
Interval
Each Exon object is equal only to itself. To build a set of unique exonic intervals, use the
interval
attribute:>>> from genome_kit import Genome >>> genome = Genome('gencode.v19') >>> exonic_intervals = set([exon.interval for exon in genome.exons])
- property interval
The interval spanned by this exon.
Note that
Exon
also inherits fromInterval
.- Returns:
The interval spanned by this exon.
- Return type:
- property id
The ID of this exon.
Available for gencode and ensembl only.
- property index
The index of this exon within the transcript.
- Returns:
The index of the exon number within the parent transcript (0-based).
- Return type:
- property transcript
The parent transcript of this exon.
A shorthand property tran is also available.
- Returns:
The parent transcript of this exon.
- Return type:
- property cds
The CDS on this exon, if any.
- property utr5
The UTR5 on this exon, if any.
- Returns:
The UTR5 contained within this exon, if any.
- Return type:
Utr5
|None
- property utr3
The UTR3 on this exon, if any.
- Returns:
The UTR3 contained within this exon, if any.
- Return type:
Utr3
|None
- property next_exon
The next exon in the transcript, if any.
- property prev_exon
The previous exon in the transcript, if any.
- property prev_intron
Returns the previous intron in the transcript, if any.
- property next_intron
Returns the next intron in the transcript, if any.
ExonTable
- class ExonTable[source]
- find_5p_aligned(interval)[source]
Returns all exons that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[exon for exon in exons if exon.end5 == interval.end5]
The results will also be sorted by 5’.
- find_3p_aligned(interval)[source]
Returns all exons that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[exon for exon in exons if exon.end3 == interval.end3]
The results will also be sorted by 3’.
- find_5p_within(interval)[source]
Returns all exons that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[exon for exon in exons if exon.end5.expand(0, 1) in interval]
The results will also be sorted by 5’.
- find_3p_within(interval)[source]
Returns all exons that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[exon for exon in exons if exon.end3.expand(1, 0) in interval]
The results will also be sorted by 3’.
- find_within(interval)[source]
Returns all exons that fall entirely within interval.
This function is a faster version of the brute-force filter:
[exon for exon in exons if exon in interval]
The results will also be sorted by 3’.
- find_overlapping(interval)[source]
Returns all exons that overlap interval.
This function is a faster version of the brute-force filter:
[exon for exon in exons if exon.overlaps(interval)]
The results will also be sorted by 5’.
- find_exact(interval)[source]
Returns all exons that span exactly interval.
This function is a faster version of the brute-force filter:
[exon for exon in exons if exon.interval == interval]
The results will also be sorted by 5’ and then 3’.
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
Intron
- class Intron[source]
An annotated intron.
Bases:
Interval
Each Intron object is equal only to itself. To build a set of unique intronic intervals, use the
interval
attribute:>>> from genome_kit import Genome >>> genome = Genome('gencode.v19') >>> intronic_intervals = set([intron.interval for intron in genome.introns])
- property interval
The interval spanned by this intron.
Note that
Intron
also inherits fromInterval
.- Returns:
The interval spanned by this intron.
- Return type:
- property index
The index of the intron within the transcript.
- Returns:
The index of the intron within the parent transcript (0-based).
- Return type:
- property transcript
The parent transcript of this intron.
A shorthand property tran is also available.
- Returns:
The parent transcript of this intron.
- Return type:
- property next_intron
The next intron in the transcript, if any.
- property prev_intron
The previous intron in the transcript, if any.
- property prev_exon
Returns the previous exon in the transcript.
- Returns:
The previous annotated exon on the parent transcript.
- Return type:
- property next_exon
Returns the next exon in the transcript.
- Returns:
The next annotated exon on the parent transcript.
- Return type:
IntronTable
- class IntronTable[source]
- find_5p_aligned(interval)[source]
Returns all introns that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[intr for intr in introns if intr.end5 == interval.end5]
The results will also be sorted by 5’.
- find_3p_aligned(interval)[source]
Returns all introns that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[intr for intr in introns if intr.end3 == interval.end3]
The results will also be sorted by 3’.
- find_5p_within(interval)[source]
Returns all introns that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[intr for intr in introns if intr.end5.expand(0, 1) in interval]
The results will also be sorted by 5’.
- find_3p_within(interval)[source]
Returns all introns that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[intr for intr in introns if intr.end3.expand(1, 0) in interval]
The results will also be sorted by 3’.
- find_within(interval)[source]
Returns all introns that fall entirely within interval.
This function is a faster version of the brute-force filter:
[intr for intr in introns if intr in interval]
The results will also be sorted by 3’.
- find_overlapping(interval)[source]
Returns all introns that overlap interval.
This function is a faster version of the brute-force filter:
[intr for intr in introns if intr.overlaps(interval)]
The results will also be sorted by 5’.
- find_exact(interval)[source]
Returns all introns that span exactly interval.
This function is a faster version of the brute-force filter:
[intr for intr in introns if intr.interval == interval]
The results will also be sorted by 5’ and then 3’.
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
Cds
- class Cds[source]
An annotated coding sequence (CDS) element within an exon.
Bases:
Interval
- property interval
The interval spanned by this CDS.
Note that
Cds
also inherits fromInterval
.- Returns:
The interval spanned by this CDS.
- Return type:
- property phase
The phase of this CDS’s start position.
The phase is the number of bases that should be trimmed from the 5’ end of the CDS element to reach the start of next codon:
phase 0 means the 5p end is in-frame
phase 1 means the 5p end is 1 bp away from being in-frame.
phase 2 means the 5p end is 2 bp away from being in-frame.
See the GFF3 documentation <http://gmod.org/wiki/GFF3> for a precise explanation.
The frame attribute in UCSC RefSeq is not available.
- Returns:
The phase this CDS, either 0, 1, or 2.
- Return type:
- property transcript
The parent transcript of this CDS.
A shorthand property tran is also available.
- Returns:
The parent transcript of this CDS.
- Return type:
- property next_cds
The next CDS in the transcript, if any.
- property prev_cds
The previous CDS in the transcript, if any.
CdsTable
- class CdsTable[source]
- find_5p_aligned(interval)[source]
Returns all CDSs that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[cds for cds in cdss if cds.end5 == interval.end5]
The results will also be sorted by 5’.
- find_3p_aligned(interval)[source]
Returns all CDSs that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[cds for cds in cdss if cds.end3 == interval.end3]
The results will also be sorted by 3’.
- find_5p_within(interval)[source]
Returns all CDSs that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[cds for cds in cdss if cds.end5.expand(0, 1) in interval]
The results will also be sorted by 5’.
- find_3p_within(interval)[source]
Returns all CDSs that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[cds for cds in cdss if cds.end3.expand(1, 0) in interval]
The results will also be sorted by 3’.
- find_within(interval)[source]
Returns all CDSs that fall entirely within interval.
This function is a faster version of the brute-force filter:
[cds for cds in cdss if cds in interval]
The results will also be sorted by 3’.
- find_overlapping(interval)[source]
Returns all CDSs that overlap interval.
This function is a faster version of the brute-force filter:
[cds for cds in cdss if cds.overlaps(interval)]
The results will also be sorted by 5’.
- find_exact(interval)[source]
Returns all CDSs that span exactly interval.
This function is a faster version of the brute-force filter:
[cds for cds in cdss if cds.interval == interval]
The results will also be sorted by 5’ and then 3’.
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
Read Alignments
ReadAlignments
- class ReadAlignments(infile)[source]
Access to read alignments.
- property junctions
Access to junctions inferred from read alignments
- Returns:
A table with optimized methods to query over all junctions by position.
- Return type:
- property alignments
Access to read alignments, i.e. a row in BAM file
- Returns:
A table with optimized methods to query over all read alignments by position.
- Return type:
- property matches
Access to aligned regions of read alignments
- Returns:
A table with optimized methods to query over all aligned regions by position.
- Return type:
- property variants
Access to variants observed by read alignments
- Returns:
A table where each element (each row) is a variant that was observed in at least one read alignment.
- Return type:
- property filename
The path to the file from which read alignments are retrieved.
- Returns:
The path to the file, e.g. “/mnt/data/sample000132.ralign”
- Return type:
- close()[source]
Close the file handle.
Note that if you use a with statement the file handle is automatically closed:
# Open the file with ReadAlignments('my_file.ralign') as raligns: ... # <-- File is now closed
- static ralign_version()[source]
The ralign file format version that this build supports.
Attempting to open an ralign file of a mismatched version will raise an
IOError
.- Returns:
The version number.
- Return type:
- static build_ralign(outfile, infiles, reference_genome, exclude=None, allow=None, include_duplicates=False, library_format='U')[source]
Build an ralign file from one or more SAM files.
If multiple input files are specified, their junctions and alignments will be pooled.
Once an ralign file is created, it can be opened by creating a
ReadAlignments
object.Alignments on the reference genome. All junction, alignment, match, and variant intervals in the resulting file are on the reference genome.
For example, an alignment starting at reference genome position 2 and with CIGAR string
1M1D1M2N1M
is matched to the reference genome as:0 1 2 3 4 5 6 7 8 9 <-- reference genome coordinates M D M - - M M <-- M = match, D = deletion
This results in one alignment (2:9), two matches (2:5, 7:9), one junction (5:7), and one variant (3:4).
Similarly for insertions, with CIGAR string
1M1I1M2N1M
aligns as:0 1 2 3 4 5 6 7 8 9 M M - - M M ^ I <-- I = insertion between
This results in one alignment (2:8), two matches (2:4, 6:8), one junction (4:6), and one variant (3:3).
- Parameters:
outfile (
str
) – The path to the destination rdist file.infiles (
list
ofstr
|file
) – The paths to the source SAM files, or a reference tosys.stdin
. Streaming lines from stdin is useful for reading directly from BAM viasamtools view -h
. Be sure to use-h
so that the reference genome can be inferred from the header lines.reference_genome (
Genome
) – The reference genome of the data ininfiles
. If the genome contains annotations, junction strands will be inferred and validated against the introns annotations.exclude (
list
ofInterval
, optional) – Junctions within these intervals will be excluded.allow (
list
ofInterval
, optional) – Only junctions within these intervals will be included, so long as they are not excluded.include_duplicates (
bool
, optional) – Includes duplicates when True. Defaults to False. See https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf subsection 1.4.2library_format (
str
, optional) – Specifies to infer junction strands via the library format string. “SF” implies the first read is the sense strand, whereas “SR” implies the second read is the sense strand. (the default is ‘U’, which represents unstranded [no strand inference]). See https://salmon.readthedocs.io/en/stable/library_type.html.
- pileup(interval, dtype=<class 'numpy.int32'>)[source]
Returns the DNA sequence pileup for body reads in this ReadAlignments as a 5-track. The pileup is variant-aware, so the track counts the alternate sequences, instead of the reference, on any alignment matches with variants.
Currently, only acgtACGT substitutions and deletions variants are supported.
- Parameters:
- Returns:
An array of
shape
(len(interval), 5) where the second dimension is indexed viaPILEUP_A
,PILEUP_C
,PILEUP_G
,PILEUP_T
, andPILEUP_DEL
.- Return type:
Junction
- class Junction[source]
A Junction represents a gap in a read alignment when compared to the reference genome.
Bases:
Interval
- property interval
The interval spanned by this junction.
Note that
Junction
also inherits fromInterval
.- Returns:
The interval spanned by this junction.
- Return type:
- property alignments
The read alignments across this junction.
JunctionTable
- class JunctionTable[source]
- find_5p_aligned(interval)[source]
Returns all junctions that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end5 == interval.end5]
- find_3p_aligned(interval)[source]
Returns all junctions that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end3 == interval.end3]
- find_5p_within(interval)[source]
Returns all junctions that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end5.expand(0, 1) in interval]
- find_3p_within(interval)[source]
Returns all junctions that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end3.expand(1, 0) in interval]
- find_within(interval)[source]
Returns all junctions that fall entirely within interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc in interval]
- find_overlapping(interval)[source]
Returns all junctions that overlap interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.overlaps(interval)]
- find_exact(interval)[source]
Returns all junctions that span exactly interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.interval == interval]
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
Alignment
- class Alignment[source]
A read alignment
Bases:
Interval
- property interval
The interval spanned by this read alignment.
Note that
Alignment
also inherits fromInterval
.- Returns:
The interval spanned by this read alignment.
- Return type:
- property matches
The matching regions of this read alignment
- Returns:
A list of matching regions
- Return type:
AlignmentTable
- class AlignmentTable[source]
- find_5p_aligned(interval)[source]
Returns all read alignments that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[read for read in reads if read.end5 == interval.end5]
- find_3p_aligned(interval)[source]
Returns all read alignments that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[read for read in reads if read.end3 == interval.end3]
- find_5p_within(interval)[source]
Returns all read alignments that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[read for read in reads if read.end5.expand(0, 1) in interval]
- find_3p_within(interval)[source]
Returns all read alignments that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[read for read in reads if read.end3.expand(1, 0) in interval]
- find_within(interval)[source]
Returns all read alignments that fall entirely within interval.
This function is a faster version of the brute-force filter:
[read for read in reads if read in interval]
- find_overlapping(interval)[source]
Returns all read alignments that overlap interval.
This function is a faster version of the brute-force filter:
[read for read in reads if read.overlaps(interval)]
- find_exact(interval)[source]
Returns all read alignments that span exactly interval.
This function is a faster version of the brute-force filter:
[read for read in reads if read.interval == interval]
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
AlignmentMatch
- class AlignmentMatch[source]
A matching region a read alignment
Bases:
Interval
- property interval
The interval spanned by the matching region of a read alignment.
Note that
AlignmentMatch
also inherits fromInterval
.- Returns:
The interval spanned by this alignment match.
- Return type:
- property variants
A list of variants observed in this matching region of an alignment.
AlignmentMatchTable
- class AlignmentMatchTable[source]
- find_5p_aligned(interval)[source]
Returns all alignment matches that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[match for match in matches if match.end5 == interval.end5]
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All alignment matches that have 5’ end aligned to the 5’ end of interval.
- Return type:
- find_3p_aligned(interval)[source]
Returns all alignment matches that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[match for match in matches if match.end3 == interval.end3]
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All alignment matches that have 3’ end aligned to the 3’ end of interval.
- Return type:
- find_5p_within(interval)[source]
Returns all alignment matches that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[match for match in matches if match.end5.expand(0, 1) in interval]
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All alignment matches that have 5’-most position falling entirely within interval.
- Return type:
- find_3p_within(interval)[source]
Returns all alignment matches that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[match for match in matches if match.end3.expand(1, 0) in interval]
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All alignments matches that have 3’-most position falling entirely within interval.
- Return type:
- find_within(interval)[source]
Returns all alignment matches that fall entirely within interval.
This function is a faster version of the brute-force filter:
[match for match in matches if match in interval]
- Parameters:
interval (
Interval
) – The query interval- Returns:
All alignment matches that fall entirely within interval.
- Return type:
- find_overlapping(interval)[source]
Returns all alignment matches that overlap interval.
This function is a faster version of the brute-force filter:
[match for match in matches if match.overlaps(interval)]
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All alignment matches that overlap interval.
- Return type:
- find_exact(interval)[source]
Returns all alignment matches that span exactly interval.
This function is a faster version of the brute-force filter:
[match for match in matches if match.interval == interval]
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All alignment matches that span exactly interval.
- Return type:
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
Junction Read Alignments
JReadAlignments
- class JReadAlignments(infile)[source]
Access to read alignments.
- __init__(infile)[source]
Open a .jralign file.
- Parameters:
infile (
str
) – The .jralign file to open.
- property junctions
Access to read alignments that are split across junctions.
- Returns:
An object that supports looping or queries over junctions, where each junction provides a read distribution.
- Return type:
Notes
The returned table. junctions, and child objects should not be accessed after this JReadAlignments has been closed via
close()
.
- property variants
Access to variants observed over all read alignments
- Returns:
A table where each element (each row) is a variant that was observed in at least one read alignment.
- Return type:
Notes
The returned table and variants should not be accessed after this JReadAlignments has been closed via
close()
.
- property filename
The path to the file from which read alignments are retrieved.
- Returns:
The path to the file, e.g. “/mnt/data/sample000132.jralign”
- Return type:
- close()[source]
Close the file handle.
Note that if you use a with statement the file handle is automatically closed:
# Open the file with JReadAlignments('my_file.jralign') as jraligns: ... # <-- File is now closed
Notes
Closing this JReadAlignments will invalidate all tables and objects previously accessed via
junctions()
andvariants()
.
- static jralign_version()[source]
The jralign file format version that this build supports.
Attempting to open an jralign file of a mismatched version will raise an
IOError
.- Returns:
The version number.
- Return type:
- static build_jralign(outfile, infiles, reference_genome, min_reads=0, min_overhang=0, exclude=None, allow=None, include_variants=False, include_duplicates=False, library_format='U', overhang_error='error')[source]
Build an jralign file from one or more SAM files.
If multiple input files are specified, their junctions and read alignments will be pooled.
Once an jralign file is created, it can be opened by creating a
JReadAlignments
object.Below are some details on how SAM alignments are processed, which are important for understanding how a CIGAR strings containing insertions/deletions (5M8D5M) or multiple splits (5M100N10M50N5M) are processed:
Multi-split alignments. If a read maps across multiple junctions, its CIGAR string will contain multiple
N
blocks. The read will then contribute a separate (left, right) alignment to each junction that it spans.For example, consider
1S2M2N1M1N1M
where the first match is aligned to start position 2:0 1 2 3 4 5 6 7 8 9 S M M - - M - M
This read will contribute a separate alignment to each junction. The left and right overhang is the sum of all alignment matches before/after that particular junction. In the above example, the resulting alignments would be be as if two separate reads were aligned as follows:
0 1 2 3 4 5 6 7 8 9 S M M - - M M <-- alignment for 1st junction S M M M - M <-- alignment for 2nd junction
In this scenario, the read would not actually match the reference genome in the above alignments, but this scheme facilitates counting of relative junction usage (assuming usage of the two junctions is independent) and also is allows positional bootstrap to be applied correctly without modification.
Overhang filter. The
min_overhang
filter is applied on a per-junction bases. For example, ifmin_overhang=5
then a read mapped across two junctions20M100N10M100N3M
will contribute an alignment only to the first, because the second has a right overhang of 3.Counting insertions/deletions. Only alignment matches are counted. Other CIGAR types (e.g., insertions, deletions, clippings) are not relevant when filtering for alignment quality.
For example, a read starting at reference genome position 2 and with CIGAR string
1M1D1M2N1M
is matched to the reference genome as:0 1 2 3 4 5 6 7 8 9 <-- reference genome coordinates M D M - - M <-- M = match, D = deletion
so the start of the junction is 2+1+1+1 = 5, but the the left overhang is only 2, because the deleted base doesn’t exist in the private genome.
Similarly for insertions, with CIGAR string
1M1I1M2N1M
:0 1 2 3 4 5 6 7 8 9 M M - - M ^ I <-- I = insertion between the start of the junction is 2+1+1 = 4, but the left overhang is actually 2, because the inserted base exists only in the private genome.
- Parameters:
outfile (
str
) – The path to the destination rdist file.infiles (
list
ofstr
|file
) – The paths to the source SAM files, or a reference tosys.stdin
. Streaming lines from stdin is useful for reading directly from BAM viasamtools view -h
. Be sure to use-h
so that the reference genome can be inferred from the header lines.reference_genome (
Genome
) – The reference genome of the data ininfiles
. If the genome contains annotations, junction strands will be inferred and validated against the introns annotations.min_reads (
int
) – Optional. The minimum number of reads across a junction to be considered for inclusion in the output file.min_overhang (
int
) – Optional. The minimum overhang for a read to be included.exclude (
list
ofInterval
) – Optional. Junctions within these intervals will be excluded.allow (
list
ofInterval
) – Optional. Only junctions within these intervals will be included, so long as they are not excluded.include_variants (:py:class`bool`) – Optional. True if variants are to be included in the output file. Default will exclude all variants.
include_duplicates (
bool
, optional) – Includes duplicates when True. Defaults to False. See https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf subsection 1.4.2library_format (
str
, optional) – Specifies to infer junction strands via the library format string. “SF” implies the first read is the sense strand, whereas “SR” implies the second read is the sense strand. (the default is ‘U’, which represents unstranded [no strand inference]). See https://salmon.readthedocs.io/en/stable/library_type.html.overhang_error (:
str
) – Determines how build_jralign handles reads with >255 length overhangs. “error” stops and reports the error. “clamp” limits the read to the maximum 255 positions on each side. This only affects overhangs and will not clip or filter variants.
JunctionReadAlignmentsTable
- class JunctionReadAlignmentsTable[source]
- find_5p_aligned(interval)[source]
Returns all junctions that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end5 == interval.end5]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that have 5’ end aligned to the 5’ end of interval.
- Return type:
- find_3p_aligned(interval)[source]
Returns all junctions that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end3 == interval.end3]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that have 3’ end aligned to the 3’ end of interval.
- Return type:
- find_5p_within(interval)[source]
Returns all junctions that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end5.expand(0, 1) in interval]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that have 5’-most position falling entirely within interval.
- Return type:
- find_3p_within(interval)[source]
Returns all junctions that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end3.expand(1, 0) in interval]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that have 3’-most position falling entirely within interval.
- Return type:
- find_within(interval)[source]
Returns all junctions that fall entirely within interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc in interval]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that fall entirely within interval.
- Return type:
- find_overlapping(interval)[source]
Returns all junctions that overlap interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.overlaps(interval)]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that overlap interval.
- Return type:
- find_exact(interval)[source]
Returns all junctions that span exactly interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.interval == interval]
The results will also be sorted by 5’ and then 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that span exactly interval.
- Return type:
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
- __getitem__(index)[source]
Access to all junctions.
Allows iteration over all junctions in the read distribution table:
for junc in junctions: ...
The results will be ordered as in the source file.
- Parameters:
index (
int
) – The integer index of the requested junction.- Returns:
The junction read alignment object identified by the given index.
- Return type:
JunctionReadAlignments
- class JunctionReadAlignments[source]
A set of read alignments across a junction.
Bases:
Interval
- property interval
The interval spanned by this junction.
Note: All junction intervals are on the reference strand, even if they comprise reads that were mapped to the reverse strand. When querying junctions, your query intervals must therefore (currently) be on the positive strand.
Note that
JunctionReadAlignments
also inherits fromInterval
.- Returns:
The interval spanned by this junction.
- Return type:
- property num_reads
The number of reads across this junction.
- Returns:
The number of reads across this junction, equivalent to the len of this object.
- Return type:
- __getitem__(index)[source]
Access to a junction read alignment.
Allows iteration over all read alignments across this junction:
for align in junction: ...
- Parameters:
index (
int
) – The integer index of the requested alignment entry.- Returns:
The junction read alignment identified by the given index.
- Return type:
JunctionReadAlignment
- class JunctionReadAlignment[source]
A junction read alignment.
The left and right overhangs are positive integers representing the number of matched positions on either side of the junction.
Overhangs are counted on the private genome from which the data was sequenced, and may not map to reference genome coordinates. As such, one should not expect meaningful reference genome coordinates by adding overhangs to the junction start/end.
If strand is positive, then the read started left positions before the first intronic position on the private genome. If strand is negative, then the read started right positions after the last intronic position on the private genome.
- property left
The left overhang relative to the junction.
- Returns:
The number of matched positions left of the junction.
- Return type:
- property right
The right overhang relative to the junction.
- Returns:
The number of matched positions right of the junction.
- Return type:
- property strand
The strand (“+” or “-”) that this read was aligned to.
Not to be confused with the strand of the junction itself, which may or may not have been inferred.
- Returns:
The strand the read was aligned to by the original aligner.
- Return type:
chr
- property num_variants
The number of variants observed in this read alignment.
- Returns:
The number of variants in the alignment.
- Return type:
ReadDistributions
- class ReadDistributions(infile)[source]
Access to read distributions.
- property junctions
Access to read distributions that are split across junctions.
- Returns:
An object that supports looping or queries over junctions, where each junction provides a read distribution.
- Return type:
- property filename
The path to the file from which read distributions are retrieved.
- Returns:
The path to the file, e.g. “/mnt/data/sample000132.rdist”
- Return type:
- close()[source]
Close the file handle.
Note that if you use a with statement the file handle is automatically closed:
# Open the file with ReadDistributions('my_file.ralign') as raligns: ... # <-- File is now closed
- static rdist_version()[source]
The rdist file format version that this build supports.
Attempting to open an rdist file of a mismatched version will raise an
IOError
.- Returns:
The version number.
- Return type:
- static build_rdist(outfile, infiles, min_reads=0, min_overhang=0, exclude=None, allow=None, pass_filter=None)[source]
Build an rdist file from one or more jralign files.
If multiple input files are specified, their read counts will be pooled.
Once an rdist file is created, it can be opened by creating a
ReadDistributions
object .- Parameters:
outfile (
str
) – The path to the destination rdist file.infiles (
list
ofstr
) – The paths to the source jralign files.min_reads (
int
) – Optional. The minimum number of reads across a junction to be considered for inclusion in the output file.min_overhang (
int
) – Optional. The minimum overhang for a read to be included in a count.exclude (
list
ofInterval
) – Optional. Junctions within these intervals will be excluded.allow (
list
ofInterval
) – Optional. Only junctions within these intervals will be included, so long as they are not excluded.pass_filter (Callable[[str, int, int], bool], optional) –
A pass filter to be applied while iterating over the (infile, junction_index, read_index). For example:
# downsample to 10% pass_filter=lambda _, _, _: np.random.binomial(1, 0.1) != 0 # only take reads that have variants jraligns = {x: JReadAlignments(x) for x in infiles} pass_filter=lambda infile, junction, read: jraligns[infile].junctions[junction][read].num_variants != 0 ... for x in jraligns.values(): x.close()
- static from_jraligns(infiles, min_reads=0, min_overhang=0, exclude=[], allow=[], pass_filter=None, cache=True, outfile=None)[source]
Returns a
ReadDistributions
object opened from a set of jralign files.This convenience function builds and then returns ReadDistributions in one step, by default the binary file (.rdist) is adjacent to the first input file.
It is intended to be used as follows:
rdist = ReadDistributions.from_jraligns(["input.jralign"]) junctions = rdist.find_within(brca1_interval) ...
If any of the filtering parameters or infiles contents change, cache should be set to False to regenerate the ReadDistribution.
See also
- Parameters:
infiles (
list
ofstr
) – The paths to the source jralign files.min_reads (
int
) – Optional. The minimum number of reads across a junction to be considered for inclusion in the output file.min_overhang (
int
) – Optional. The minimum overhang for a read to be included in a count.exclude (
list
ofInterval
) – Optional. Junctions within these intervals will be excluded.allow (
list
ofInterval
) – Optional. Only junctions within these intervals will be included, so long as they are not excluded.pass_filter (Callable[[str, int, int], bool], optional) –
A pass filter to be applied while iterating over the (infile, junction_index, read_index). For example:
# downsample to 10% pass_filter=lambda _, _, _: np.random.binomial(1, 0.1) != 0 # only take reads that have variants jraligns = {x: JReadAlignments(x) for x in infiles} pass_filter=lambda infile, junction, read: jraligns[infile].junctions[junction][read].num_variants != 0 ... for x in jraligns.values(): x.close()
cache (
bool
, optional) – Whether to re-use the binary or to re-generate it from infiles.outfile (
str
, optional) – The path to the destination rdist file.
- Returns:
The opened rdist file.
- Return type:
JunctionReadDistributionTable
- class JunctionReadDistributionTable[source]
- find_5p_aligned(interval)[source]
Returns all junctions that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end5 == interval.end5]
- Parameters:
interval (
Interval
) – The query interval.5'. (The results will also be sorted by)
- Returns:
All junctions that have 5’ end aligned to the 5’ end of interval.
- Return type:
- find_3p_aligned(interval)[source]
Returns all junctions that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end3 == interval.end3]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that have 3’ end aligned to the 3’ end of interval.
- Return type:
- find_5p_within(interval)[source]
Returns all junctions that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end5.expand(0, 1) in interval]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that have 5’-most position falling entirely within interval.
- Return type:
- find_3p_within(interval)[source]
Returns all junctions that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.end3.expand(1, 0) in interval]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that have 3’-most position falling entirely within interval.
- Return type:
- find_within(interval)[source]
Returns all junctions that fall entirely within interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc in interval]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that fall entirely within interval.
- Return type:
- find_overlapping(interval)[source]
Returns all junctions that overlap interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.overlaps(interval)]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that overlap interval.
- Return type:
- find_exact(interval)[source]
Returns all junctions that span exactly interval.
This function is a faster version of the brute-force filter:
[junc for junc in junctions if junc.interval == interval]
The results will also be sorted by 5’ and then 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All junctions that span exactly interval.
- Return type:
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
- __getitem__(index)[source]
Access to all junctions.
Allows iteration over all junctions in the read distribution table:
for junc in junctions: ...
The results will be ordered as in the source file.
- Parameters:
index (
int
) – The integer index of the requested junction.- Returns:
The junction read distribution object identified by the given index.
- Return type:
JunctionReadDistribution
- class JunctionReadDistribution[source]
A read distribution across a junction. A junction read distribution comprises a list of non-zero counts across the junction, where the shift and strand of each count is preserved.
Each read mapping across the junction corresponds to two counts: one for the part mapping upstream of the junction and another for the part mapping downstream of it.
Bases:
Interval
- property interval
The interval spanned by this junction.
Note: All junction intervals are on the reference strand, even if they comprise reads that were mapped to the reverse strand. When querying junctions, your query intervals must therefore (currently) be on the positive strand.
Note that
JunctionReadDistribution
also inherits fromInterval
.- Returns:
The interval spanned by this junction.
- Return type:
- property num_counts
The number of counts on this junction.
- Returns:
The number of unique non-zero counts stored on this junction, equivalent to the len of this object.
- Return type:
- property num_reads
The number of reads across this junction.
- Returns:
The number of reads across this junction, equal to the summing all counts and dividing by two.
- Return type:
JunctionReadCount
- class JunctionReadCount[source]
A junction read count at a specific shift from the junction.
- property strand
The strand that this count refers to. :returns: The strand that this count refers to. :rtype:
str
- property shift
The number of aligned positions between the far end (start/ end) of the alignment and the junction.
If negative, the count refers to an alignment upstream of the junction interval (always defined on the positive strand). Does not count insertions, deletions, clippings, etc.
- Returns:
The shift relative to the junction.
- Return type:
VCF Tables
VCFTable
- class VCFTable(infile)[source]
Open a
.vcfbin
file for querying.See the VCF section of Quickstart for examples of how to use this class, or see
demos/query_vcf.py
.To convert from
.vcf[.gz]
seebuild_vcfbin()
.- Parameters:
infile (
str
) – The path to the.vcfbin
file.
- static from_vcf(vcfpath: str, reference_genome: Genome, *args, normalize: bool | Literal['EBI', 'UCSC'] = False, cache: bool = True, vcfbinpath: str | None = None, **kwargs) VCFTable [source]
Returns a
VCFTable
object opened to a VCF file.The function internally converts the VCF to a binary format and then returns a VCFTable object for fast access to the variants and fields within. By default the binary file (.vcf.bin) will be written adjacent to the input file.
It is intended to be used as follows:
def open_clinvar(vcfpath): return VCFTable.from_vcf(vcfpath, "hg19", info_ids=["AF_EXAC", "CLNSIG"]) vcf = open_clinvar("clinvar_20180128.vcf.gz") variants = vcf.find_within(brca1_interval) ...
If the input VCF contains multi-allelic sites, then use normalize to automatically split them on conversion. This step requires samtools and bcftools to be installed.
See also
- Parameters:
vcfpath – Path to input .vcf or `.vcf.gz file.
reference_genome – The reference genome (e.g. “hg19”). Forwarded to
build_vcfbin()
.args – Parameters forwarded to
build_vcfbin()
.normalize – Whether to normalize the VCF during conversion (if applicable). If True, bcftools is used for normalization of all the EBI (Ensembl) chromosomes; alternatively, EBI or UCSC can be used to specify the naming scheme used by the VCF (eg. GTEx uses the UCSC format). Unsupported on Windows due to bcftools and samtools lack of support.
cache – Whether to re-use the binary or to re-generate it from vcfpath.
vcfbinpath – Path to the intermediate binary. Default is adjacent to vcfpath.
kwargs – Parameters forwarded to
build_vcfbin()
.
- Returns:
The opened VCF file.
- Return type:
- find_5p_aligned(interval)[source]
Returns all variants that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[variant for variant in vcf if variant.end5 == interval.end5]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All variants that have 5’ end aligned to the 5’ end of interval.
- Return type:
list
ofVCFVariant
- find_3p_aligned(interval)[source]
Returns all variants that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[variant for variant in vcf if variant.end3 == interval.end3]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All variants that have 3’ end aligned to the 3’ end of interval.
- Return type:
list
ofVCFVariant
- find_5p_within(interval)[source]
Returns all variants that have 5’ end within interval.
This function is a faster version of the brute-force filter:
[variant for variant in vcf if variant.end5 in interval]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All variants that have 5’ end falling entirely within interval.
- Return type:
list
ofVCFVariant
- find_3p_within(interval)[source]
Returns all variants that have 3’ end within interval.
This function is a faster version of the brute-force filter:
[variant for variant in vcf if variant.end3 in interval]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All variants that have 3’ end falling entirely within interval.
- Return type:
list
ofVCFVariant
- find_within(interval)[source]
Returns all variants that fall entirely within interval.
This function is a faster version of the brute-force filter:
[variant for variant in vcf if variant in interval]
The results will also be sorted by 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All variants that fall entirely within interval.
- Return type:
list
ofVCFVariant
- find_overlapping(interval)[source]
Returns all variants that overlap interval.
This function is a faster version of the brute-force filter:
[variant for variant in vcf if variant.overlaps(interval)]
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All variants that overlap interval.
- Return type:
list
ofVCFVariant
- find_exact(interval)[source]
Returns all variants that span exactly interval.
This function is a faster version of the brute-force filter:
[variant for variant in vcf if variant.interval == interval]
The results will also be sorted by 5’ and then 3’.
- Parameters:
interval (
Interval
) – The query interval.- Returns:
All variants that span exactly interval.
- Return type:
list
ofVCFVariant
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
- __getitem__(index)[source]
Access to all variants.
Allows iteration over all variants:
for variant in variants: ...
The results will be ordered as in the source file.
- Parameters:
index (
int
) – The integer index of the requested variant.- Returns:
The variant object identified by the given index.
- Return type:
Notes
The VCFVariant will be invalidated after this table has been closed via
close()
or awith
statement.
- where(mask)[source]
Filter variants by numpy mask.
Fast extraction of table elements based on a mask:
# Intended to be used like this. variants = vcf.where(mask) # It is faster but equivalent to this. variants = [vcf[i] for i in np.where(mask)[0]]
- Parameters:
mask (
ndarray
) – A boolean mask the same size as the table.- Returns:
A list of variants selected by the mask.
- Return type:
list
ofVCFVariant
- masked(mask)[source]
Return a new VCFTable which is filtered by a numpy mask
For example, to do a random downsample:
downsampled = vcf.masked(np.random.binomial(1, 0.1, len(vcf)).astype(bool))
Or to subset by only ABCA4 genes:
abca4_vcf = vcf.masked(vcf.index_of(v) for v in vcf.find_overlapping(gene))
Notes
The masked VCFTable does not share resources with the original, so
close()
should be called on both instances when required.- Parameters:
mask (
ndarray
ofbool
orndarray
ofint
) – A boolean mask the same size as the table or a list of indices.- Returns:
A list of variants selected by the mask.
- Return type:
list
ofVCFVariant
- GT_HOMOZYGOUS_REF = 0
Constant indicating a
0/0
or0|0
in GT format value.
- GT_HETEROZYGOUS_UNPHASED = 1
Constant indicating a
0/1
in GT format value.
- GT_HOMOZYGOUS_ALT = 2
Constant indicating a
1/1
or1|1
in GT format value.
- GT_UNKNOWN = 3
Constant indicating a
?/?
or?|?
in GT format value.
- GT_HETEROZYGOUS_PHASED_0_1 = 4
Constant indicating a
0|1
in GT format value.
- GT_HETEROZYGOUS_PHASED_1_0 = 5
Constant indicating a
1|0
in GT format value.
- SVTYPE_NA = 0
Constant indicating the SVTYPE was missing.
- property num_samples
The number of samples represented in the VCF.
- Returns:
Number of samples represented in the VCF file, for which FORMAT data was provided.
- Return type:
- property sample_names
The names of the samples represented in the VCF if it was encoded with format IDs.
- property info_ids
The INFO fields available for
info()
that were encoded inbuild_vcfbin()
orfrom_vcf()
.
- info(info_id)[source]
The full per-variant data array associated with an INFO field. String INFO fields are accessed by
VCFVariant
attribute, eg.variant.CIGAR
.- Parameters:
info_id (
str
) – The INFO ID from the VCF, such as AD or DP.- Returns:
A complete array of INFO data values associated with info_id.
If there is one value per variant, the results is 1-dimensional:
INFO DP=1 DP=4 [1 4]
If there are multiple values per variant, the results is 2-dimensional:
INFO AD=0,1 AD=2,2 [[0 1] [2 2]]
- Return type:
- Raises:
KeyError – Raised if info_id is unrecognized, because it was not in the original VCF or included by
build_vcfbin()
TypeError – Raised if info_id refers to a String valued column. String values are accessed by
VCFVariant
attribute.
- property format_ids
The INFO fields available for
format()
that were encoded inbuild_vcfbin()
orfrom_vcf()
.
- format(format_id)[source]
The full per-variant per-sample data matrix associated with a FORMAT field.
- Parameters:
format_id (
str
) – The FORMAT ID from the VCF, such as AD or GT.- Returns:
A complete array of FORMAT data values associated with format_id.
If there is one value per variant and per sample, the results is 2-dimensional:
sample1 sample2 sample3 1 2 3 4 5 6 [[1 2 3] [4 5 6]]
If there are multiple values per variant and per sample, the results is 3-dimensional:
sample1 sample2 sample3 1,2 3,4 5,6 7,8 9,10 11,12 [[[1 2] [3 4] [5 6 ]] [[7 8] [9 10] [11 12]]]
- Return type:
- Raises:
KeyError – Raised if format_id is unrecognized, either because it was not in the original VCF or included by
build_vcfbin()
.
- static build_vcfbin(outfile, infile, reference_genome, info_ids=None, fmt_ids=None, validate=True, exclude=None, allow=None, ancestral='error')[source]
Builds a binary
.vcfbin
file from a.vcf[.gz]
.The resulting file can be opened by
VCFTable
.VCF normalization. The input VCF should be normalized, and multi-allelic sites must be split. Consider running infile through
bcftools norm -m - -f
. Examples:REF ALT CT CAT # FAIL (not normalized) A # OK . A # OK - A # OK C CA # OK C A,G # FAIL (multi-allelic)
Extracting INFO/FORMAT columns. By default, INFO and FORMAT columns such as DP and AD are omitted. Use info_ids and fmt_ids arguments to specify which columns you want included:
["GT", "AD"] # Keep GT and AD (as int8 and int32 respectively) {"GT" : None, "AD" : None} # Same {"GT" : None, "AD" : -1} # Same, but store -1 for any missing AD value {"GT" : None, "AD" : np.int16(0) } # Keep GT and AD (as int8 and int16 respectively)
The GT (genotype) FORMAT and SVTYPE (structural variant type) INFO columns are special. They default to
int8
, and the string values are stored as integer constants:*GT*
0/0
and0|0
are stored asVCFTable.GT_HOMOZYGOUS_REF
(0)0/1
is stored asVCFTable.GT_HETEROZYGOUS_UNPHASED
(1)1/1
and1|1
are stored asVCFTable.GT_HOMOZYGOUS_ALT
(2)?/?
and?|?
are stored asVCFTable.GT_UNKNOWN
(3)0|1
is stored asVCFTable.GT_HETEROZYGOUS_PHASED_0_1
(4)1|0
is stored asVCFTable.GT_HETEROZYGOUS_PHASED_1_0
(5)
*SVTYPE*
A missing SVTYPE is represented as
VCFTable.SVTYPE_NA
(0)DEL
is stored asVCFTable.SVTYPE_DEL
(1)INS
is stored asVCFTable.SVTYPE_INS
(2)DUP
is stored asVCFTable.SVTYPE_DUP
(3)INV
is stored asVCFTable.SVTYPE_INV
(4)CNV
is stored asVCFTable.SVTYPE_CNV
(5)BND
is stored asVCFTable.SVTYPE_BND
(6)
The ID INFO column is stored as an INFO ID. If it is required (e.g., you need to find a MATEID pair), pass “ID” into info_ids.
Missing values are also treated differently: GT values must be specified as a diploid. If a single missing allele (e.g.
0/.
,./0
,1/.
, or./1
) should be encoded as a possibleVCFTable.GT_HETEROZYGOUS_UNPHASED
instead ofVCFTable.GT_UNKNOWN
, the missing value should be set toVCFTable.GT_HETEROZYGOUS_UNPHASED
../.
will still be interpreted asVCFTable.GT_UNKNOWN
:{"GT" : VCFTable.GT_HETEROZYGOUS_UNPHASED}
For phased values, the default value is ignored. For example,
.|1
will be encoded asVCFTable.GT_UNKNOWN
.All other columns default to
int32
,float32
, orstr
as defined by the VCF Type header (can override bitwidth for performance or to save disk space.) For integer/float columns, only constant value dimensions are currently supported. By default, a missing value is substituted with 0, 0.0, or “”. If missing integer values must be distinguishable, consider using-(2**31)
to(2**31+7)
as these are reserved by VCF.String-valued columns String values can be accessed by attribute (
variant.CIGAR
). Unlike numeric columns, a string column cannot be accessed as a numpy array.Caveats: Multi-dimensional values are not split, and ASCII-encoded VCF values, eg.
%3A
for:
, will not be decoded..
are not decoded as empty strings unless it is the sole value.Value multiplicity. Each INFO/FORMAT field has an expected number of values per sample, following the VCF 4.2 Specification sections 1.4.2 and 1.6.2. For example:
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
Here, a FORMAT value of
0/1:5,8
for a single sample would be valid because we expect one GT value per (Number=1
) and two AD values (Number=R
).FORMAT values with multiple samples or INFO/FORMAT values that are multidimensional can only be accessed via
info()
orformat()
and cannot be accessed byVCFVariant
attributes.Beware VCF headers that incorrectly specify
Number=.
instead ofR
, as their multi-allelic variants will be split incorrectly bybcftools
.- Parameters:
outfile (
str
) – The binary output file. This file can then be opened byVCFTable
.infile (
str
) – The path to the source.vcf[.gz]
file.reference_genome (
Genome
) – The reference genome that infile uses as#reference
field. The parameter is not validated against the VCF, so it is up to the caller to ensure it is correct. For example, a#reference=b37.fa
or=NCBI37
is a match to"hg19"
.info_ids (
list
ofstr
|dict
ofstr
toobject
, optional) – Controls how INFO columns are preserved in outfile. See above.fmt_ids (
list
ofstr
|dict
ofstr
toobject
, optional) – Controls how FORMAT columns are preserved in outfile. See above.validate (
bool
, optional) – Whether to validate REF sequences against the reference genome. Default is True.exclude (
list
ofInterval
) – Optional. Variants within these intervals will be excluded.allow (
list
ofInterval
) – Optional. Only variants within these intervals will be included, so long as they are not excluded.ancestral (
str
) – Behaviour when encountering ancestral alleles. Valid values are “error”, “warn”, and “exclude”. Defaults to “error”.
- Raises:
Under the following conditions:
The VCF is malformed.
A row in infile does not contain every FORMAT ID specified by fmt_ids.
An INFO/FORMAT ID is unsupported.
A REF or ALT sequence contained a VCF non-standard nucleotide code.
A REF sequence failed to validate.
An INFO/FORMAT field type is incompatible with the VCF Type header definition.
- static vcfbin_version()[source]
The vcfbin file format version that this build supports.
Attempting to open an vcfbin file of a mismatched version will raise an
IOError
.- Returns:
The version number.
- Return type:
- sequence_variations(interval)[source]
Find variants that could affect the given interval’s DNA sequence. When specified with an anchor, deletion variants will increase the interval’s span; this can be overestimated (due to overlapping deletions), but the spurious variants will be filtered during DNA sequence retrieval.
The results will also be sorted by 5’.
- Parameters:
interval (
Interval
) – The dna sequence interval. Anchoring will expand the interval when deletion variants are encountered.- Returns:
A list of variants that possibly affect the given interval’s DNA sequence.
- Return type:
list
ofVCFVariant
- close()[source]
Close the file handle.
Note that if you use a with statement the file handle is automatically closed:
# Open the file with VCFTable('foo.vcfbin') as vcfbin: ... # <-- File is now closed
Notes
Closing this table will invalidate all VCFVariants previously accessed via __getitem__.
- property filename
The path to the file from which FORMAT field matrices are retrieved.
- Returns:
The path to the VCF file.
- Return type:
VCFVariant
- class VCFVariant[source]
A VCF variant. Scalar INFO/FORMAT fields can be accessed as a dynamic attribute, eg.
variant.AD
.Bases:
Variant
- as_variant_string()[source]
Returns a string representation of the variant.
See also
- Returns:
The variant in the format
"chromosome:start:ref:alt"
, where start is in DNA1 coordinates.- Return type:
- __str__()
Returns a string representation of the variant.
See also
- Returns:
The variant in the format
"chromosome:start:ref:alt"
, where start is in DNA1 coordinates.- Return type:
VariantTable
- class VariantTable[source]
- find_5p_aligned(interval)[source]
Returns all variants that have 5’ end aligned with 5’ end of interval.
This function is a faster version of the brute-force filter:
[v for v in variants if v.end5 == interval.end5]
The results will also be sorted by 5’.
- find_3p_aligned(interval)[source]
Returns all variants that have 3’ end aligned with 3’ end of interval.
This function is a faster version of the brute-force filter:
[v for v in variants if v.end3 == interval.end3]
The results will also be sorted by 3’.
- find_5p_within(interval)[source]
Returns all variants that have 5’-most position within interval.
This function is a faster version of the brute-force filter:
[v for v in variants if v.end5.expand(0, 1) in interval]
The results will also be sorted by 5’.
- find_3p_within(interval)[source]
Returns all variants that have 3’-most position within interval.
This function is a faster version of the brute-force filter:
[v for v in variants if v.end3.expand(1, 0) in interval]
The results will also be sorted by 3’.
- find_within(interval)[source]
Returns all variants that fall entirely within interval.
This function is a faster version of the brute-force filter:
[v for v in variants if v in interval]
The results will also be sorted by 3’.
- find_overlapping(interval)[source]
Returns all variants that overlap interval.
This function is a faster version of the brute-force filter:
[v for v in variants if v.overlaps(interval)]
The results will also be sorted by 5’.
- find_exact(interval)[source]
Returns all variants that span exactly interval.
This function is a faster version of the brute-force filter:
[v for v in variants if v.interval == interval]
The results will also be sorted by 5’ and then 3’.
- property stranded
If True then the strand is significant when calling the find_x methods.
- Returns:
Whether this table can contain negative stranded intervals.
- Return type:
- __getitem__(index)[source]
Access to all variants.
Allows iteration over all variants in the table:
for variant in table: ...
The results will be ordered as in the source file.
Data Access
genome_kit.DataManager
- class DataManager(data_dir: str)[source]
- __init__(data_dir: str)[source]
Implementations should accept a data directory. :param data_dir: location where local files are cached
- get_file(filename: str) str [source]
Get path to a managed data file and download if necessary.
- Parameters:
filename – A single filename, without a path.
- Return type:
The path to the local version of the file.
- upload_file(filepath: str, filename: str, metadata: Dict[str, str] = None)[source]
Upload a file to make it available in GenomeKit.
- Parameters:
filepath – Full path to the file to be uploaded.
filename – The remote filename.
metadata – Extra metadata to attach to the file.
- __weakref__
list of weak references to the object
genome_kit.DefaultDataManager
- class DefaultDataManager(data_dir: str, bucket_name='genomekit-data-public', require_auth=False)[source]
A minimal data manager implementation that retrieves files from a S3 bucket. When using the default bucket, uploads are only supported for GenomeKit maintainers.
- __init__(data_dir: str, bucket_name='genomekit-data-public', require_auth=False)[source]
- Parameters:
data_dir – location where local files are cached
bucket_name – S3 bucket. By default, use the bucket sponsored by AWS Open Data
require_auth – set True for non-public buckets
- get_file(filename: str) str [source]
Get path to a managed data file and download if necessary.
- Parameters:
filename – A single filename, without a path.
- Return type:
The path to the local version of the file.
genome_kit.GCSDataManager
- class GCSDataManager(data_dir: str, bucket_name: str)[source]
A minimal data manager implementation that retrieves files from a GCS bucket.
- __init__(data_dir: str, bucket_name: str)[source]
- Parameters:
data_dir – location where local files are cached
bucket_name – GCS bucket
- get_file(filename: str) str [source]
Get path to a managed data file and download if necessary.
- Parameters:
filename – A single filename, without a path.
- Return type:
The path to the local version of the file.
Internals
genome_kit.gk_data
This module automatically fetches GenomeKit data.
A user should first use upload_file()
to upload a
file to make it available other GenomeKit users.
After the upload, the file can be used by any GenomeKit function via
get_file()
, which will download the file on demand
and return its local path.
Example
>>> upload_file('/local/path/hg38.2bit', 'hg38.2bit')
>>> get_file('hg38.2bit')
"/Users/example/Application Support/genome_kit/hg38.2bit"
- get_file(filename)[source]
compatibility wrapper for
get_file()
- upload_file(filepath: str, filename: str, metadata: Dict[str, str] = None)[source]
compatibility wrapper for
upload_file()
- wget(url, dst=None, timestamping=False, progress=False)[source]
Download a file from the URL.
If dst is unspecified, the file will be downloaded to the system temp directory and given the same name.
If the last-modified time of the remote file is available, it will be assigned to the downloaded file.
- Parameters:
url (
str
) – The URL of the source file.dst (
str
) – The path to the local destination.timestamping (
bool
) – If true, only download when last-modified date if the remote file is newer than the local file. If the server does not provide the last-modified date, the local file will be kept by default.progress (
bool
) – If true, show a progress bar.
- Returns:
The path to the local version of the file.
- Return type:
- Raises:
IOError – There was an error opening dst.
HTTPError – There was an error opening the URL.
socket.timeout – The connection timed out or was interrupted.
genome_kit.data_manager
- class ProgressPercentage(filename, filesize=None, progress_msg='Progress')[source]
Display a progress bar on a text terminal showing file being transferred, completion percentage, estimated transfer time and throughput.
`Progress "myfile.bin" [ 20.0 B/ 20.0 B] 100% |###############| ETA: 0:00:00 147.8 B/s`
- __weakref__
list of weak references to the object
genome_kit._twobitutils
genome_kit._apply_variants
- apply_variants(sequence, variants, interval, reference_alignment=False)[source]
Apply variants to a sequence interval.
This function ignores the interval’s strand attribute and always returns the sequence on the reference strand.
genome_kit._cxx_util
- mock(arg)[source]
Marks a class, attribute, or method as being a ‘mock’.
When Python-defined class A inherits from C++-defined class B, the attributes and methods B are only partially visible to the IDE.
The mock function allows the definition of A to mock the attributes and methods of B, making them accessible the IDE (for autocomplete) and to automatic doc extraction (e.g. sphinx).
For example, the C++-defined
genome_kit._cxx.GeneTable
implements __getitem__, which returns an instance of the Python-defined typegenome_kit.Gene
. The Python-defined classgenome_kit.GeneTable
is defined as:@_cxx.register class GeneTable(_cxx.GeneTable): @mock def __getitem__(self, index): "Returns a Gene object" return mock_result(Gene)
The IDE and autodocs ignore the
@mock
decorator, so they treat this definition at face value and do static analysis, propagating the return type (genome_kit.Gene
) when doing autocomplete:for gene in genome.genes: # genes is of type GeneTable gene.<autocomplete> # gene is of type Gene
When a class is registered with the C++ backend, all mock properties and methods are deleted from the type, so that all requests for those attributes will fall through directly to the C++ implementation in the base class.
See the
PyDeleteMockAttrs
C++ function for the code that strips some mock symbols from each registered type.
- mock_result(result_type)[source]
Return a mock result of one or more types.
This function is used to give IDEs a hint during static analysis.
This function merely calls
mock_unreachable()
before returning an instance of result_type. It’s important to print the error message before trying to instantiate the return type.See also:
mock()
- mock_unreachable()[source]
Prints an unreachable code warning message.
If a mock attribute or method is ever called, it means there’s some mismatch between a Python-defined type and a C++-defined type.
An alternative behaviour would be to raise an exception, but that causes headaches when pylint and PyCharm try to warn about unreachable code.
The reason this function is provided separately from mock_result is that it allows the user to either return nothing:
def foo(self): mock_unreachable()
or to return an object that provides more complex type hints:
def foo(self): mock_unreachable() return [str()] # type: List[str]
See also:
mock_result()
- strip_mock_bases(subtype)[source]
Strips the last base classes at runtime from the given type.
Given a class hierarchy X->Y->Z where X and Z are defined in Python but Y is defined in C++, then the IDE does not know that Z inherits from X, and so it cannot provide autocomplete for members of X.
This decorator is a hacky way to trick IDEs into statically parsing out the inheritance structure, while removing the ‘trick’ at runtime.
For example, the class hierarchy we want for Gene is simply:
@_cxx.register class Gene(_cxx.Gene): # Interval -> _cxx.Gene -> Gene ...
However, IDEs will not see that Gene inherits from Interval. So, instead we declare Gene as follows:
@strip_mock_bases @_cxx.register class Gene(_cxx.Gene, Interval): ...
The IDE will now think that Gene inherits Interval via multiple inheritance, and will IDE autocomplete accordingly, even though at runtime we’ll get the linear inheritance scheme that we want.