API Documentation

Interval

class Interval(chromosome, strand, start, end, reference_genome, anchor=None, anchor_offset=0)[source]

A genomic interval.

An Interval represents a contiguous stranded genomic interval, identified by a chromosome, strand, start position, end position, and reference genome.

__init__(chromosome, strand, start, end, reference_genome, anchor=None, anchor_offset=0)[source]

Initialize an interval from a DNA0 position.

Parameters:
  • chromosome (str) – The chromosome name (‘chr1’, …).

  • strand (chr) – The strand (“+” or “-“).

  • start (int) – The start position of the interval (inclusive, 0-based).

  • end (int) – The end position of the interval (exclusive, 0-based).

  • reference_genome (str | Genome) – The reference genome in which the interval is defined, for example “hg19”. If a genome object is given, its reference genome will be used.

  • anchor (int | str | Interval | None, optional) – The anchor is only relevant when extracting DNA from a VariantGenome and specifies how length-changing variants are applied. Conceptually, the anchor defines a fixed point when the sequence changes length. See (TODO: link) for the details. The anchor can be any position in the interval. If anchor is None, then length-changing variants will change the length of the interval. Anchor can be a string: 5p or 3p sets the anchor to the respective end of the interval, respecting the sense strand. start or end set the anchor to the start or end of the interval on the forward strand. When anchor is center then the anchor is set to the middle position of the interval (if the interval has odd length, the position is rounded towards the 5p end of the interval, respecting the sense strand). If anchor is an integer or an empty interval, the anchor is set to that position.

  • anchor_offset (int, optional) – The anchor offset is a shift that’s applied to the anchor only on the variant genome. The purpose is to align a position on the reference genome with a position inside an insertion. The anchor itself would only allow to align with the beginning or end of the insertion.

property chromosome

The chromosome name.

A shorthand property chrom is also available.

Returns:

The name of the chromosome this interval is defined on (e.g. “chr1”).

Return type:

str

property strand

The strand (“+” or “-“).

Returns:

The strand this interval is defined on.

Return type:

chr

property start

The integer position of the start (inclusive, 0-based).

Returns:

The start position of this interval within the reference genome.

Return type:

int

property end

The integer position of the end (exclusive, 0-based).

Returns:

The end position of the this interval within the reference genome.

Return type:

int

property end5

The 5p end of this interval, which is a length-0 interval.

Not to be confused with the end attribute, which

Returns:

The length-0 interval at the 5p end of the interval, i.e. the gap preceding the most upstream base in the interval.

Return type:

Interval

property end3

The 3p end of this interval, which is a length-0 interval.

Returns:

The length-0 interval at the 3p end of the interval, i.e. the gap succeeding the most downstream base in the interval.

Return type:

Interval

property reference_genome

The reference genome (assembly name in UCSC format).

A shorthand property refg is also available.

Returns:

The name of the reference genome this interval is defined in.

Return type:

str

property anchor

The anchor position.

Returns:

The anchor position associated with this interval, if any.

Return type:

int | None

property anchor_offset

The anchor offset.

Returns:

The anchor offset associated with this interval, if any.

Return type:

int

shift(amount)[source]

The interval shifted upstream/downstream.

Parameters:

amount (int) – The amount by which to shift the start and end positions. A positive amount shifts downstream according to the current strand, and a negative amount shifts the position upstream.

Returns:

The shifted interval.

Return type:

Interval

expand(upstream, dnstream=None)[source]

The interval expanded upstream/downstream.

Can also be called with a single argument, in which case upstream and dnstream take on the same value.

Parameters:
  • upstream (int) – Number of upstream positions to add to the interval.

  • dnstream (int) – Number of downstream positions to add to the interval. If not specified, defaults to upstream (i.e. symmetric).

Returns:

An interval containing upstream + downstream + 1 positions.

Return type:

Interval

intersect(other)[source]

The interval intersecting both intervals.

Parameters:

other (Interval) – interval to intersect with this interval.

Returns:

The intersecting interval (possibly empty) or None if the intervals are disjoint

Return type:

Interval | None

subtract(other: Interval) Sequence[Interval][source]

A list of intervals, representing this interval with the its intersection of another interval removed.

Parameters:

other – interval to subtract its intersection from this interval.

Returns:

The intervals after subtraction. If the intervals were disjoint, returns [self].

Return type:

Sequence[Interval]

is_positive_strand()[source]

Test if this interval is on the positive strand.

Returns:

True if strand is equal to ‘+’

Return type:

bool

as_positive_strand()[source]

The same interval on the positive strand.

Returns:

The same interval, but on the positive (‘+’) strand.

Return type:

Interval

as_negative_strand()[source]

The same interval on the negative strand.

Returns:

The same interval, but on the negative (‘-’) strand.

Return type:

Interval

as_opposite_strand()[source]

The same interval on the opposite strand.

Returns:

The same interval, but on the opposite strand.

Return type:

Interval

upstream_of(other)[source]

Test if this interval is strictly upstream of other.

Parameters:

other (Interval) – Any coordinate or interval subclass defined on the same reference genome and strand as this interval.

Returns:

True if this interval is strictly upstream of other, with no overlap.

Return type:

bool

dnstream_of(other)[source]

Test if this interval is strictly downstream of other.

Parameters:

other (Interval) – Any coordinate or interval subclass defined on the same reference genome and strand as this interval.

Returns:

True if this interval is strictly downstream of other, with no overlap.

Return type:

bool

contains(other)[source]

Test if other lies within the extents of this interval.

Parameters:

other (Interval) – Any coordinate or interval subclass defined on the same reference genome and strand as this interval.

Returns:

True if other lies entirely within the extents of this interval.

Return type:

bool

within(other)[source]

Test if this interval lies within the extents of other.

Use of interval in other expressions is also available.

Parameters:

other (Interval) – Any coordinate or interval subclass defined on the same reference genome and strand as this interval.

Returns:

True if this interval lies within the extents of other.

Return type:

bool

overlaps(other)[source]

Test if this interval overlaps the extents of other.

Parameters:

other (Interval) – Any coordinate or interval subclass defined on the same reference genome and strand as this interval.

Returns:

True if this interval overlaps the extents of other.

Return type:

bool

static from_dna0_coord(chromosome, strand, position, reference_genome)[source]

Create an interval spanning a single DNA0 position.

Equivalent to Interval(chromosome, strand, position, position+1, reference_genome).

Parameters:
  • chromosome (str) – The chromosome name (e.g. “chr1”).

  • strand (int) – The strand (“+” or “-“).

  • position (int) – The position the interval should span (0-based).

  • reference_genome (str | Genome) – The reference genome in which the interval is defined, for example “hg19”. If a genome object is given, its reference genome will be used.

Returns:

The resulting length-1 interval spanning position.

Return type:

Interval

static from_dna0(chromosome, strand, start, end, reference_genome)[source]

Create an interval from DNA0 positions.

Equivalent to calling Interval.

Parameters:
  • chromosome (str) – The chromosome name (e.g. “chr1”).

  • strand (int) – The strand (“+” or “-“).

  • start (int) – The start position of the interval (inclusive, 0-based).

  • end (int) – The end position of the interval (exclusive, 0-based).

  • reference_genome (str | Genome) – The reference genome in which the interval is defined, for example “hg19”. If a genome object is given, its reference genome will be used.

Returns:

The resulting interval.

Return type:

Interval

static from_dna1(chromosome, strand, start, end, reference_genome)[source]

Create an interval from DNA1 positions.

Positions are automatically converted to DNA0 as (start-1, end).

Parameters:
  • chromosome (str) – The chromosome name (e.g. “chr1”).

  • strand (int) – The strand (“+” or “-“).

  • start (int) – The start position of the interval (inclusive, 1-based).

  • end (int) – The end position of the interval (inclusive, 1-based).

  • reference_genome (str | Genome) – The reference genome in which the interval is defined, for example “hg19”. If a genome object is given, its reference genome will be used.

Returns:

The resulting interval.

Return type:

Interval

static from_rna1(chromosome, start, end, reference_genome)[source]

Create an interval from RNA1 positions.

The start/end positions are automatically converted to DNA0 as min(abs(start),abs(end))-1 and max(abs(start), abs(end))) respectively.

Parameters:
  • chromosome (str) – The chromosome name (e.g. “chr1”).

  • strand (int) – The strand (“+” or “-“).

  • start (int) – The start position of the interval (signed, inclusive, 1-based).

  • end (int) – The end position of the interval (signed, inclusive, 1-based).

  • reference_genome (str | Genome) – The reference genome in which the interval is defined, for example “hg19”. If a genome object is given, its reference genome will be used.

Returns:

The resulting interval.

Return type:

Interval

static spanning(interval1, interval2)[source]

Create the smallest interval spanning the given pair of existing intervals, or raise an exception if given intervals not on the same chromosome, strand, and reference genome. The given intervals are not required to be overlapping, and may both be of zero length.

Example:

>>> a = Interval("chr1", "+", 100, 150, "hg19")
>>> b = Interval("chr1", "+", 200, 250, "hg19")
>>> Interval.spanning(a, b)
Interval("chr1", "+", 100, 250, "hg19")

Notes

This method is not anchor aware, since that requires a list of variants for context. See spanning() for that use case.

Parameters:
  • interval1 (Interval) – an interval

  • interval2 (Interval) – another interval

Returns:

The resulting spanning interval.

Return type:

Interval

property midpoint

Identify the midpoint of an interval.

Example:

>>> Interval("chr1", "+", 100, 150, "hg19").midpoint
Interval("chr1", "+", 125, 125, "hg19")
>>> Interval("chr1", "+", 100, 151, "hg19").midpoint
Interval("chr1", "+", 125, 126, "hg19")
Returns:

The interval spanning the midpoint of the interval.

Return type:

Interval

distance(other, method='midpoint')[source]

The distance this interval and another.

Example:

>>> interval_a = Interval("chr1", "+", 100, 150, "hg19")
>>> interval_b = Interval("chr1", "+", 200, 300, "hg19")
>>> interval_c = Interval("chr1", "+", 400, 500, "hg19")
>>> interval_a.distance(interval_b)
125.0
>>> interval_a.distance([interval_b, interval_c])
[125.0, 325.0]
Parameters:
  • other (Interval or list) – an interval or list of intervals

  • method (str) – determines which (point) attribute of this and the other interval to use when computing the distance. Must be one of midpoint, end5, or end3.

Returns:

Distance to the interval. If other is of type Interval, return distance as a float; otherwise, a list of distances.

Return type:

float or list

as_dna0()[source]

Returns (start, end) positions using DNA0 convention.

Returns:

The positions (start, end) using DNA0 convention.

Return type:

tuple

as_dna1()[source]

Returns (start, end) positions using DNA1 convention.

Returns:

The positions (start, end) using DNA1 convention.

Return type:

tuple

as_rna1()[source]

Returns (start, end) positions using RNA1 convention.

Returns:

The positions (start, end) using RNA1 convention.

Return type:

tuple

as_ucsc()[source]

Returns this interval as a UCSC browser coordinate.

Returns:

A string in UCSC browser convention, e.g. "chr5:500201-5002010".

Return type:

str

with_anchor(anchor, anchor_offset=0)[source]

Returns an anchored version of this interval.

Returns:

An anchored version of this interval. The previous anchor, if any, is ignored.

Return type:

Interval

__repr__()[source]

Return repr(self).

__str__()[source]

Return str(self).

Variant

class Variant(chromosome, start, ref, alt, reference_genome)[source]

A variant.

Bases: Interval

__init__(chromosome, start, ref, alt, reference_genome)[source]

Initialize a variant from a DNA0 position.

Parameters:
  • chromosome (str) – The chromosome name (‘chr1’, …).

  • start (int) – The start of the reference genome spanned by ref (inclusive, 0-based).

  • ref (str) – The reference allele.

  • alt (str) – The alternate allele.

  • reference_genome (str | Genome) – The reference genome. See Interval.

property interval

The interval spanned by this variant in the reference genome.

Note that Variant also inherits from Interval.

Returns:

The interval spanned by this variant’s ref sequence in the reference genome.

Return type:

Interval

property position

The integer position of the start (inclusive, 0-based).

This property is a synonym for start and is provided for historical reasons. For consistency with other interval-based objects, prefer the start property.

Returns:

The start position of this interval within the reference genome.

Return type:

int

property ref

The reference allele as a DNA string.

Returns:

The sequence of the reference allele that is altered by this variant.

Return type:

str

property alt

The alternate allele as a DNA string.

Returns:

The sequence of the alternate allele that this variant represents.

Return type:

str

as_variant_string()[source]

Returns a string representation of the variant.

See also

from_string()

Returns:

The variant in the format "chromosome:start:ref:alt", where start is in DNA1 coordinates.

Return type:

str

__repr__()[source]

Return repr(self).

__str__()

Returns a string representation of the variant.

See also

from_string()

Returns:

The variant in the format "chromosome:start:ref:alt", where start is in DNA1 coordinates.

Return type:

str

static from_string(variant, reference_genome)[source]

Initialize a variant object from a variant string.

Parameters:
  • variant (str) – A variant in the format "chromosome:start:ref:alt" where start is in DNA1 coordinates.

  • reference_genome (Genome) – The reference genome that variant diverges from.

static spanning(x, y, variants=None)[source]

Extends spanning() such that it is aware of variants and anchors.

Example:

>>> from genome_kit import Genome, Variant, Interval
>>> hg19 = Genome("hg19")
>>> v = Variant("chr1", 100, "", 10 * "N", hg19)
>>> a = Interval("chr1", "+", 100, 150, hg19, 100)
>>> b = Interval("chr1", "+", 200, 250, hg19)
>>> Interval.spanning(a, b)
Interval("chr1", "+", 100, 250, "hg19")
>>> Variant.spanning(a, b, [v])
Interval("chr1", "+", 100, 260, "hg19", 100)
Parameters:
  • x (Interval) – an interval

  • y (Interval) – another interval

  • variants (list of Variant or None, optional) – variants to apply to x and y; variants cannot overlap.

Returns:

The resulting spanning interval.

Return type:

Interval

__getstate__() bytes[source]

Helper for pickle.

Genome

class Genome(config)[source]

Initialize a reference genome and its corresponding resources (DNA, annotations, etc).

A genome object represents a reference genome and resources available for it, such as DNA and annotations:

>>> genome = Genome("gencode.v19")
>>> exon = genome.exons[0]                                # 1st annotated exon.
>>> dna  = genome.dna(exon)                               # Extract DNA.

The initializer argument specifies a versioned resource:

# Access resources that are hg19
Genome("hg19")

# GENCODE v29 implies hg38
Genome("gencode.v29")

The currently recognized strings depend on the data made available by data_manager. Some common ones are:

reference_genomes = [
    "hg19",
    "hg19.p13.plusMT",
    "hg38",
    "hg38.p12",
    "hg38.p13",
    "hg38.p14",
    "mm10.p6",
    "mm39",
    "rn6",
    "macFas5",
    "susScr11",
]

annotations = [
    "gencode.v19",                  # implies hg19.p13.plusMT
    "gencode.v29",                  # implies hg38.p12
    "gencode.v29.basic",            # implies hg38.p12
    "gencode.v29lift37",            # implies hg19
    "gencode.v29lift37.basic",      # implies hg19
    "gencode.v41",                  # implies hg38.p13
    "gencode.v41.basic",            # implies hg38.p13
    "gencode.v41lift37",            # implies hg19
    "gencode.v41lift37.basic",      # implies hg19
    "ucsc_refseq.2017-06-25",       # implies hg19
    "ncbi_refseq.v105.20190906",    # implies hg19.p13.plusMT
    "ncbi_refseq.v109",             # implies hg38.p12
    "ncbi_refseq.v110",             # implies hg38.p14
    "gencode.vM19",                 # implies mm10.p6
    "gencode.vM19.basic",           # implies mm10.p6
    "gencode.vM30",                 # implies mm39
    "gencode.vM30.basic",           # implies mm39
    "gencode.vM31",                 # implies mm39
    "gencode.vM31.basic",           # implies mm39
    "ncbi_refseq.m39.v109",         # implies mm39
    "ensembl.Rnor_6.0.88",          # implies rn6
    "ncbi_refseq.Macfas_5.0.v101",  # implies macFas5
    "ensembl.Macfas_5.0.95",        # implies macFas5
    "ensembl.Sscrofa11.1.98",       # implies susScr11
]

If a resource is requested after object creation, but the version needed cannot be disambiguated from config, an exception is raised.

Parameters:

config (str) – Identifies a versioned resource that this genome object is expected to serve.

__init__(config)[source]
__reduce__()[source]

Helper for pickle.

static __new__(cls, config)[source]
interval(chromosome, strand, start, end)[source]

Shorthand method for creating an Interval with the same refg as the genome object called on.

variant(variant_string)[source]

Shorthand method for creating a Variant with the same refg as the genome object called on.

dna(interval, allow_outside_chromosome=True)[source]

Extract DNA from a reference genome.

If interval is on the negative strand, the reverse complement sequence is returned. If interval is outside the range of the chromosome, an IndexError is raised.

Parameters:
  • interval (Interval) – The query interval.

  • allow_outside_chromosome (bool) – If False, does not allow the interval to be outside the range of the chromosome. Attempting to pass an interval outside the chromosome range will raise an IndexError. If not specified, defaults to True. Any base outside the chromosome range will be returned as an ‘N’.

Returns:

The DNA sequence.

Return type:

str

variant_dna(interval, variants, allow_outside_chromosome=True)[source]

Extract DNA from a genome with specific variants applied to it.

Follows the same rules as dna(). Intended for applying sets of variants that may overlap interval.

Parameters:
Returns:

The variant DNA sequence.

Return type:

str

property annotation

Access to annotations.

A shorthand property anno is also available.

Returns:

An object that provides access to annotated genes, transcripts, and exons.

Return type:

GenomeAnnotation

property genes

Access to annotated genes.

This is shorthand for annotation.genes.

Returns:

An object that supports looping or queries over annotated genes.

Return type:

GeneTable

property transcripts

Access to annotated transcripts.

This is shorthand for annotation.transcripts.

A shorthand property trans is also available.

Returns:

An object that supports looping or queries over annotated transcripts.

Return type:

TranscriptTable

property exons

Access to annotated exons.

This is shorthand for annotation.exons.

Returns:

An object that supports looping or queries over annotated exons.

Return type:

ExonTable

property introns

Access to annotated introns.

This is shorthand for annotation.introns.

Returns:

An object that supports looping or queries over annotated introns.

Return type:

IntronTable

property cdss

Access to annotated coding sequences (CDSs).

This is shorthand for annotation.cdss.

Returns:

An object that supports looping or queries over annotated coding sequences.

Return type:

CdsTable

property utr5s

Access to annotated UTR5 sequences.

This is shorthand for annotation.utr5s.

Returns:

An object that supports looping or queries over annotated coding sequences.

Return type:

Utr5Table

property utr3s

Access to annotated UTR3 sequences.

This is shorthand for annotation.utr3s.

Returns:

An object that supports looping or queries over annotated coding sequences.

Return type:

Utr3Table

find_motif(interval, motif, match_position=0, find_overlapping_motifs=False)[source]

Find a genomic motif in an interval on a Genome.

Parameters:
  • interval (Interval) – A genomic interval to search for the motif.

  • motif (str) – A genomic motif to search for.

  • match_position (int | str) – Very often you want to find a position after a motif or in the middle of a motif. For example, when you are searching for a AG core acceptor site motif you want to match the 3p end of the AG motif to return the position of the putative splice site. To do this you can specify a match position as string with the values "3p" or "5p", or you can pass an integer that indicates a position in the motif with 0 <= match_position <= len(motif).

  • find_overlapping_motifs (bool) – Whether to return matches for overlapping motifs. The default is to only find non-overlapping motifs.

Returns:

A list of all matches (length-0 intervals) found in the query interval.

Return type:

list of Interval

Example

>>> genome37 = Genome('hg19')
>>> interval = genome37.interval('chr1', '+', 40000, 40080)
>>> motif = 'AA'
>>> genome37.find_motif(interval, motif)  
[Interval("chr1", "+", 40031, 40031, "hg19"),
 Interval("chr1", "+", 40061, 40061, "hg19")]
property reference_genome

The name of the reference genome.

Returns:

The name of the reference genome that any resources map to, e.g. “hg19” or “hg38”.

Return type:

str

property chromosomes

A list of valid chromosome names for this genome.

chromosome_size(chromosome)[source]

Returns the size of the given chromosome, in bp.

Returns:

The size of the given chromosome, as number of basepairs.

Return type:

int

property config

The configuration strings of this genome.

Returns:

The config used to initialize this object.

Return type:

str

property data_dir

The data_dir from which resources are opened.

Returns:

The path to where resource files are being opened.

Return type:

str

__repr__()[source]

Return repr(self).

appris_transcripts(gene=None)[source]

Returns a list of transcript objects ordered by our custom APPRIS index. If a gene is provided, returns a list of transcripts for that gene ordered by decreasing priority.

Parameters:

gene (Gene | str, optional) – Restrict to a specific gene. If using str, use the geneID.

Returns:

A list of transcripts that have APPRIS scores available. If a gene is provided as input, a list of transcripts for that gene in the order from most to least principle; so the first transcript should be used as the canonical.

Return type:

list of Transcript

Raises:

ApprisNotAvailableError: – APPRIS data is not available for this genome’s annotations.

appris_principality(transcript, as_string=False)[source]

Returns the APPRIS principality index (isoform specific number) for this transcript. An APPRIS principality index is an integer from 0..6 where 0 is considered most prinicpal and 6 considered least principal. In the original APPRIS database, the 7 principality levels are reported using strings, in the following order:

appris_index_names = [
    "PRINCIPAL:1",     # index 0
    "PRINCIPAL:2",     # index 1
    "PRINCIPAL:3",     # index 2
    "PRINCIPAL:4",     # index 3
    "PRINCIPAL:5",     # index 4
    "ALTERNATIVE:1",   # index 5
    "ALTERNATIVE:2",   # index 6
]

See the APPRIS database documentation for details.

The APPRIS version defaults to 2017_05.v22 most annotations; see _build_appris.REMOTE_FILES for the rest.

Parameters:
  • transcript (Transcript | str) – A transcript to query.

  • as_string (bool, optional) – Toggle True if the APPRIS principality string (e.g. “PRINCIPAL:1”) is desired.

Returns:

The APPRIS principality index [0..6] of this transcript, or the string representation if as_string=True

Return type:

int | str

Raises:

ApprisNotAvailableError: – APPRIS data is not available for this genome’s annotations.

__weakref__

list of weak references to the object

VariantGenome

class VariantGenome(reference_genome, variants)[source]

A reference genome with a number of variants applied to it.

A variant genome behaves like a Genome object, object, except any requests for DNA sequence (or quantities that depend on DNA sequence) will be returned with the variants already applied. If the variants change the length of the sequence, then the interval’s anchor and anchor_offset determine how the query interval will be aligned to the variant genome’s coordinate system.

TODO: see explanation of anchors.

__init__(reference_genome, variants)[source]

Initialize a variant genome.

Parameters:
  • reference_genome (Genome) – The reference genome to which variants are applied.

  • variants (Variant | list of (Variant)) – Either a single variant or a list of variants. Aligned-normalization of variants is not yet implemented, so the user is currently responsible for not passing variants that overlap. The variants must be provided as a Variant object. String variants are no longer allowed, please see Variant for more information. Since we assume that variants never overlap one another this means that all variant positions are with respect to the reference genome.

dna(interval, allow_outside_chromosome=True)[source]

Extract variant DNA for an anchored interval or coordinate.

Parameters:

interval (Interval) – If interval is on the negative strand, the reverse complement sequence is returned. The interval.anchor and interval.anchor_offset attributes determine the policy of how length-changing variants are applied.

Returns:

The variant DNA sequence.

Return type:

str

find_motif(interval, motif, match_position=0, find_overlapping_motifs=False)[source]

Find a genomic motif in an interval on a VariantGenome.

This method finds all occurrences of a genomic motif on a VariantGenome. It can handle all kinds of variants including indels and complex variants. When a motif is found within an insertion, it will set the anchor_offset attribute on the returned Interval.

Parameters:
  • interval (Interval) – A genomic interval to search for the motif.

  • motif (str) – A genomic motif to search for.

  • match_position (int | str) – Very often you want to find a position after a motif or in the middle of a motif. For example, when you are searching for a AG core acceptor site motif you want to match the 3p end of the AG motif to return the position of the putative splice site. To do this you can specify a match position as string with the values "3p" or "5p", or you can pass an integer that indicates a position in the motif with 0 <= match_position <= len(motif).

  • find_overlapping_motifs (bool) – Whether to return matches for overlapping motifs. The default is to only find non-overlapping motifs.

Returns:

A list of all matches (length-0 intervals) found in the query interval.

Return type:

py:class`list` of Interval

Example

>>> from genome_kit import Genome, Variant
>>> genome = Genome('hg19')
>>> variant = Variant.from_string("chr1:40033:A:G", genome)
>>> variant_genome = VariantGenome(genome, variant)
>>> interval = genome.interval('chr1', '+', 40000, 40080)
>>> motif = 'TAG'
>>> variant_genome.find_motif(interval, motif)
[Interval("chr1", "+", 40030, 40030, "hg19", 40030)]
__repr__()[source]

Return repr(self).

Tracks

GenomeDNA

class GenomeDNA[source]

Access to the DNA sequence of a reference genome.

This object should be accessed through the dna property of Genome:

>>> from genome_kit import Genome, Interval
>>> interval = Interval("chr5", "+", 50000, 50010, "hg19")
>>> genome = Genome("hg19")
>>> genome.dna(interval)
'TAAACCACAT'
__call__(interval, allow_outside_chromosome=True)[source]

Extract DNA from a reference genome.

If interval is on the negative strand, the reverse complement sequence is returned. If interval is outside the range of the chromosome, an IndexError is raised.

Parameters:
  • interval (Interval) – The query interval.

  • allow_outside_chromosome (bool) – If False, does not allow the interval to be outside the range of the chromosome. Attempting to pass an interval outside the chromosome range will raise an IndexError. If not specified, defaults to True. Any base outside the chromosome range will be returned as an ‘N’.

Returns:

The DNA sequence.

Return type:

str

property reference_genome

The name of the reference genome.

Returns:

The name of the reference genome from which DNA will be extracted, e.g. “hg38”.

Return type:

str

property filename

The path to the file from which DNA is extracted.

Returns:

The path to the file, e.g. “/data/hg19.2bit”

Return type:

str

__repr__()[source]

Return repr(self).

GenomeTrack

class GenomeTrack(infile)[source]

Access to a genomic data track.

A GenomeTrack provides a read-only view of a .gtrack file. A track can be easily opened and queried:

>>> interval = Interval('chr17', '+', 41246881, 41246886, 'hg19')
>>> track = GenomeTrack("hg19.phastcons_mammal.f4.gtrack")
>>> track(interval)
array([[ 1.   ],
       [ 0.857],
       [ 0.571],
       [ 0.357],
       [ 0.286]], dtype=float16)

>>> track(interval.as_opposite_strand())
array([[ 0.286],
       [ 0.357],
       [ 0.571],
       [ 0.857],
       [ 1.   ]], dtype=float16)

You can also open a track with a specific scope:

>>> with GenomeTrack("foo.gtrack") as track:
...     values = track(interval)

Tracks can be encoded in formats called etypes. The available etypes trade representation power for storage size and query speed:

etypes = [  # DECODABLE TYPES          RANGE                ENCODED TYPE
    "m0",   # float16/32, uint8, bool  {1}                  0-bit (mask)
    "u1",   # float16/32, uint8, bool  {0..1}               1-bit
    "u2",   # float16/32, uint8        {0..3}               2-bit
    "u3",   # float16/32, uint8        {0..7}               3-bit
    "u4",   # float16/32, uint8        {0..15}              4-bit
    "u5",   # float16/32, uint8        {0..31}              5-bit
    "u6",   # float16/32, uint8        {0..63}              6-bit
    "u8",   # float16/32, uint8        {0..255}             8-bit
    "i8",   # float16/32, int8         {-128..127}          8-bit
    "f2",   # float16/32               [-65504,65504]       2-bit dictionary
    "f3",   # float16/32               [-65504,65504]       3-bit dictionary
    "f4",   # float16/32               [-65504,65504]       4-bit dictionary
    "f5",   # float16/32               [-65504,65504]       5-bit dictionary
    "f6",   # float16/32               [-65504,65504]       6-bit dictionary
    "f8",   # float16/32               [-65504,65504]       8-bit dictionary
    "f16",  # float16/32               [-65504,65504]       16-bit float
    "f32",  # float16/32               [-1.2e-38, 3.4e38]   32-bit float
]

The default dictionary for f2..8 types is the range [0,1] computed as np.linspace(0.0, 1.0, 2**num_bits, dtype=np.float16).

Decoded types are called dtypes, and are indeed represented by numpy dtypes:

dtypes = [
    np.bool_,    # {False, True}
    np.uint8,    # {0..255}
    np.int8,     # {-128..127}
    np.float16,  # [-65504, 65504] IEEE 754 half-precision
    np.float32,  # [-1.2e-38, 3.4e38]
]

New tracks can be can created using GenomeTrackBuilder.

__init__(infile)[source]

Open a .gtrack file.

Parameters:

infile (str) – The .gtrack file to open.

__call__(interval, dtype=None, out=None)[source]

Extract data from a genomic track.

Data is always returned in sense-strand order, meaning it is always sensitive to the strand of the interval, even on an unstranded track.

You can optionally specify the dtype that a track should be decoded to.

Parameters:
  • interval (Interval) – The stranded query interval.

  • dtype (type) – Optional. The numpy dtype of the resulting array. Ignored if out is provided.

  • out (ndarray) – Optional. A numpy array to hold the result. If provided, it overrides dtype and it must have a shape that the inputs broadcast to. If not provided or None, a freshly-allocated array is returned.

Returns:

The decoded track data with len(interval) rows and dim columns.

Return type:

ndarray

property dim

The dimensionality of the track.

Returns:

The number of columns in each returned array.

Return type:

int

property resolution

The resolution of the track.

Returns:

The number of genomic positions spanned by each datum.

Return type:

int

property stranded

The strandedness of the track.

Returns:

Whether the negative strand stores data unique from the positive strand.

Return type:

bool

property etype

The encoding type of the track.

Returns:

A string identifying the encoding type.

Return type:

str

property dtype

The default dtype for the track.

Returns:

The numpy dtype that this track will decode to by default.

Return type:

type

property reference_genome

The name of the reference genome this track is mapped to.

Returns:

The name of the reference genome, e.g. “hg38”.

Return type:

str

property filename

The path to .gtrack file from which data is extracted.

Returns:

The path to the file, e.g. “/data/phastcons.gtrack”

Return type:

str

intervals()[source]

The list of intervals covered by this track.

Returns:

The list of intervals with data in this track.

Return type:

list of Interval

histogram(region, separate_dims=False)[source]

Generate a histogram of the values in the track.

Returns a dictionary where each key is a float appearing at least once in the track, and each value is a count of the number of times it occurred. NaN values are omitted.

If region is a Genome, computes a genome-wide histogram. If region is an interval or list of intervals, computes the histogram over that region only.

If separate_dims is true, returns a list of dim histograms, one for each track dimension.

Parameters:
  • region (Genome | Interval | list of Interval) – Specific interval(s) upon which to compute the histogram.

  • separate_dims (bool) – Optional. Whether to generate a single histogram over all dimensions, or to return a separate histogram for each.

Returns:

A dictionary { value : count } where the value appearing in the track count times. Zero

Return type:

dict | list of dict

close()[source]

Close the file handle.

Note that if you use a with statement the file handle is automatically closed:

# Open the file
with GenomeTrack('foo.gtrack') as track:
    ...
# <-- File is now closed
static gtrack_version()[source]

The gtrack file format version that this build supports.

Attempting to open a gtrack file of a mismatched version will raise an IOError.

Returns:

The version number.

Return type:

int

__repr__()[source]

Return repr(self).

GenomeTrackBuilder

class GenomeTrackBuilder(outfile, etype, strandedness, reference_genome, dim=1, resolution=1)[source]

Builds a .gtrack file.

Creating a .gtrack file with GenomeTrackBuilder generally involves four steps:

  1. Open the file for writing.

  2. Set further configuration options.

  3. Set data on intervals, either as numpy arrays or from disk.

  4. Call finalize().

For example, assuming we had a list of genomic intervals and corresponding data arrays we wanted to fill them with:

# 1. Open a .gtrack 8-bit encoding of floats in range [0, 1]
track = GenomeTrackBuilder("foo.gtrack", "f8", "single_stranded", Genome("hg19"))

# 2. Configure the track further
track.set_default_value(0)          # Fill gaps in the data with 0

# 3. Set data for each interval
for interval, data in entries:
    track.set_data(interval, data)

# 4. Finish writing the file
track.finalize()

Tracks can have any dimension (the dim argument) or resolution (the resolution argument). Resolution allows track data to be specified and stored at coarser than 1bp resolution.

For example, a track with resolution=5 can load a WIG file with step/span of 5bp:

fixedStep chrom=chr1 start=1 step=5 span=5
0.57
0.23
...

The resulting track is equivalent to the following:

0    1    2    3    4    5    6    7    8    9    ...  <-- position
0.57 0.57 0.57 0.57 0.57 0.23 0.23 0.23 0.23 0.23 ...  <-- decoded value
__init__(outfile, etype, strandedness, reference_genome, dim=1, resolution=1)[source]

Open a .gtrack for writing.

Parameters:
  • outfile (str) – The .gtrack file to create.

  • etype (str) – The encoding format. See GenomeTrack for details.

  • strandedness (str) –

    Determines the order by which the data is applied.

    • "single_stranded": both strands share the same data. The data is applied in Interval coordinate (reference strand) order.

    • "strand_unaware": ignores the Interval strand, data is applied in Interval coordinate (reference strand) order.

    • "strand_aware": data is applied from 5” end to 3” end (sense strand order).

    Negative strand data from bedgraph or wig files must be provided from a separate file. When calling set_data_from_bedgraph() or set_data_from_wig(), set "strand_unaware".

  • reference_genome (Genome) – The reference genome for the track data to build.

  • dim (int) – Optional. The number of dimensions in the resulting track.

  • resolution (int) – Optional. The resolution of the track data, in genomic positions.

set_default_value(value)[source]

Set the fill value for gaps in the track.

The default value does not need to be encodable. For example, a default value of -1.0 for a dictionary that does not contain -1.0 is fine.

Parameters:

value (float) – The value to fill gaps with.

set_sparsity(min_run=48, min_delta=0.0)[source]

Enable automatic sparsification of data intervals.

Reduces file size of tracks that are dominated by default_value, with non-default values appearing at sparse intervals. Highly recommended for variableStep WIG files.

The default behaviour of a track is to encode all data passed to set_data(). After calling set_sparsity(), the encoder will analyze the data arrays passed in and avoid encoding runs of default_value.

Specifically, if there exists a run of at least min_run values such that each abs(data[i] - default_value) <= min_delta, then that entire run will be treated as default_value and not explicitly stored in the file.

Note that, in order to exclude a run from the track, a 12-byte entry must be addd to the track index, so setting min_run to very small values can actually increase file size and will definitely increase query time.

Parameters:
  • min_run (int) – Optional. The minimum length of a run before it’s excluded.

  • min_delta (float) – Optional. The minimum delta for a value to be considered distinct from default_value.

set_clamping()[source]

Clamp all values to the encodable range.

By default, a value outside the encodable range will trigger an error, to notify the user that his/her data may not be encoded properly by the current format.

Calling this method shows intent that values should be clamped to the encodable range. Currently allowed only for f-type encodings.

set_dict(dict)[source]

Set the dictionary for etypes f2..8.

The dictionary must be a float16/32 numpy array. For etype f<n>, there must be 2**n entries in the array. Numerical values must be in non-decreasing order. The last entry in dict can be np.nan if you wish to be able to encode NaN at specific positions.

Parameters:

dict (np.ndarray) – The array of values comprising the dictionary.

set_transform(a, b, c, d)[source]

Transform data from range [a, b] to range [c, d].

This specifies that any time data is about to be encoded, it will undergo an affine transformation from numeric range [a, b] to range [c, d].

Parameters:
  • a (float) – The source range.

  • b (float) – The source range.

  • c (float) – The target range.

  • d (float) – The target range.

set_restriction(restriction)[source]

Restrict data to a specific interval.

The restriction interval acts as a allowed region, so that all data outside that region is either ignored or cropped away by set_data().

The restriction interval’s strand is ignored, i.e. the restriction always allows data on either strand.

This method is useful for making small version of full-sized data pipeline, for the sake of creating unit tests or iterating faster during development.

If the restriction interval is not aligned with the track resolution, some then the restriction interval will be expanded until it is aligned.

Parameters:

restriction (Interval) – The interval on which to keep data.

set_data(interval, data)[source]

Set the track data on a specific interval.

If the interval overlaps a previously-specified interval, an error is raised.

The dtype of the data array must be compatible with the etype of the track being built.

Normally, data must be an NxM array where N is the length of interval and M is dim.

If track resolution R is greater than 1, then data must be (N/R)xM where N is the length of interval. The start and end of interval must be aligned to R. In other words, data is supplied (and stored) at a coarsened resolution.

Parameters:
  • interval (Interval) – The interval to fill with data.

  • data (np.ndarray) – The data array, ordered according to the strandedness argument passed to the builder’s constructor.

set_data_from_wig(pos_strand_file, neg_strand_file=None)[source]

Load all data from a .wig or .wig.gz file.

Loads all data from a fixedStep or variableStep WIG file.

Set strandedness="strand_unaware" in __init__().

The number of data columns in the WIG must match the track dimension. The span of the data must match the track resolution, with the exception of the very last datum on a chromosome.

Adjacent WIG intervals will automatically be merged into a single GTRACK interval, with different data encoded for each spanned position. Consider calling set_sparsity().

Parameters:
  • pos_strand_file (str) – Path to the .wig file for positive strand.

  • neg_strand_file (str) – Optional. Path to the .wig file for negative strand, if stranded.

set_data_from_bedgraph(pos_strand_file, neg_strand_file=None)[source]

Load all data from a .bedgraph or .bedgraph.gz file.

Loads all data from a BEDGRAPH file. A direct analogue of set_data_from_wig().

Set strandedness="strand_unaware" in __init__().

Parameters:
  • pos_strand_file (str) – Path to the .bedgraph file for positive strand.

  • neg_strand_file (str) – Optional. Path to the .bedgraph file for negative strand, if stranded.

set_data_from_bed(bedfile, categories=None)[source]

Load all data from a .bed or .bed.gz file.

Loads each interval from a BED file and incorporates it into the track.

If categories is not specified, then the value contained in the score column of the BED file will be stored in the corresponding interval.

If categories is specified, then the name column of the BED file determines the value stored on each interval. The interval’s name will be used to find an index in categories, and that index will be stored.

The strandedness of the track must match the BED file (i.e. an unstranded track expects the strand column to contain '.'). The track dimension must be 1. The start and end of each interval must match the track resolution, with the exception of the very last position of a chromosome.

Adjacent BED intervals will automatically be merged into a single contiguous GTRACK interval. In a GTRACK file, the data value of an interval is repeated for the size of the interval, unlike in a BED file where it’s specified only once per interval. This can result in large GTRACK files, especially when at 1bp resolution (the default). Consider using set_default_value() and set_sparsity() to manage file size.

Parameters:
  • bedfile (str) – Path to the .bed file.

  • categories (list of str) – Optional. List of categories to match with the name column in the BED file.

flush()[source]

Flush the currently accumulated data to disk.

This function can free up memory while building a track, but should only be called immediately after an entire chromosome has had its data set. Once this is called all chromosomes that currently have any data on them become fixed.

finalize()[source]

Finish writing the file and close it.

This function must be called last, to complete creation of the GTRACK file.

property dim

The dimensionality of the track.

Returns:

The number of columns in each returned array.

Return type:

int

property resolution

The resolution of the track.

Returns:

The number of genomic positions spanned by each datum.

Return type:

int

property stranded

The strandedness of the track.

Returns:

Whether the negative strand stores data unique from the positive strand.

Return type:

bool

property etype

The encoding type of the track.

Returns:

A string identifying the encoding type.

Return type:

str

property dtype

The default dtype for the track.

Returns:

The numpy dtype that this track will decode to by default.

Return type:

type

property reference_genome

The name of the reference genome the track will be mapped to.

Returns:

The name of the reference genome, e.g. “hg38”.

Return type:

str

property filename

The path to .gtrack file being written to.

Returns:

The path to the file, e.g. “/data/phastcons.gtrack”

Return type:

str

property data_size

The number of positions spanned by data in the GTRACK file.

If sparsity is disabled, the data size is the number of positions spanned by data via set_data().

Enabling sparsity may decrease the data size.

Using resolution > 1bp will also decrease the data size, since data is encoded into a coarse coordinate system, not at full genomic resolution.

The number of bits that the data occupies in the GTRACK file is approximately data_size * dim * bits_per_datum, but the exact computation is non-trivial, so portion of GTRACK file dedicated to data may be higher or lower than this rough estimate.

Returns:

The number of data spanned by encoded data.

Return type:

int

property index_size

The number of separate intervals in the GTRACK index.

If sparsity is disabled, the index size is the number of times set_data() was called.

Enabling sparsity may increase the index size.

Returns:

The number of unique intervals in the index.

Return type:

int

__repr__()[source]

Return repr(self).

Annotations

GenomeAnnotation

class GenomeAnnotation[source]

Annotations for a reference genome.

This object should not be directly created. Instead, create a Genome object:

genome = Genome("gencode.v19")
for gene in genome.annotation.genes:
    ...

The Genome object also exposes shortcut attributes, for convenience:

for gene in genome.genes:
    ...

See Genome for a list of annotations available, e.g. gencode.v29, ucsc_refseq.2017-06-25, etc.

Note

UCSC RefSeq can contain multiple versions of the same gene/transcript. These tend to be in ambiguously mapped regions. In GenomeKit, they’re retained as separate entries in the gene and transcript tables, with distinct intervals but otherwise identical.

For example, in GENCODE v19 the SMN1/2 genes are located on chromosome 5 at 70220767-70249769 and 69345349-69374349 respectively. In UCSC RefSeq, SMN1 is duplicated at both 70220767-70248838 and 69345349-69373418, so there are two gene entries named SMN1, each with their own version of transcripts NM_000344, NM_001297715, NM_022874. Similarly, SMN2 is duplicated at both 70220767-70248842 and 69345349-69373422.

property genes

An indexed table of annotated genes.

Returns:

An object that supports looping or queries over annotated genes.

Return type:

GeneTable

property transcripts

An indexed table of annotated transcripts.

A shorthand property trans is also available.

Returns:

An object that supports looping or queries over annotated transcripts.

Return type:

TranscriptTable

property exons

An indexed table of annotated exons.

Returns:

An object that supports looping or queries over annotated exons.

Return type:

ExonTable

property introns

An indexed table of annotated introns.

Returns:

An object that supports looping or queries over annotated introns.

Return type:

IntronTable

property cdss

An indexed table of annotated coding sequences (CDSs) within exons.

This is shorthand for annotation.cdss.

Returns:

An object that supports looping or queries over annotated coding sequences.

Return type:

CdsTable

property utr5s

An indexed table of UTR5s within exons.

This is shorthand for annotation.utr5s.

Returns:

An object that supports looping or queries over annotated coding sequences.

Return type:

UtrTable

property utr3s

An indexed table of UTR3s within exons.

This is shorthand for annotation.utr3s.

Returns:

An object that supports looping or queries over annotated coding sequences.

Return type:

UtrTable

property filename

The path to the file from which annotations are retrieved.

Returns:

The path to the file, e.g. “/data/gencode.v19.annotation.dganno”

Return type:

str

static binary_version()[source]

The binary file format version that this build supports.

Returns:

The version number.

Return type:

int

static build_gencode(infile, outfile, reference_genome)[source]

Build a binary version of GENCODE/Ensembl annotations from a GFF3 file.

Currently supports conversion from gff3[.gz].

In addition to the versioned dganno file, a cfg file will also be created.

Parameters:
  • infile (str) – The path to the original GFF3 file.

  • outfile (str) – The path to the destination dganno file.

  • reference_genome (Genome) – The reference assembly used to annotate infile.

Returns:

A list of built files created from the GFF3.

Return type:

list of str

static build_ncbi_refseq(infile, outfile, reference_genome)[source]

Build a binary version of NCBI annotations from a GFF3 file.

Currently supports conversion from gff3[.gz].

In addition to the versioned dganno file, a cfg file will also be created.

Parameters:
  • infile (str) – The path to the original GFF3 file.

  • outfile (str) – The path to the destination dganno file.

  • reference_genome (Genome) – The reference assembly used to annotate infile.

Returns:

A list of built files created from the GFF3.

Return type:

list of str

static build_ucsc_refseq(ucsc_db_dir, outfile, reference_genome)[source]

Build a binary version of UCSC RefSeq annotations.

The ucsc_db_dir should be a directory that already contains the following files:

  • refGene.txt[.gz]

  • refLink.txt[.gz]

These files will be parsed and used to build a dganno file.

In addition to the versioned dganno file, a cfg file will also be created.

Parameters:
  • ucsc_db_dir (str) – The path to a directory with the UCSC RefSeq source files.

  • outfile (str) – The path to the destination dganno file.

  • reference_genome (Genome) – The reference assembly used ucsc_db_dir.

Returns:

A list of built files created from the UCSC RefSeq files.

Return type:

list of str

__repr__()[source]

Return repr(self).

Gene

class Gene[source]

An annotated gene.

Bases: Interval

__init__()[source]
property interval

The interval spanned by this gene.

Note that Gene also inherits from Interval.

Returns:

The interval spanned by this gene.

Return type:

Interval

property id

The ID of this gene.

Returns either an Ensembl ID (for GENCODE) or a Entrez ID (for RefSeq) depending on which annotation is being represented.

Returns:

The ID of this transcript, e.g. “ENSG00000115275.11” or “7841”

Return type:

str

property name

The name of this gene.

Returns:

The name of this gene, e.g. “BRCA1”

Return type:

str

property type

The type of this gene.

This represents the gene_type field in most GFF3 annotations.

For GENCODE/Ensembl/NCBI RefSeq, the result can be any standard GENCODE biotype

For UCSC RefSeq, the result is “protein_coding” if any transcript is protein coding, and otherwise “non_coding”.

Returns:

The type of this gene, e.g. “protein_coding” or “non_coding”

Return type:

str

property level

The evidence level of this gene.

This represents the level field in most GFF3 annotations, either 1, 2, or 3.

Available for gencode and ensembl only.

Returns:

The evidence level of this gene.

Return type:

int | None

property transcripts

The transcripts for this gene.

A shorthand property trans is also available.

Returns:

A copy of the list of annotated transcripts for this gene.

Return type:

list of Transcript

__getstate__() bytes[source]

Helper for pickle.

__repr__()[source]

Return repr(self).

GeneTable

class GeneTable[source]
find_5p_aligned(interval)[source]

Returns all genes that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[gene for gene in genes if gene.end5 == interval.end5]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All genes that have 5’ end aligned to the 5’ end of interval.

Return type:

list of Gene

find_3p_aligned(interval)[source]

Returns all genes that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[gene for gene in genes if gene.end3 == interval.end3]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All genes that have 3’ end aligned to the 3’ end of interval.

Return type:

list of Gene

find_5p_within(interval)[source]

Returns all genes that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[gene for gene in genes if gene.end5.expand(0, 1) in interval]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All genes that have 5’-most position falling entirely within interval.

Return type:

list of Gene

find_3p_within(interval)[source]

Returns all genes that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[gene for gene in genes if gene.end3.expand(1, 0) in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All genes that have 3’-most position falling entirely within interval.

Return type:

list of Gene

find_within(interval)[source]

Returns all genes that fall entirely within interval.

This function is a faster version of the brute-force filter:

[gene for gene in genes if gene in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All genes that fall entirely within interval.

Return type:

list of Gene

find_overlapping(interval)[source]

Returns all genes that overlap interval.

This function is a faster version of the brute-force filter:

[gene for gene in genes if gene.overlaps(interval)]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All genes that overlap interval.

Return type:

list of Gene

find_exact(interval)[source]

Returns all genes that span exactly interval.

This function is a faster version of the brute-force filter:

[gene for gene in genes if gene.interval == interval]

The results will also be sorted by 5’ and then 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All genes that span exactly interval.

Return type:

list of Gene

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Lookup a gene by index or ID string.

If index is a string the gene with matching id is returned (by linear search):

gene = genes['ENSG00000115275']  # MOGS gene

If the ID string has a version suffix ('ENSG00000115275.11') then the match must be exact.

One can also iterate over all genes:

for gene in genes:
    ...

The results will be ordered as in the source file.

Parameters:

index (int | str) – The integer index or ID string of the requested gene.

Returns:

The gene object identified by the given index.

Return type:

Gene

first_by_name(name: str) Gene | None[source]
Returns:

The first gene with the specified name or None if none matched.

Return type:

Gene

__repr__()[source]

Return repr(self).

Transcript

class Transcript[source]

An annotated transcript.

Bases: Interval

__init__()[source]
property interval

The interval spanned by this transcript.

Note that Transcript also inherits from Interval.

Returns:

The interval spanned by this transcript.

Return type:

Interval

property id

The ID of this transcript.

Returns either an Ensembl ID or a RefSeq ID depending on which annotation is being represented.

Returns:

The ID of this transcript, e.g. “ENST00000233616.8” or “NM_006302”

Return type:

str

property tsl

The support level of this transcript.

  • tsl1: all splice junctions supported by at least one non-suspect mRNA

  • tsl2: best supporting mRNA is flagged as suspect or support from multiple ESTs

  • tsl3: only support is from a single EST

  • tsl4: best supporting EST is flagged as suspect

  • tsl5: no single transcript supports the model structure

See Ensemble transcript support level.

Available for gencode only.

Returns:

The support level of this transcript, where values 1..5 and 6 correspond to GENCODE identifiers [tsl1, tsl2, tsl3, tsl4, tsl5, tslNA]

Return type:

int | None

property type

The type of this transcript.

This represents the transcript_type field in most GFF3 annotations, which may differ from the parent gene’s type.

For GENCODE/Ensembl/NCBI RefSeq, the result can be any standard GENCODE biotype

For UCSC RefSeq, the result is either “protein_coding” or “non_coding”.

Returns:

The type of this transcript, e.g. “protein_coding” or “non_coding”

Return type:

str

property level

The evidence level of this transcript.

This represents the level field in most GFF3 annotations, either 1, 2, or 3.

Available for gencode and ensembl only.

Returns:

The evidence level of this transcript.

Return type:

int | None

property ccds_id

The CCDSID of this transcript’s coding sequence, if applicable.

Available for gencode only.

Returns:

The CCDSID of this transcript, e.g. “CCDS53759.1”

Return type:

str | None

property protein_id

The protein ID of this transcript, if applicable.

Available for gencode, ensembl, ucsc_refseq, and ncbi_refseq.

Returns:

The protein ID of this transcript, e.g. “ENSP00000233616.4” or “NP_006293”

Return type:

str | None

property product

A description of the transcript’s product.

Available for ucsc_refseq and ncbi_refseq only.

Returns:

The product of this transcript, e.g. “breast cancer type 1 susceptibility protein isoform 2”

Return type:

str | None

property gene

The parent gene of this transcript.

Returns:

The parent gene for this transcript.

Return type:

Gene

property exons

The exons for this transcript.

Returns:

A copy of the list of annotated exons for this transcript.

Return type:

list of Exon

property introns

The introns for this transcript.

Returns:

A copy of the list of annotated introns for this transcript.

Return type:

list of Intron

property cdss

The coding sequences (CDSs) for this transcript.

Returns:

A copy of the list of annotated coding sequences for this transcript.

Return type:

list of Cds

property utr5s

The UTR5 segments for this transcript.

Returns:

A copy of the list of annotated UTR5 segments for this transcript.

Return type:

list of Utr

property utr3s

The UTR3 segments for this transcript.

Returns:

A copy of the list of annotated UTR3 segments for this transcript.

Return type:

list of Utr

__getstate__() bytes[source]

Helper for pickle.

__repr__()[source]

Return repr(self).

TranscriptTable

class TranscriptTable[source]
find_5p_aligned(interval)[source]

Returns all transcripts that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[tran for tran in transcripts if tran.end5 == interval.end5]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All transcripts that have 5’ end aligned to the 5’ end of interval.

Return type:

list of Transcript

find_3p_aligned(interval)[source]

Returns all transcripts that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[tran for tran in transcripts if tran.end3 == interval.end3]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All transcripts that have 3’ end aligned to the 3’ end of interval.

Return type:

list of Transcript

find_5p_within(interval)[source]

Returns all transcripts that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[tran for tran in transcripts if tran.end5.expand(0, 1) in interval]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All transcripts that have 5’-most position falling entirely within interval.

Return type:

list of Transcript

find_3p_within(interval)[source]

Returns all transcripts that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[tran for tran in transcripts if tran.end3.expand(1, 0) in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All transcripts that have 3’-most position falling entirely within interval.

Return type:

list of Transcript

find_within(interval)[source]

Returns all transcripts that fall entirely within interval.

This function is a faster version of the brute-force filter:

[tran for tran in transcripts if tran in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval

Returns:

All transcripts that fall entirely within interval.

Return type:

list of Transcript

find_overlapping(interval)[source]

Returns all transcripts that overlap interval.

This function is a faster version of the brute-force filter:

[tran for tran in transcripts if tran.overlaps(interval)]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All transcripts that overlap interval.

Return type:

list of Transcript

find_exact(interval)[source]

Returns all transcripts that span exactly interval.

This function is a faster version of the brute-force filter:

[tran for tran in transcripts if tran.interval == interval]

The results will also be sorted by 5’ and then 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All transcripts that span exactly interval.

Return type:

list of Transcript

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Lookup a transcript by index or ID string.

If index is a string the transcript with matching id is returned (by linear search):

tran = transcripts['ENST00000233616']  # MOGS transcript

If the ID string has a version suffix ('ENST00000233616.8') then the match must be exact.

One can also iterate over all transcripts:

for tran in transcripts:
    ...

The results will be ordered as in the source file.

Parameters:

index (int | str) – The integer index or ID string of the requested transcript.

Returns:

The transcript object identified by the given index.

Return type:

Transcript

__repr__()[source]

Return repr(self).

Exon

class Exon[source]

An annotated exon.

Bases: Interval

Each Exon object is equal only to itself. To build a set of unique exonic intervals, use the interval attribute:

>>> from genome_kit import Genome
>>> genome = Genome('gencode.v19')
>>> exonic_intervals = set([exon.interval for exon in genome.exons])
__init__()[source]
property interval

The interval spanned by this exon.

Note that Exon also inherits from Interval.

Returns:

The interval spanned by this exon.

Return type:

Interval

property id

The ID of this exon.

Available for gencode and ensembl only.

Returns:

The ID of this exon, e.g. “ENSE00000846737.3”

Return type:

str | None

property index

The index of this exon within the transcript.

Returns:

The index of the exon number within the parent transcript (0-based).

Return type:

int

property transcript

The parent transcript of this exon.

A shorthand property tran is also available.

Returns:

The parent transcript of this exon.

Return type:

Transcript

property cds

The CDS on this exon, if any.

Returns:

The CDS contained within this exon, if any.

Return type:

Cds | None

property utr5

The UTR5 on this exon, if any.

Returns:

The UTR5 contained within this exon, if any.

Return type:

Utr5 | None

property utr3

The UTR3 on this exon, if any.

Returns:

The UTR3 contained within this exon, if any.

Return type:

Utr3 | None

property next_exon

The next exon in the transcript, if any.

Returns:

The next annotated exon on the parent transcript, if any.

Return type:

Exon | None

property prev_exon

The previous exon in the transcript, if any.

Returns:

The previous annotated exon on the parent transcript, if any.

Return type:

Exon | None

property prev_intron

Returns the previous intron in the transcript, if any.

Returns:

The previous annotated intron on the parent transcript, if any.

Return type:

Intron | None

property next_intron

Returns the next intron in the transcript, if any.

Returns:

The next annotated intron on the parent transcript, if any.

Return type:

Intron | None

__getstate__() bytes[source]

Helper for pickle.

__repr__()[source]

Return repr(self).

ExonTable

class ExonTable[source]
find_5p_aligned(interval)[source]

Returns all exons that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[exon for exon in exons if exon.end5 == interval.end5]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All exons that have 5’ end aligned to the 5’ end of interval.

Return type:

list of Exon

find_3p_aligned(interval)[source]

Returns all exons that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[exon for exon in exons if exon.end3 == interval.end3]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All exons that have 3’ end aligned to the 3’ end of interval.

Return type:

list of Exon

find_5p_within(interval)[source]

Returns all exons that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[exon for exon in exons if exon.end5.expand(0, 1) in interval]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All exons that have 5’-most position falling entirely within interval.

Return type:

list of Exon

find_3p_within(interval)[source]

Returns all exons that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[exon for exon in exons if exon.end3.expand(1, 0) in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All exons that have 3’-most position falling entirely within interval.

Return type:

list of Exon

find_within(interval)[source]

Returns all exons that fall entirely within interval.

This function is a faster version of the brute-force filter:

[exon for exon in exons if exon in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval

Returns:

All exons that fall entirely within interval.

Return type:

list of Exon

find_overlapping(interval)[source]

Returns all exons that overlap interval.

This function is a faster version of the brute-force filter:

[exon for exon in exons if exon.overlaps(interval)]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All exons that overlap interval.

Return type:

list of Exon

find_exact(interval)[source]

Returns all exons that span exactly interval.

This function is a faster version of the brute-force filter:

[exon for exon in exons if exon.interval == interval]

The results will also be sorted by 5’ and then 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All exons that span exactly interval.

Return type:

list of Exon

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Lookup an exon by index.

Allows iteration over all exons:

for exon in exons:
    ...

The results will be ordered as in the source file.

Parameters:

index (int) – The integer index of the requested exon.

Returns:

The exon object identified by the given index.

Return type:

Exon

__repr__()[source]

Return repr(self).

Intron

class Intron[source]

An annotated intron.

Bases: Interval

Each Intron object is equal only to itself. To build a set of unique intronic intervals, use the interval attribute:

>>> from genome_kit import Genome
>>> genome = Genome('gencode.v19')
>>> intronic_intervals = set([intron.interval for intron in genome.introns])
__init__()[source]
property interval

The interval spanned by this intron.

Note that Intron also inherits from Interval.

Returns:

The interval spanned by this intron.

Return type:

Interval

property index

The index of the intron within the transcript.

Returns:

The index of the intron within the parent transcript (0-based).

Return type:

int

property transcript

The parent transcript of this intron.

A shorthand property tran is also available.

Returns:

The parent transcript of this intron.

Return type:

Transcript

property next_intron

The next intron in the transcript, if any.

Returns:

The next annotated intron on the parent transcript, if any.

Return type:

Intron | None

property prev_intron

The previous intron in the transcript, if any.

Returns:

The previous annotated intron on the parent transcript, if any.

Return type:

Intron | None

property prev_exon

Returns the previous exon in the transcript.

Returns:

The previous annotated exon on the parent transcript.

Return type:

Exon

property next_exon

Returns the next exon in the transcript.

Returns:

The next annotated exon on the parent transcript.

Return type:

Exon

__getstate__() bytes[source]

Helper for pickle.

__repr__()[source]

Return repr(self).

IntronTable

class IntronTable[source]
find_5p_aligned(interval)[source]

Returns all introns that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[intr for intr in introns if intr.end5 == interval.end5]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All introns that have 5’ end aligned to the 5’ end of interval.

Return type:

list of Intron

find_3p_aligned(interval)[source]

Returns all introns that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[intr for intr in introns if intr.end3 == interval.end3]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All introns that have 3’ end aligned to the 3’ end of interval.

Return type:

list of Intron

find_5p_within(interval)[source]

Returns all introns that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[intr for intr in introns if intr.end5.expand(0, 1) in interval]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All introns that have 5’-most position falling entirely within interval.

Return type:

list of Intron

find_3p_within(interval)[source]

Returns all introns that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[intr for intr in introns if intr.end3.expand(1, 0) in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All introns that have 3’-most position falling entirely within interval.

Return type:

list of Intron

find_within(interval)[source]

Returns all introns that fall entirely within interval.

This function is a faster version of the brute-force filter:

[intr for intr in introns if intr in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval

Returns:

All introns that fall entirely within interval.

Return type:

list of Intron

find_overlapping(interval)[source]

Returns all introns that overlap interval.

This function is a faster version of the brute-force filter:

[intr for intr in introns if intr.overlaps(interval)]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All introns that overlap interval.

Return type:

list of Intron

find_exact(interval)[source]

Returns all introns that span exactly interval.

This function is a faster version of the brute-force filter:

[intr for intr in introns if intr.interval == interval]

The results will also be sorted by 5’ and then 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All introns that span exactly interval.

Return type:

list of Intron

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Lookup an intron by index.

Allows iteration over all introns:

for intron in introns:
    ...

The results will be ordered as in the source file.

Parameters:

index (int) – The integer index of the requested intron.

Returns:

The intron object identified by the given index.

Return type:

Intron

__repr__()[source]

Return repr(self).

Cds

class Cds[source]

An annotated coding sequence (CDS) element within an exon.

Bases: Interval

__init__()[source]
property interval

The interval spanned by this CDS.

Note that Cds also inherits from Interval.

Returns:

The interval spanned by this CDS.

Return type:

Interval

property phase

The phase of this CDS’s start position.

The phase is the number of bases that should be trimmed from the 5’ end of the CDS element to reach the start of next codon:

  • phase 0 means the 5p end is in-frame

  • phase 1 means the 5p end is 1 bp away from being in-frame.

  • phase 2 means the 5p end is 2 bp away from being in-frame.

See the GFF3 documentation <http://gmod.org/wiki/GFF3> for a precise explanation.

The frame attribute in UCSC RefSeq is not available.

Returns:

The phase this CDS, either 0, 1, or 2.

Return type:

int

property exon

The parent exon of this CDS.

Returns:

The parent exon of this CDS.

Return type:

Exon

property transcript

The parent transcript of this CDS.

A shorthand property tran is also available.

Returns:

The parent transcript of this CDS.

Return type:

Transcript

property next_cds

The next CDS in the transcript, if any.

Returns:

The next annotated CDS on the parent transcript, if any.

Return type:

Cds | None

property prev_cds

The previous CDS in the transcript, if any.

Returns:

The previous annotated CDS on the parent transcript, if any.

Return type:

Cds | None

__getstate__() bytes[source]

Helper for pickle.

__repr__()[source]

Return repr(self).

CdsTable

class CdsTable[source]
find_5p_aligned(interval)[source]

Returns all CDSs that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[cds for cds in cdss if cds.end5 == interval.end5]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All coding sequences (CDSs) that have 5’ end aligned to the 5’ end of interval.

Return type:

list of Cds

find_3p_aligned(interval)[source]

Returns all CDSs that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[cds for cds in cdss if cds.end3 == interval.end3]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All coding sequences (CDSs) that have 3’ end aligned to the 3’ end of interval.

Return type:

list of Cds

find_5p_within(interval)[source]

Returns all CDSs that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[cds for cds in cdss if cds.end5.expand(0, 1) in interval]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All coding sequences (CDSs) that have 5’-most position falling entirely within interval.

Return type:

list of Cds

find_3p_within(interval)[source]

Returns all CDSs that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[cds for cds in cdss if cds.end3.expand(1, 0) in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All coding sequences (CDSs) that have 3’-most position falling entirely within interval.

Return type:

list of Cds

find_within(interval)[source]

Returns all CDSs that fall entirely within interval.

This function is a faster version of the brute-force filter:

[cds for cds in cdss if cds in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval

Returns:

All coding sequences (CDSs) that fall entirely within interval.

Return type:

list of Cds

find_overlapping(interval)[source]

Returns all CDSs that overlap interval.

This function is a faster version of the brute-force filter:

[cds for cds in cdss if cds.overlaps(interval)]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All coding sequences (CDSs) that overlap interval.

Return type:

list of Cds

find_exact(interval)[source]

Returns all CDSs that span exactly interval.

This function is a faster version of the brute-force filter:

[cds for cds in cdss if cds.interval == interval]

The results will also be sorted by 5’ and then 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All coding sequences (CDSs) that span exactly interval.

Return type:

list of Cds

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Lookup a coding sequence element by index.

Allows iteration over all coding sequences:

for cds in cdss:
    ...
Parameters:
  • index (int) – The integer index of the requested CDS.

  • file. (The results will be ordered as in the source)

Returns:

The CDS identified by the given index.

Return type:

Cds

__repr__()[source]

Return repr(self).

Read Alignments

ReadAlignments

class ReadAlignments(infile)[source]

Access to read alignments.

__init__(infile)[source]

Open a .ralign file.

Parameters:

infile (str) – The .ralign file to open.

Raises:

IOError – Occurs when infile refers to an incompatible ralign file for this version of genome_kit.

property junctions

Access to junctions inferred from read alignments

Returns:

A table with optimized methods to query over all junctions by position.

Return type:

JunctionTable

property alignments

Access to read alignments, i.e. a row in BAM file

Returns:

A table with optimized methods to query over all read alignments by position.

Return type:

AlignmentTable

property matches

Access to aligned regions of read alignments

Returns:

A table with optimized methods to query over all aligned regions by position.

Return type:

AlignmentMatchTable

property variants

Access to variants observed by read alignments

Returns:

A table where each element (each row) is a variant that was observed in at least one read alignment.

Return type:

VariantTable

property filename

The path to the file from which read alignments are retrieved.

Returns:

The path to the file, e.g. “/mnt/data/sample000132.ralign”

Return type:

str

close()[source]

Close the file handle.

Note that if you use a with statement the file handle is automatically closed:

# Open the file
with ReadAlignments('my_file.ralign') as raligns:
    ...

# <-- File is now closed
static ralign_version()[source]

The ralign file format version that this build supports.

Attempting to open an ralign file of a mismatched version will raise an IOError.

Returns:

The version number.

Return type:

int

static build_ralign(outfile, infiles, reference_genome, exclude=None, allow=None, include_duplicates=False, library_format='U')[source]

Build an ralign file from one or more SAM files.

If multiple input files are specified, their junctions and alignments will be pooled.

Once an ralign file is created, it can be opened by creating a ReadAlignments object.

Alignments on the reference genome. All junction, alignment, match, and variant intervals in the resulting file are on the reference genome.

For example, an alignment starting at reference genome position 2 and with CIGAR string 1M1D1M2N1M is matched to the reference genome as:

0 1 2 3 4 5 6 7 8 9     <-- reference genome coordinates
    M D M - - M M       <-- M = match, D = deletion

This results in one alignment (2:9), two matches (2:5, 7:9), one junction (5:7), and one variant (3:4).

Similarly for insertions, with CIGAR string 1M1I1M2N1M aligns as:

0 1 2 3 4 5 6 7 8 9
    M M - - M M
     ^
     I                  <-- I = insertion between

This results in one alignment (2:8), two matches (2:4, 6:8), one junction (4:6), and one variant (3:3).

Parameters:
  • outfile (str) – The path to the destination rdist file.

  • infiles (list of str | file) – The paths to the source SAM files, or a reference to sys.stdin. Streaming lines from stdin is useful for reading directly from BAM via samtools view -h. Be sure to use -h so that the reference genome can be inferred from the header lines.

  • reference_genome (Genome) – The reference genome of the data in infiles. If the genome contains annotations, junction strands will be inferred and validated against the introns annotations.

  • exclude (list of Interval, optional) – Junctions within these intervals will be excluded.

  • allow (list of Interval, optional) – Only junctions within these intervals will be included, so long as they are not excluded.

  • include_duplicates (bool, optional) – Includes duplicates when True. Defaults to False. See https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf subsection 1.4.2

  • library_format (str, optional) – Specifies to infer junction strands via the library format string. “SF” implies the first read is the sense strand, whereas “SR” implies the second read is the sense strand. (the default is ‘U’, which represents unstranded [no strand inference]). See https://salmon.readthedocs.io/en/stable/library_type.html.

pileup(interval, dtype=<class 'numpy.int32'>)[source]

Returns the DNA sequence pileup for body reads in this ReadAlignments as a 5-track. The pileup is variant-aware, so the track counts the alternate sequences, instead of the reference, on any alignment matches with variants.

Currently, only acgtACGT substitutions and deletions variants are supported.

Parameters:
  • interval (Interval) – A positive interval to query over all the read alignments.

  • dtype (dtype, optional) – The data type for the returned counts.

Returns:

An array of shape (len(interval), 5) where the second dimension is indexed via PILEUP_A, PILEUP_C, PILEUP_G, PILEUP_T, and PILEUP_DEL.

Return type:

ndarray

__repr__()[source]

Return repr(self).

Junction

class Junction[source]

A Junction represents a gap in a read alignment when compared to the reference genome.

Bases: Interval

__init__()[source]
property interval

The interval spanned by this junction.

Note that Junction also inherits from Interval.

Returns:

The interval spanned by this junction.

Return type:

Interval

property alignments

The read alignments across this junction.

Returns:

The read alignments across this junction.

Return type:

tuple of Alignment

__repr__()[source]

Return repr(self).

JunctionTable

class JunctionTable[source]
find_5p_aligned(interval)[source]

Returns all junctions that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end5 == interval.end5]
Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 5’ end aligned to the 5’ end of interval.

Return type:

list of Junction

find_3p_aligned(interval)[source]

Returns all junctions that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end3 == interval.end3]
Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 3’ end aligned to the 3’ end of interval.

Return type:

list of Junction

find_5p_within(interval)[source]

Returns all junctions that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end5.expand(0, 1) in interval]
Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 5’-most position falling entirely within interval.

Return type:

list of Junction

find_3p_within(interval)[source]

Returns all junctions that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end3.expand(1, 0) in interval]
Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 3’-most position falling entirely within interval.

Return type:

list of Junction

find_within(interval)[source]

Returns all junctions that fall entirely within interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc in interval]
Parameters:

interval (Interval) – The query interval

Returns:

All junction that fall entirely within interval.

Return type:

list of Junction

find_overlapping(interval)[source]

Returns all junctions that overlap interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.overlaps(interval)]
Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that overlap interval.

Return type:

list of Junction

find_exact(interval)[source]

Returns all junctions that span exactly interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.interval == interval]
Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that span exactly interval.

Return type:

list of Junction

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Lookup a junction by index

Allows iteration over all junctions:

for junc in junctions:
    ...
Parameters:

index (int) – The integer index of the requested junction.

Returns:

The junction object identified by the given index.

Return type:

Junction

__repr__()[source]

Return repr(self).

Alignment

class Alignment[source]

A read alignment

Bases: Interval

__init__()[source]
property interval

The interval spanned by this read alignment.

Note that Alignment also inherits from Interval.

Returns:

The interval spanned by this read alignment.

Return type:

Interval

property matches

The matching regions of this read alignment

Returns:

A list of matching regions

Return type:

list of AlignmentMatch

__repr__()[source]

Return repr(self).

AlignmentTable

class AlignmentTable[source]
find_5p_aligned(interval)[source]

Returns all read alignments that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[read for read in reads if read.end5 == interval.end5]
Parameters:

interval (Interval) – The query interval.

Returns:

All read alignments that have 5’ end aligned to the 5’ end of interval.

Return type:

list of Alignment

find_3p_aligned(interval)[source]

Returns all read alignments that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[read for read in reads if read.end3 == interval.end3]
Parameters:

interval (Interval) – The query interval.

Returns:

All read alignments that have 3’ end aligned to the 3’ end of interval.

Return type:

list of Alignment

find_5p_within(interval)[source]

Returns all read alignments that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[read for read in reads if read.end5.expand(0, 1) in interval]
Parameters:

interval (Interval) – The query interval.

Returns:

All read alignments that have 5’-most position falling entirely within interval.

Return type:

list of Alignment

find_3p_within(interval)[source]

Returns all read alignments that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[read for read in reads if read.end3.expand(1, 0) in interval]
Parameters:

interval (Interval) – The query interval.

Returns:

All read alignments that have 3’-most position falling entirely within interval.

Return type:

list of Alignment

find_within(interval)[source]

Returns all read alignments that fall entirely within interval.

This function is a faster version of the brute-force filter:

[read for read in reads if read in interval]
Parameters:

interval (Interval) – The query interval

Returns:

All read alignment that fall entirely within interval.

Return type:

list of Alignment

find_overlapping(interval)[source]

Returns all read alignments that overlap interval.

This function is a faster version of the brute-force filter:

[read for read in reads if read.overlaps(interval)]
Parameters:

interval (Interval) – The query interval.

Returns:

All read alignments that overlap interval.

Return type:

list of Alignment

find_exact(interval)[source]

Returns all read alignments that span exactly interval.

This function is a faster version of the brute-force filter:

[read for read in reads if read.interval == interval]
Parameters:

interval (Interval) – The query interval.

Returns:

All read alignments that span exactly interval.

Return type:

list of Alignment

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Lookup a read alignment by index

Allows iteration over all read alignments:

for read in reads:
    ...
Parameters:

index (int) – The integer index of the requested read alignment.

Returns:

The read alignment object identified by the given index.

Return type:

Alignment

__repr__()[source]

Return repr(self).

AlignmentMatch

class AlignmentMatch[source]

A matching region a read alignment

Bases: Interval

__init__()[source]
property interval

The interval spanned by the matching region of a read alignment.

Note that AlignmentMatch also inherits from Interval.

Returns:

The interval spanned by this alignment match.

Return type:

Interval

property variants

A list of variants observed in this matching region of an alignment.

Returns:

A tuple of variants that this matching region of an alignment contains

Return type:

tuple of Variant

__repr__()[source]

Return repr(self).

AlignmentMatchTable

class AlignmentMatchTable[source]
find_5p_aligned(interval)[source]

Returns all alignment matches that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[match for match in matches if match.end5 == interval.end5]
Parameters:

interval (Interval) – The query interval.

Returns:

All alignment matches that have 5’ end aligned to the 5’ end of interval.

Return type:

list of AlignmentMatch

find_3p_aligned(interval)[source]

Returns all alignment matches that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[match for match in matches if match.end3 == interval.end3]
Parameters:

interval (Interval) – The query interval.

Returns:

All alignment matches that have 3’ end aligned to the 3’ end of interval.

Return type:

list of AlignmentMatch

find_5p_within(interval)[source]

Returns all alignment matches that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[match for match in matches if match.end5.expand(0, 1) in interval]
Parameters:

interval (Interval) – The query interval.

Returns:

All alignment matches that have 5’-most position falling entirely within interval.

Return type:

list of AlignmentMatch

find_3p_within(interval)[source]

Returns all alignment matches that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[match for match in matches if match.end3.expand(1, 0) in interval]
Parameters:

interval (Interval) – The query interval.

Returns:

All alignments matches that have 3’-most position falling entirely within interval.

Return type:

list of AlignmentMatch

find_within(interval)[source]

Returns all alignment matches that fall entirely within interval.

This function is a faster version of the brute-force filter:

[match for match in matches if match in interval]
Parameters:

interval (Interval) – The query interval

Returns:

All alignment matches that fall entirely within interval.

Return type:

list of AlignmentMatch

find_overlapping(interval)[source]

Returns all alignment matches that overlap interval.

This function is a faster version of the brute-force filter:

[match for match in matches if match.overlaps(interval)]
Parameters:

interval (Interval) – The query interval.

Returns:

All alignment matches that overlap interval.

Return type:

list of AlignmentMatch

find_exact(interval)[source]

Returns all alignment matches that span exactly interval.

This function is a faster version of the brute-force filter:

[match for match in matches if match.interval == interval]
Parameters:

interval (Interval) – The query interval.

Returns:

All alignment matches that span exactly interval.

Return type:

list of AlignmentMatch

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Lookup an alignment match by index

Allows iteration over all alignment matches:

for match in matches:
    ...
Parameters:

index (int) – The integer index of the requested alignment match.

Returns:

The alignment match object identified by the given index.

Return type:

AlignmentMatch

__repr__()[source]

Return repr(self).

Junction Read Alignments

JReadAlignments

class JReadAlignments(infile)[source]

Access to read alignments.

__init__(infile)[source]

Open a .jralign file.

Parameters:

infile (str) – The .jralign file to open.

property junctions

Access to read alignments that are split across junctions.

Returns:

An object that supports looping or queries over junctions, where each junction provides a read distribution.

Return type:

JunctionReadAlignmentsTable

Notes

The returned table. junctions, and child objects should not be accessed after this JReadAlignments has been closed via close().

property variants

Access to variants observed over all read alignments

Returns:

A table where each element (each row) is a variant that was observed in at least one read alignment.

Return type:

VariantTable

Notes

The returned table and variants should not be accessed after this JReadAlignments has been closed via close().

property filename

The path to the file from which read alignments are retrieved.

Returns:

The path to the file, e.g. “/mnt/data/sample000132.jralign”

Return type:

str

close()[source]

Close the file handle.

Note that if you use a with statement the file handle is automatically closed:

# Open the file
with JReadAlignments('my_file.jralign') as jraligns:
    ...

# <-- File is now closed

Notes

Closing this JReadAlignments will invalidate all tables and objects previously accessed via junctions() and variants().

static jralign_version()[source]

The jralign file format version that this build supports.

Attempting to open an jralign file of a mismatched version will raise an IOError.

Returns:

The version number.

Return type:

int

static build_jralign(outfile, infiles, reference_genome, min_reads=0, min_overhang=0, exclude=None, allow=None, include_variants=False, include_duplicates=False, library_format='U', overhang_error='error')[source]

Build an jralign file from one or more SAM files.

If multiple input files are specified, their junctions and read alignments will be pooled.

Once an jralign file is created, it can be opened by creating a JReadAlignments object.

Below are some details on how SAM alignments are processed, which are important for understanding how a CIGAR strings containing insertions/deletions (5M8D5M) or multiple splits (5M100N10M50N5M) are processed:

  • Multi-split alignments. If a read maps across multiple junctions, its CIGAR string will contain multiple N blocks. The read will then contribute a separate (left, right) alignment to each junction that it spans.

    For example, consider 1S2M2N1M1N1M where the first match is aligned to start position 2:

    0 1 2 3 4 5 6 7 8 9
      S M M - - M - M
    

    This read will contribute a separate alignment to each junction. The left and right overhang is the sum of all alignment matches before/after that particular junction. In the above example, the resulting alignments would be be as if two separate reads were aligned as follows:

    0 1 2 3 4 5 6 7 8 9
      S M M - - M M        <-- alignment for 1st junction
          S M M M - M      <-- alignment for 2nd junction
    

    In this scenario, the read would not actually match the reference genome in the above alignments, but this scheme facilitates counting of relative junction usage (assuming usage of the two junctions is independent) and also is allows positional bootstrap to be applied correctly without modification.

  • Overhang filter. The min_overhang filter is applied on a per-junction bases. For example, if min_overhang=5 then a read mapped across two junctions 20M100N10M100N3M will contribute an alignment only to the first, because the second has a right overhang of 3.

  • Counting insertions/deletions. Only alignment matches are counted. Other CIGAR types (e.g., insertions, deletions, clippings) are not relevant when filtering for alignment quality.

    For example, a read starting at reference genome position 2 and with CIGAR string 1M1D1M2N1M is matched to the reference genome as:

    0 1 2 3 4 5 6 7 8 9     <-- reference genome coordinates
        M D M - - M         <-- M = match, D = deletion
    

    so the start of the junction is 2+1+1+1 = 5, but the the left overhang is only 2, because the deleted base doesn’t exist in the private genome.

    Similarly for insertions, with CIGAR string 1M1I1M2N1M:

       0 1 2 3 4 5 6 7 8 9
           M M - - M
            ^
            I                <-- I = insertion between
    
    the start of the junction is 2+1+1 = 4, but the left
    overhang is actually 2, because the inserted base exists
    only in the private genome.
    
Parameters:
  • outfile (str) – The path to the destination rdist file.

  • infiles (list of str | file) – The paths to the source SAM files, or a reference to sys.stdin. Streaming lines from stdin is useful for reading directly from BAM via samtools view -h. Be sure to use -h so that the reference genome can be inferred from the header lines.

  • reference_genome (Genome) – The reference genome of the data in infiles. If the genome contains annotations, junction strands will be inferred and validated against the introns annotations.

  • min_reads (int) – Optional. The minimum number of reads across a junction to be considered for inclusion in the output file.

  • min_overhang (int) – Optional. The minimum overhang for a read to be included.

  • exclude (list of Interval) – Optional. Junctions within these intervals will be excluded.

  • allow (list of Interval) – Optional. Only junctions within these intervals will be included, so long as they are not excluded.

  • include_variants (:py:class`bool`) – Optional. True if variants are to be included in the output file. Default will exclude all variants.

  • include_duplicates (bool, optional) – Includes duplicates when True. Defaults to False. See https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf subsection 1.4.2

  • library_format (str, optional) – Specifies to infer junction strands via the library format string. “SF” implies the first read is the sense strand, whereas “SR” implies the second read is the sense strand. (the default is ‘U’, which represents unstranded [no strand inference]). See https://salmon.readthedocs.io/en/stable/library_type.html.

  • overhang_error (: str) – Determines how build_jralign handles reads with >255 length overhangs. “error” stops and reports the error. “clamp” limits the read to the maximum 255 positions on each side. This only affects overhangs and will not clip or filter variants.

__repr__()[source]

Return repr(self).

JunctionReadAlignmentsTable

class JunctionReadAlignmentsTable[source]
find_5p_aligned(interval)[source]

Returns all junctions that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end5 == interval.end5]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 5’ end aligned to the 5’ end of interval.

Return type:

list of JunctionReadAlignments

find_3p_aligned(interval)[source]

Returns all junctions that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end3 == interval.end3]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 3’ end aligned to the 3’ end of interval.

Return type:

list of JunctionReadAlignments

find_5p_within(interval)[source]

Returns all junctions that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end5.expand(0, 1) in interval]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 5’-most position falling entirely within interval.

Return type:

list of JunctionReadAlignments

find_3p_within(interval)[source]

Returns all junctions that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end3.expand(1, 0) in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 3’-most position falling entirely within interval.

Return type:

list of JunctionReadAlignments

find_within(interval)[source]

Returns all junctions that fall entirely within interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that fall entirely within interval.

Return type:

list of JunctionReadAlignments

find_overlapping(interval)[source]

Returns all junctions that overlap interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.overlaps(interval)]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that overlap interval.

Return type:

list of JunctionReadAlignments

find_exact(interval)[source]

Returns all junctions that span exactly interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.interval == interval]

The results will also be sorted by 5’ and then 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that span exactly interval.

Return type:

list of JunctionReadAlignments

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Access to all junctions.

Allows iteration over all junctions in the read distribution table:

for junc in junctions:
    ...

The results will be ordered as in the source file.

Parameters:

index (int) – The integer index of the requested junction.

Returns:

The junction read alignment object identified by the given index.

Return type:

JunctionReadAlignments

__repr__()[source]

Return repr(self).

JunctionReadAlignments

class JunctionReadAlignments[source]

A set of read alignments across a junction.

Bases: Interval

__init__()[source]
property interval

The interval spanned by this junction.

Note: All junction intervals are on the reference strand, even if they comprise reads that were mapped to the reverse strand. When querying junctions, your query intervals must therefore (currently) be on the positive strand.

Note that JunctionReadAlignments also inherits from Interval.

Returns:

The interval spanned by this junction.

Return type:

Interval

property num_reads

The number of reads across this junction.

Returns:

The number of reads across this junction, equivalent to the len of this object.

Return type:

int

__getitem__(index)[source]

Access to a junction read alignment.

Allows iteration over all read alignments across this junction:

for align in junction:
    ...
Parameters:

index (int) – The integer index of the requested alignment entry.

Returns:

The junction read alignment identified by the given index.

Return type:

JunctionReadAlignment

__repr__()[source]

Return repr(self).

JunctionReadAlignment

class JunctionReadAlignment[source]

A junction read alignment.

The left and right overhangs are positive integers representing the number of matched positions on either side of the junction.

Overhangs are counted on the private genome from which the data was sequenced, and may not map to reference genome coordinates. As such, one should not expect meaningful reference genome coordinates by adding overhangs to the junction start/end.

If strand is positive, then the read started left positions before the first intronic position on the private genome. If strand is negative, then the read started right positions after the last intronic position on the private genome.

property left

The left overhang relative to the junction.

Returns:

The number of matched positions left of the junction.

Return type:

int

property right

The right overhang relative to the junction.

Returns:

The number of matched positions right of the junction.

Return type:

int

property strand

The strand (“+” or “-”) that this read was aligned to.

Not to be confused with the strand of the junction itself, which may or may not have been inferred.

Returns:

The strand the read was aligned to by the original aligner.

Return type:

chr

property num_variants

The number of variants observed in this read alignment.

Returns:

The number of variants in the alignment.

Return type:

int

variants()[source]

A list of variants observed in this read alignment.

Returns:

A tuple of variants that this read contains

Return type:

tuple of Variant

__repr__()[source]

Return repr(self).

ReadDistributions

class ReadDistributions(infile)[source]

Access to read distributions.

__init__(infile)[source]

Open a .rdist file.

Parameters:

infile (str) – The .rdist file to open.

property junctions

Access to read distributions that are split across junctions.

Returns:

An object that supports looping or queries over junctions, where each junction provides a read distribution.

Return type:

JunctionReadDistributionTable

property filename

The path to the file from which read distributions are retrieved.

Returns:

The path to the file, e.g. “/mnt/data/sample000132.rdist”

Return type:

str

close()[source]

Close the file handle.

Note that if you use a with statement the file handle is automatically closed:

# Open the file
with ReadDistributions('my_file.ralign') as raligns:
    ...

# <-- File is now closed
static rdist_version()[source]

The rdist file format version that this build supports.

Attempting to open an rdist file of a mismatched version will raise an IOError.

Returns:

The version number.

Return type:

int

static build_rdist(outfile, infiles, min_reads=0, min_overhang=0, exclude=None, allow=None, pass_filter=None)[source]

Build an rdist file from one or more jralign files.

If multiple input files are specified, their read counts will be pooled.

Once an rdist file is created, it can be opened by creating a ReadDistributions object .

Parameters:
  • outfile (str) – The path to the destination rdist file.

  • infiles (list of str) – The paths to the source jralign files.

  • min_reads (int) – Optional. The minimum number of reads across a junction to be considered for inclusion in the output file.

  • min_overhang (int) – Optional. The minimum overhang for a read to be included in a count.

  • exclude (list of Interval) – Optional. Junctions within these intervals will be excluded.

  • allow (list of Interval) – Optional. Only junctions within these intervals will be included, so long as they are not excluded.

  • pass_filter (Callable[[str, int, int], bool], optional) –

    A pass filter to be applied while iterating over the (infile, junction_index, read_index). For example:

    # downsample to 10%
    pass_filter=lambda _, _, _: np.random.binomial(1, 0.1) != 0
    
    # only take reads that have variants
    jraligns = {x: JReadAlignments(x) for x in infiles}
    pass_filter=lambda infile, junction, read: jraligns[infile].junctions[junction][read].num_variants != 0
    ...
    for x in jraligns.values():
        x.close()
    

static from_jraligns(infiles, min_reads=0, min_overhang=0, exclude=[], allow=[], pass_filter=None, cache=True, outfile=None)[source]

Returns a ReadDistributions object opened from a set of jralign files.

This convenience function builds and then returns ReadDistributions in one step, by default the binary file (.rdist) is adjacent to the first input file.

It is intended to be used as follows:

rdist = ReadDistributions.from_jraligns(["input.jralign"])
junctions = rdist.find_within(brca1_interval)
...

If any of the filtering parameters or infiles contents change, cache should be set to False to regenerate the ReadDistribution.

See also

build_rdist()

Parameters:
  • infiles (list of str) – The paths to the source jralign files.

  • min_reads (int) – Optional. The minimum number of reads across a junction to be considered for inclusion in the output file.

  • min_overhang (int) – Optional. The minimum overhang for a read to be included in a count.

  • exclude (list of Interval) – Optional. Junctions within these intervals will be excluded.

  • allow (list of Interval) – Optional. Only junctions within these intervals will be included, so long as they are not excluded.

  • pass_filter (Callable[[str, int, int], bool], optional) –

    A pass filter to be applied while iterating over the (infile, junction_index, read_index). For example:

    # downsample to 10%
    pass_filter=lambda _, _, _: np.random.binomial(1, 0.1) != 0
    
    # only take reads that have variants
    jraligns = {x: JReadAlignments(x) for x in infiles}
    pass_filter=lambda infile, junction, read: jraligns[infile].junctions[junction][read].num_variants != 0
    ...
    for x in jraligns.values():
        x.close()
    

  • cache (bool, optional) – Whether to re-use the binary or to re-generate it from infiles.

  • outfile (str, optional) – The path to the destination rdist file.

Returns:

The opened rdist file.

Return type:

ReadDistributions

__repr__()[source]

Return repr(self).

JunctionReadDistributionTable

class JunctionReadDistributionTable[source]
find_5p_aligned(interval)[source]

Returns all junctions that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end5 == interval.end5]
Parameters:
  • interval (Interval) – The query interval.

  • 5'. (The results will also be sorted by)

Returns:

All junctions that have 5’ end aligned to the 5’ end of interval.

Return type:

list of JunctionReadDistribution

find_3p_aligned(interval)[source]

Returns all junctions that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end3 == interval.end3]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 3’ end aligned to the 3’ end of interval.

Return type:

list of JunctionReadDistribution

find_5p_within(interval)[source]

Returns all junctions that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end5.expand(0, 1) in interval]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 5’-most position falling entirely within interval.

Return type:

list of JunctionReadDistribution

find_3p_within(interval)[source]

Returns all junctions that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.end3.expand(1, 0) in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that have 3’-most position falling entirely within interval.

Return type:

list of JunctionReadDistribution

find_within(interval)[source]

Returns all junctions that fall entirely within interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that fall entirely within interval.

Return type:

list of JunctionReadDistribution

find_overlapping(interval)[source]

Returns all junctions that overlap interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.overlaps(interval)]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that overlap interval.

Return type:

list of JunctionReadDistribution

find_exact(interval)[source]

Returns all junctions that span exactly interval.

This function is a faster version of the brute-force filter:

[junc for junc in junctions if junc.interval == interval]

The results will also be sorted by 5’ and then 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All junctions that span exactly interval.

Return type:

list of JunctionReadDistribution

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Access to all junctions.

Allows iteration over all junctions in the read distribution table:

for junc in junctions:
    ...

The results will be ordered as in the source file.

Parameters:

index (int) – The integer index of the requested junction.

Returns:

The junction read distribution object identified by the given index.

Return type:

JunctionReadDistribution

__repr__()[source]

Return repr(self).

JunctionReadDistribution

class JunctionReadDistribution[source]

A read distribution across a junction. A junction read distribution comprises a list of non-zero counts across the junction, where the shift and strand of each count is preserved.

Each read mapping across the junction corresponds to two counts: one for the part mapping upstream of the junction and another for the part mapping downstream of it.

Bases: Interval

__init__()[source]
property interval

The interval spanned by this junction.

Note: All junction intervals are on the reference strand, even if they comprise reads that were mapped to the reverse strand. When querying junctions, your query intervals must therefore (currently) be on the positive strand.

Note that JunctionReadDistribution also inherits from Interval.

Returns:

The interval spanned by this junction.

Return type:

Interval

property num_counts

The number of counts on this junction.

Returns:

The number of unique non-zero counts stored on this junction, equivalent to the len of this object.

Return type:

int

property num_reads

The number of reads across this junction.

Returns:

The number of reads across this junction, equal to the summing all counts and dividing by two.

Return type:

int

__getitem__(index)[source]

Access to a junction read count.

Allows iteration over all read counts across this junction:

for count in junction:
    ...
Parameters:

index (int) – The integer index of the requested count entry.

Returns:

The junction read count identified by the given index.

Return type:

JunctionReadCount

__repr__()[source]

Return repr(self).

JunctionReadCount

class JunctionReadCount[source]

A junction read count at a specific shift from the junction.

property strand

The strand that this count refers to. :returns: The strand that this count refers to. :rtype: str

property shift

The number of aligned positions between the far end (start/ end) of the alignment and the junction.

If negative, the count refers to an alignment upstream of the junction interval (always defined on the positive strand). Does not count insertions, deletions, clippings, etc.

Returns:

The shift relative to the junction.

Return type:

int

property count

The number of reads at the given strand and shift from junction. :returns: The number of reads merged into the count. :rtype: int

__repr__()[source]

Return repr(self).

VCF Tables

VCFTable

class VCFTable(infile)[source]

Open a .vcfbin file for querying.

See the VCF section of Quickstart for examples of how to use this class, or see demos/query_vcf.py.

To convert from .vcf[.gz] see build_vcfbin().

Parameters:

infile (str) – The path to the .vcfbin file.

__init__(infile)[source]
static from_vcf(vcfpath: str, reference_genome: Genome, *args, normalize: bool | Literal['EBI', 'UCSC'] = False, cache: bool = True, vcfbinpath: str | None = None, **kwargs) VCFTable[source]

Returns a VCFTable object opened to a VCF file.

The function internally converts the VCF to a binary format and then returns a VCFTable object for fast access to the variants and fields within. By default the binary file (.vcf.bin) will be written adjacent to the input file.

It is intended to be used as follows:

def open_clinvar(vcfpath):
    return VCFTable.from_vcf(vcfpath, "hg19", info_ids=["AF_EXAC", "CLNSIG"])

vcf = open_clinvar("clinvar_20180128.vcf.gz")
variants = vcf.find_within(brca1_interval)
...

If the input VCF contains multi-allelic sites, then use normalize to automatically split them on conversion. This step requires samtools and bcftools to be installed.

See also

build_vcfbin()

Parameters:
  • vcfpath – Path to input .vcf or `.vcf.gz file.

  • reference_genome – The reference genome (e.g. “hg19”). Forwarded to build_vcfbin().

  • args – Parameters forwarded to build_vcfbin().

  • normalize – Whether to normalize the VCF during conversion (if applicable). If True, bcftools is used for normalization of all the EBI (Ensembl) chromosomes; alternatively, EBI or UCSC can be used to specify the naming scheme used by the VCF (eg. GTEx uses the UCSC format). Unsupported on Windows due to bcftools and samtools lack of support.

  • cache – Whether to re-use the binary or to re-generate it from vcfpath.

  • vcfbinpath – Path to the intermediate binary. Default is adjacent to vcfpath.

  • kwargs – Parameters forwarded to build_vcfbin().

Returns:

The opened VCF file.

Return type:

VCFTable

find_5p_aligned(interval)[source]

Returns all variants that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[variant for variant in vcf if variant.end5 == interval.end5]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that have 5’ end aligned to the 5’ end of interval.

Return type:

list of VCFVariant

find_3p_aligned(interval)[source]

Returns all variants that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[variant for variant in vcf if variant.end3 == interval.end3]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that have 3’ end aligned to the 3’ end of interval.

Return type:

list of VCFVariant

find_5p_within(interval)[source]

Returns all variants that have 5’ end within interval.

This function is a faster version of the brute-force filter:

[variant for variant in vcf if variant.end5 in interval]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that have 5’ end falling entirely within interval.

Return type:

list of VCFVariant

find_3p_within(interval)[source]

Returns all variants that have 3’ end within interval.

This function is a faster version of the brute-force filter:

[variant for variant in vcf if variant.end3 in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that have 3’ end falling entirely within interval.

Return type:

list of VCFVariant

find_within(interval)[source]

Returns all variants that fall entirely within interval.

This function is a faster version of the brute-force filter:

[variant for variant in vcf if variant in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that fall entirely within interval.

Return type:

list of VCFVariant

find_overlapping(interval)[source]

Returns all variants that overlap interval.

This function is a faster version of the brute-force filter:

[variant for variant in vcf if variant.overlaps(interval)]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that overlap interval.

Return type:

list of VCFVariant

find_exact(interval)[source]

Returns all variants that span exactly interval.

This function is a faster version of the brute-force filter:

[variant for variant in vcf if variant.interval == interval]

The results will also be sorted by 5’ and then 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that span exactly interval.

Return type:

list of VCFVariant

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Access to all variants.

Allows iteration over all variants:

for variant in variants:
    ...

The results will be ordered as in the source file.

Parameters:

index (int) – The integer index of the requested variant.

Returns:

The variant object identified by the given index.

Return type:

VCFVariant

Notes

The VCFVariant will be invalidated after this table has been closed via close() or a with statement.

__len__()[source]

Return len(self).

where(mask)[source]

Filter variants by numpy mask.

Fast extraction of table elements based on a mask:

# Intended to be used like this.
variants = vcf.where(mask)

# It is faster but equivalent to this.
variants = [vcf[i] for i in np.where(mask)[0]]
Parameters:

mask (ndarray) – A boolean mask the same size as the table.

Returns:

A list of variants selected by the mask.

Return type:

list of VCFVariant

masked(mask)[source]

Return a new VCFTable which is filtered by a numpy mask

For example, to do a random downsample:

downsampled = vcf.masked(np.random.binomial(1, 0.1, len(vcf)).astype(bool))

Or to subset by only ABCA4 genes:

abca4_vcf = vcf.masked(vcf.index_of(v) for v in vcf.find_overlapping(gene))

Notes

The masked VCFTable does not share resources with the original, so close() should be called on both instances when required.

Parameters:

mask (ndarray of bool or ndarray of int) – A boolean mask the same size as the table or a list of indices.

Returns:

A list of variants selected by the mask.

Return type:

list of VCFVariant

GT_HOMOZYGOUS_REF = 0

Constant indicating a 0/0 or 0|0 in GT format value.

GT_HETEROZYGOUS_UNPHASED = 1

Constant indicating a 0/1 in GT format value.

GT_HOMOZYGOUS_ALT = 2

Constant indicating a 1/1 or 1|1 in GT format value.

GT_UNKNOWN = 3

Constant indicating a ?/? or ?|? in GT format value.

GT_HETEROZYGOUS_PHASED_0_1 = 4

Constant indicating a 0|1 in GT format value.

GT_HETEROZYGOUS_PHASED_1_0 = 5

Constant indicating a 1|0 in GT format value.

SVTYPE_NA = 0

Constant indicating the SVTYPE was missing.

property num_samples

The number of samples represented in the VCF.

Returns:

Number of samples represented in the VCF file, for which FORMAT data was provided.

Return type:

int

property sample_names

The names of the samples represented in the VCF if it was encoded with format IDs.

Returns:

A list of the sample IDs in the same order as the original VCF. The indices correspond to the format values.

Return type:

list of str

property info_ids

The INFO fields available for info() that were encoded in build_vcfbin() or from_vcf().

Returns:

A list of INFO ids that can be passed into info().

Return type:

list of str

info(info_id)[source]

The full per-variant data array associated with an INFO field. String INFO fields are accessed by VCFVariant attribute, eg. variant.CIGAR.

Parameters:

info_id (str) – The INFO ID from the VCF, such as AD or DP.

Returns:

A complete array of INFO data values associated with info_id.

If there is one value per variant, the results is 1-dimensional:

INFO
DP=1
DP=4

[1 4]

If there are multiple values per variant, the results is 2-dimensional:

INFO
AD=0,1
AD=2,2

[[0 1]
 [2 2]]

Return type:

ndarray

Raises:
  • KeyError – Raised if info_id is unrecognized, because it was not in the original VCF or included by build_vcfbin()

  • TypeError – Raised if info_id refers to a String valued column. String values are accessed by VCFVariant attribute.

property format_ids

The INFO fields available for format() that were encoded in build_vcfbin() or from_vcf().

Returns:

A list of FORMAT ids that can be passed into format().

Return type:

list of str

format(format_id)[source]

The full per-variant per-sample data matrix associated with a FORMAT field.

Parameters:

format_id (str) – The FORMAT ID from the VCF, such as AD or GT.

Returns:

A complete array of FORMAT data values associated with format_id.

If there is one value per variant and per sample, the results is 2-dimensional:

sample1 sample2 sample3
1       2       3
4       5       6

[[1 2 3]
 [4 5 6]]

If there are multiple values per variant and per sample, the results is 3-dimensional:

sample1 sample2 sample3
1,2     3,4     5,6
7,8     9,10    11,12

[[[1 2] [3 4]  [5  6 ]]
 [[7 8] [9 10] [11 12]]]

Return type:

ndarray

Raises:

KeyError – Raised if format_id is unrecognized, either because it was not in the original VCF or included by build_vcfbin().

static build_vcfbin(outfile, infile, reference_genome, info_ids=None, fmt_ids=None, validate=True, exclude=None, allow=None, ancestral='error')[source]

Builds a binary .vcfbin file from a .vcf[.gz].

The resulting file can be opened by VCFTable.

VCF normalization. The input VCF should be normalized, and multi-allelic sites must be split. Consider running infile through bcftools norm -m - -f. Examples:

REF    ALT
CT     CAT     # FAIL (not normalized)
       A       # OK
.      A       # OK
-      A       # OK
C      CA      # OK
C      A,G     # FAIL (multi-allelic)

Extracting INFO/FORMAT columns. By default, INFO and FORMAT columns such as DP and AD are omitted. Use info_ids and fmt_ids arguments to specify which columns you want included:

["GT", "AD"]                        # Keep GT and AD (as int8 and int32 respectively)
{"GT" : None, "AD" : None}          # Same
{"GT" : None, "AD" : -1}            # Same, but store -1 for any missing AD value
{"GT" : None, "AD" : np.int16(0) }  # Keep GT and AD (as int8 and int16 respectively)

The GT (genotype) FORMAT and SVTYPE (structural variant type) INFO columns are special. They default to int8, and the string values are stored as integer constants:

*GT*

*SVTYPE*

  • A missing SVTYPE is represented as VCFTable.SVTYPE_NA (0)

  • DEL is stored as VCFTable.SVTYPE_DEL (1)

  • INS is stored as VCFTable.SVTYPE_INS (2)

  • DUP is stored as VCFTable.SVTYPE_DUP (3)

  • INV is stored as VCFTable.SVTYPE_INV (4)

  • CNV is stored as VCFTable.SVTYPE_CNV (5)

  • BND is stored as VCFTable.SVTYPE_BND (6)

The ID INFO column is stored as an INFO ID. If it is required (e.g., you need to find a MATEID pair), pass “ID” into info_ids.

Missing values are also treated differently: GT values must be specified as a diploid. If a single missing allele (e.g. 0/., ./0, 1/., or ./1) should be encoded as a possible VCFTable.GT_HETEROZYGOUS_UNPHASED instead of VCFTable.GT_UNKNOWN, the missing value should be set to VCFTable.GT_HETEROZYGOUS_UNPHASED. ./. will still be interpreted as VCFTable.GT_UNKNOWN:

{"GT" : VCFTable.GT_HETEROZYGOUS_UNPHASED}

For phased values, the default value is ignored. For example, .|1 will be encoded as VCFTable.GT_UNKNOWN.

All other columns default to int32, float32, or str as defined by the VCF Type header (can override bitwidth for performance or to save disk space.) For integer/float columns, only constant value dimensions are currently supported. By default, a missing value is substituted with 0, 0.0, or “”. If missing integer values must be distinguishable, consider using -(2**31) to (2**31+7) as these are reserved by VCF.

String-valued columns String values can be accessed by attribute (variant.CIGAR). Unlike numeric columns, a string column cannot be accessed as a numpy array.

Caveats: Multi-dimensional values are not split, and ASCII-encoded VCF values, eg. %3A for :, will not be decoded. . are not decoded as empty strings unless it is the sole value.

Value multiplicity. Each INFO/FORMAT field has an expected number of values per sample, following the VCF 4.2 Specification sections 1.4.2 and 1.6.2. For example:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">

Here, a FORMAT value of 0/1:5,8 for a single sample would be valid because we expect one GT value per (Number=1) and two AD values (Number=R).

FORMAT values with multiple samples or INFO/FORMAT values that are multidimensional can only be accessed via info() or format() and cannot be accessed by VCFVariant attributes.

Beware VCF headers that incorrectly specify Number=. instead of R, as their multi-allelic variants will be split incorrectly by bcftools.

Parameters:
  • outfile (str) – The binary output file. This file can then be opened by VCFTable.

  • infile (str) – The path to the source .vcf[.gz] file.

  • reference_genome (Genome) – The reference genome that infile uses as #reference field. The parameter is not validated against the VCF, so it is up to the caller to ensure it is correct. For example, a #reference=b37.fa or =NCBI37 is a match to "hg19".

  • info_ids (list of str | dict of str to object, optional) – Controls how INFO columns are preserved in outfile. See above.

  • fmt_ids (list of str | dict of str to object, optional) – Controls how FORMAT columns are preserved in outfile. See above.

  • validate (bool, optional) – Whether to validate REF sequences against the reference genome. Default is True.

  • exclude (list of Interval) – Optional. Variants within these intervals will be excluded.

  • allow (list of Interval) – Optional. Only variants within these intervals will be included, so long as they are not excluded.

  • ancestral (str) – Behaviour when encountering ancestral alleles. Valid values are “error”, “warn”, and “exclude”. Defaults to “error”.

Raises:

ValueError

Under the following conditions:

  • The VCF is malformed.

  • A row in infile does not contain every FORMAT ID specified by fmt_ids.

  • An INFO/FORMAT ID is unsupported.

  • A REF or ALT sequence contained a VCF non-standard nucleotide code.

  • A REF sequence failed to validate.

  • An INFO/FORMAT field type is incompatible with the VCF Type header definition.

static vcfbin_version()[source]

The vcfbin file format version that this build supports.

Attempting to open an vcfbin file of a mismatched version will raise an IOError.

Returns:

The version number.

Return type:

int

sequence_variations(interval)[source]

Find variants that could affect the given interval’s DNA sequence. When specified with an anchor, deletion variants will increase the interval’s span; this can be overestimated (due to overlapping deletions), but the spurious variants will be filtered during DNA sequence retrieval.

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The dna sequence interval. Anchoring will expand the interval when deletion variants are encountered.

Returns:

A list of variants that possibly affect the given interval’s DNA sequence.

Return type:

list of VCFVariant

close()[source]

Close the file handle.

Note that if you use a with statement the file handle is automatically closed:

# Open the file
with VCFTable('foo.vcfbin') as vcfbin:
    ...

# <-- File is now closed

Notes

Closing this table will invalidate all VCFVariants previously accessed via __getitem__.

property filename

The path to the file from which FORMAT field matrices are retrieved.

Returns:

The path to the VCF file.

Return type:

str

__repr__()[source]

Return repr(self).

VCFVariant

class VCFVariant[source]

A VCF variant. Scalar INFO/FORMAT fields can be accessed as a dynamic attribute, eg. variant.AD.

Bases: Variant

__init__()[source]
as_variant_string()[source]

Returns a string representation of the variant.

See also

from_string()

Returns:

The variant in the format "chromosome:start:ref:alt", where start is in DNA1 coordinates.

Return type:

str

__str__()

Returns a string representation of the variant.

See also

from_string()

Returns:

The variant in the format "chromosome:start:ref:alt", where start is in DNA1 coordinates.

Return type:

str

__repr__()[source]

Return repr(self).

VariantTable

class VariantTable[source]
find_5p_aligned(interval)[source]

Returns all variants that have 5’ end aligned with 5’ end of interval.

This function is a faster version of the brute-force filter:

[v for v in variants if v.end5 == interval.end5]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that have 5’ end aligned to the 5’ end of interval.

Return type:

list of Variant

find_3p_aligned(interval)[source]

Returns all variants that have 3’ end aligned with 3’ end of interval.

This function is a faster version of the brute-force filter:

[v for v in variants if v.end3 == interval.end3]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that have 3’ end aligned to the 3’ end of interval.

Return type:

list of Variant

find_5p_within(interval)[source]

Returns all variants that have 5’-most position within interval.

This function is a faster version of the brute-force filter:

[v for v in variants if v.end5.expand(0, 1) in interval]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that have 5’-most position falling entirely within interval.

Return type:

list of Variant

find_3p_within(interval)[source]

Returns all variants that have 3’-most position within interval.

This function is a faster version of the brute-force filter:

[v for v in variants if v.end3.expand(1, 0) in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that have 3’-most position falling entirely within interval.

Return type:

list of Variant

find_within(interval)[source]

Returns all variants that fall entirely within interval.

This function is a faster version of the brute-force filter:

[v for v in variants if v in interval]

The results will also be sorted by 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that fall entirely within interval.

Return type:

list of Variant

find_overlapping(interval)[source]

Returns all variants that overlap interval.

This function is a faster version of the brute-force filter:

[v for v in variants if v.overlaps(interval)]

The results will also be sorted by 5’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that overlap interval.

Return type:

list of Variant

find_exact(interval)[source]

Returns all variants that span exactly interval.

This function is a faster version of the brute-force filter:

[v for v in variants if v.interval == interval]

The results will also be sorted by 5’ and then 3’.

Parameters:

interval (Interval) – The query interval.

Returns:

All variants that span exactly interval.

Return type:

list of Variant

property stranded

If True then the strand is significant when calling the find_x methods.

Returns:

Whether this table can contain negative stranded intervals.

Return type:

bool

__getitem__(index)[source]

Access to all variants.

Allows iteration over all variants in the table:

for variant in table:
    ...

The results will be ordered as in the source file.

Parameters:

index (int) – The integer index of the requested junction.

Returns:

The variant at the given index.

Return type:

Variant

where(mask)[source]

Filter variants by numpy mask.

Fast extraction of table elements based on a mask:

# Intended to be used like this.
variants = table.where(mask)

# It is faster but equivalent to this.
variants = [table[i] for i in np.where(mask)[0]]
Parameters:

mask (ndarray) – A boolean mask the same size as the table.

Returns:

A list of variants selected by the mask.

Return type:

list of Variant

__repr__()[source]

Return repr(self).

Data Access

genome_kit.DataManager

class DataManager(data_dir: str)[source]
__init__(data_dir: str)[source]

Implementations should accept a data directory. :param data_dir: location where local files are cached

get_file(filename: str) str[source]

Get path to a managed data file and download if necessary.

Parameters:

filename – A single filename, without a path.

Return type:

The path to the local version of the file.

upload_file(filepath: str, filename: str, metadata: Dict[str, str] = None)[source]

Upload a file to make it available in GenomeKit.

Parameters:
  • filepath – Full path to the file to be uploaded.

  • filename – The remote filename.

  • metadata – Extra metadata to attach to the file.

list_available_genomes()[source]

List all available genomes in the data manager

__weakref__

list of weak references to the object

genome_kit.DefaultDataManager

class DefaultDataManager(data_dir: str, bucket_name='genomekit-data-public', require_auth=False)[source]

A minimal data manager implementation that retrieves files from a S3 bucket. When using the default bucket, uploads are only supported for GenomeKit maintainers.

__init__(data_dir: str, bucket_name='genomekit-data-public', require_auth=False)[source]
Parameters:
  • data_dir – location where local files are cached

  • bucket_name – S3 bucket. By default, use the bucket sponsored by AWS Open Data

  • require_auth – set True for non-public buckets

get_file(filename: str) str[source]

Get path to a managed data file and download if necessary.

Parameters:

filename – A single filename, without a path.

Return type:

The path to the local version of the file.

upload_file(filepath: str, filename: str, metadata: Dict[str, str] = None)[source]

Upload a file to make it available in GenomeKit.

Parameters:
  • filepath – Full path to the file to be uploaded.

  • filename – The remote filename.

  • metadata – Extra metadata to attach to the file.

list_available_genomes()[source]

List all available genomes in the data manager

genome_kit.GCSDataManager

class GCSDataManager(data_dir: str, bucket_name: str)[source]

A minimal data manager implementation that retrieves files from a GCS bucket.

__init__(data_dir: str, bucket_name: str)[source]
Parameters:
  • data_dir – location where local files are cached

  • bucket_name – GCS bucket

get_file(filename: str) str[source]

Get path to a managed data file and download if necessary.

Parameters:

filename – A single filename, without a path.

Return type:

The path to the local version of the file.

upload_file(filepath: str, filename: str, metadata: Dict[str, str] = None)[source]

Upload a file to make it available in GenomeKit.

Parameters:
  • filepath – Full path to the file to be uploaded.

  • filename – The remote filename.

  • metadata – Extra metadata to attach to the file.

list_available_genomes()[source]

List all available genomes in the data manager

Internals

genome_kit.gk_data

This module automatically fetches GenomeKit data.

A user should first use upload_file() to upload a file to make it available other GenomeKit users.

After the upload, the file can be used by any GenomeKit function via get_file(), which will download the file on demand and return its local path.

Example

>>> upload_file('/local/path/hg38.2bit', 'hg38.2bit')
>>> get_file('hg38.2bit')
"/Users/example/Application Support/genome_kit/hg38.2bit"
get_file(filename)[source]

compatibility wrapper for get_file()

upload_file(filepath: str, filename: str, metadata: Dict[str, str] = None)[source]

compatibility wrapper for upload_file()

wget(url, dst=None, timestamping=False, progress=False)[source]

Download a file from the URL.

If dst is unspecified, the file will be downloaded to the system temp directory and given the same name.

If the last-modified time of the remote file is available, it will be assigned to the downloaded file.

Parameters:
  • url (str) – The URL of the source file.

  • dst (str) – The path to the local destination.

  • timestamping (bool) – If true, only download when last-modified date if the remote file is newer than the local file. If the server does not provide the last-modified date, the local file will be kept by default.

  • progress (bool) – If true, show a progress bar.

Returns:

The path to the local version of the file.

Return type:

str

Raises:
  • IOError – There was an error opening dst.

  • HTTPError – There was an error opening the URL.

  • socket.timeout – The connection timed out or was interrupted.

genome_kit.data_manager

class ProgressPercentage(filename, filesize=None, progress_msg='Progress')[source]

Display a progress bar on a text terminal showing file being transferred, completion percentage, estimated transfer time and throughput.

`Progress "myfile.bin" [ 20.0 B/ 20.0 B] 100% |###############| ETA:  0:00:00 147.8 B/s`

__init__(filename, filesize=None, progress_msg='Progress')[source]
Parameters:
  • filename (str) – file name to display, will be truncated for display if necessary

  • filesize (int) – file size in bytes

  • progress_msg (str) – message to display before filename, e.g. “[MyPackage] downloading”, defaults to “”

__call__(bytes_amount)[source]

Call self as a function.

__weakref__

list of weak references to the object

genome_kit._twobitutils

write2bit(outfile, seqs)[source]

Write a 2bit file containing the given DNA sequences.

Parameters:
  • outfile (str) – The name of the output file.

  • seqs (dict) – A dictionary of named DNA sequence strings each of which will be represented by name. Sequence names can be arbitrary. Sequences may contain a mix of upper/lower case ACGTN characters.

Examples

>>> seqs={
...   "chr1" : "AAAANNnnCCAAaattTT",
...   "chr2" : "NNNnnnnggggGGGG",
... }
>>> write2bit('mydna.2bit', seqs)

genome_kit._apply_variants

apply_variants(sequence, variants, interval, reference_alignment=False)[source]

Apply variants to a sequence interval.

This function ignores the interval’s strand attribute and always returns the sequence on the reference strand.

Parameters:
  • sequence (GenomeDNA) – The source from which to extract DNA.

  • variants (list of Variant) – A list of variants.

  • interval (Interval | Coordinate) – An anchored interval/coordinate specifying what to extract.

  • reference_alignment (bool) – Whether to compute the alignment between the variant and reference sequence.

Returns:

  • str – The variant DNA sequence.

  • list – The alignment of the variant sequence with the reference sequence (if reference_alignment=True).

genome_kit._cxx_util

mock(arg)[source]

Marks a class, attribute, or method as being a ‘mock’.

When Python-defined class A inherits from C++-defined class B, the attributes and methods B are only partially visible to the IDE.

The mock function allows the definition of A to mock the attributes and methods of B, making them accessible the IDE (for autocomplete) and to automatic doc extraction (e.g. sphinx).

For example, the C++-defined genome_kit._cxx.GeneTable implements __getitem__, which returns an instance of the Python-defined type genome_kit.Gene. The Python-defined class genome_kit.GeneTable is defined as:

@_cxx.register
class GeneTable(_cxx.GeneTable):

    @mock
    def __getitem__(self, index):
        "Returns a Gene object"
        return mock_result(Gene)

The IDE and autodocs ignore the @mock decorator, so they treat this definition at face value and do static analysis, propagating the return type (genome_kit.Gene) when doing autocomplete:

for gene in genome.genes:      # genes is of type GeneTable
    gene.<autocomplete>        # gene is of type Gene

When a class is registered with the C++ backend, all mock properties and methods are deleted from the type, so that all requests for those attributes will fall through directly to the C++ implementation in the base class.

See the PyDeleteMockAttrs C++ function for the code that strips some mock symbols from each registered type.

mock_result(result_type)[source]

Return a mock result of one or more types.

This function is used to give IDEs a hint during static analysis.

This function merely calls mock_unreachable() before returning an instance of result_type. It’s important to print the error message before trying to instantiate the return type.

See also: mock()

mock_unreachable()[source]

Prints an unreachable code warning message.

If a mock attribute or method is ever called, it means there’s some mismatch between a Python-defined type and a C++-defined type.

An alternative behaviour would be to raise an exception, but that causes headaches when pylint and PyCharm try to warn about unreachable code.

The reason this function is provided separately from mock_result is that it allows the user to either return nothing:

def foo(self):
    mock_unreachable()

or to return an object that provides more complex type hints:

def foo(self):
    mock_unreachable()
    return [str()]  # type: List[str]

See also: mock_result()

strip_mock_bases(subtype)[source]

Strips the last base classes at runtime from the given type.

Given a class hierarchy X->Y->Z where X and Z are defined in Python but Y is defined in C++, then the IDE does not know that Z inherits from X, and so it cannot provide autocomplete for members of X.

This decorator is a hacky way to trick IDEs into statically parsing out the inheritance structure, while removing the ‘trick’ at runtime.

For example, the class hierarchy we want for Gene is simply:

@_cxx.register
class Gene(_cxx.Gene):  # Interval -> _cxx.Gene -> Gene
    ...

However, IDEs will not see that Gene inherits from Interval. So, instead we declare Gene as follows:

@strip_mock_bases
@_cxx.register
class Gene(_cxx.Gene, Interval):
    ...

The IDE will now think that Gene inherits Interval via multiple inheritance, and will IDE autocomplete accordingly, even though at runtime we’ll get the linear inheritance scheme that we want.