Quickstart

Here is a quick tutorial covering GenomeKit’s most important functionality.

Installation

Install via conda

The best way to install GenomeKit is via the pre-compiled conda packages available from https://anaconda.org/conda-forge/genomekit.

You can install GenomeKit with:

conda install -c conda-forge genomekit

Use with docker

Alternatively you can use GenomeKit with docker:

$ docker run -it --rm deepgenomicsinc/genomekit:latest python
>>> import genome_kit

Basics

Initial Data Access

Data, such as assemblies, annotations, tracks, etc, are stored in custom-built binary files.

These files back GenomeKit objects and are required to query the genomic sequences and locations you are interested in. APIs for building these files are provided as part of the API.

A selection of pre-built data files is provided in a public S3 bucket, which is set as the default data source. If you prefer to generate the data locally, see Generating data with Docker below.

Generating data with Docker

Clone the repo:

git clone https://github.com/deepgenomics/GenomeKit.git
pushd GenomeKit

Scripts under data-src are used to obtain and generate the data files:

  • data-src/<assembly>/assembly for the assembly, e.g data-src/hg19/assembly

  • data-src/<assembly>/<annotation-source>/<annotation>, e.g data-src/hg19/GENCODE/v26lift37

Run the data generation script for the annotation/assembly you require. Note that annotations require the assembly to be built first.

# set GENOMEKIT_DATA_DIR
export GENOMEKIT_DATA_DIR=$(python -c "import os ; import appdirs ; print(os.environ.get('GENOMEKIT_DATA_DIR', appdirs.user_data_dir('genome_kit')))")

# obtain data files for the related assembly
docker run --rm -it -v ./data-src:/data-src \
    -v $GENOMEKIT_DATA_DIR:/output -e GENOMEKIT_DATA_DIR=/output \
    --platform=linux/amd64 deepgenomicsinc/genomekit \
    python /data-src/build.py hg19.p13.plusMT/assembly /output

# generate the annotation data files
docker run --rm -it -v ./data-src:/data-src \
    -v $GENOMEKIT_DATA_DIR:/output -e GENOMEKIT_DATA_DIR=/output \
    --platform=linux/amd64 deepgenomicsinc/genomekit \
    python /data-src/build.py hg19.p13.plusMT/NCBI/v105.20190906 /output

The files should now be ready to use in your local GenomeKit data directory.

For more details on creating your own data source, see the section on Sharing data files.

Usage

Import the genome_kit package into your project:

>>> import genome_kit as gk

For brevity, this tutorial assumes you’ve imported several core types directly into your namespace:

>>> from genome_kit import Genome
>>> from genome_kit import Interval
...

Most GenomeKit code will need two core object types:

  • Interval - an interval on a reference genome

  • Genome - resources available for a reference genome

An interval is initialized from five values: chromosome, strand, start, end, and reference_genome

>>> interval = Interval("chr7", "+", 117120016, 117120201, "hg19")

Irrespective of strand, an interval always has start <= end. The span of an interval excludes the end position, so the number of bases spanned by an interval is end – start.

>>> len(interval)
185

GenomeKit intervals can be described as “0-based, exclusive end”, whereas UCSC browser positions are “1-based, inclusive end”:

>>> interval.as_ucsc()
'chr7:117120017-117120201'

A genome provides convenient access to resources associated with a reference genome:

>>> genome = Genome("hg19")  # Equivalently "hg19"
>>> genome.dna(interval)
'AATTGGAAGCAAA...AACTTTTTTTCAG'

Some resources are versioned and are only accessible if the intended version is unambiguous from the configuration strings. For example, specifying a GENCODE version enables annotations:

>>> genome = Genome("gencode.v19")             # Implies "hg19"
>>> gene = genome.genes["ENSG00000001626.10"]  # Gene object
>>> tran = gene.transcripts[2]                 # Transcript object
>>> exon = tran.exons[0]                       # Exon object
>>> genome.dna(exon)
'AATTGGAAGCAAA...AACTTTTTTTCAG'

The above exon has the same interval that we defined earlier, plus several other attributes:

>>> exon.interval
Interval("chr7", "+", 117120016, 117120201, "hg19")
>>> exon.index
0
>>> exon.transcript
<Transcript ENST00000003084.6 of CFTR>
>>> exon.cds
<Cds in Exon 1/27 of ENST00000003084.6>
>>> exon.next_exon
<Exon 2/27 of ENST00000003084.6>

Tracks

GenomeKit users can also build tracks via GenomeTrackBuilder.

Note

When building a track, data is ordered according to the strandedness argument passed to the builder’s constructor:

  • "single_stranded": both strands share the same data. The data is applied in Interval coordinate (reference strand) order.

  • "strand_unaware": ignores the Interval strand, data is applied in Interval coordinate (reference strand) order.

  • "strand_aware": data is applied from 5” end to 3” end (sense strand order).

>>> track = GenomeTrackBuilder("neg.gtrack", "u3", "strand_unaware", Genome("hg19"))
>>> interval = Interval("chr1", "-", 10, 15, "hg38")
>>> track.set_data(interval, np.arange(0, len(interval), dtype=np.uint8))
>>> track.finalize()
>>> track = GenomeTrack("neg.gtrack")
>>> track(interval)
array([[4],
       [3],
       [2],
       [1],
       [0]], dtype=uint8)
>>> track = GenomeTrackBuilder("neg.gtrack", "u3", "strand_aware", Genome("hg19"))
>>> track.set_data(interval, np.arange(0, len(interval), dtype=np.uint8))
>>> track.finalize()
>>> track = GenomeTrack("neg.gtrack")
>>> track(interval)
array([[0],
       [1],
       [2],
       [3],
       [4]], dtype=uint8)

Annotations

GenomeKit provides access to GENCODE and RefSeq annotations.

Walking the structure of an annotation is straight-forward:

genome = Genome("gencode.v19")
for gene in genome.genes:          # Each gene
    print(gene)
    for tran in gene.transcripts:  # Each transcript on the gene
        print("  ", tran)
        for exon in tran.exons:    # Each exon on the transcript
            print("     ", exon)

You can run the above example from the GenomeKit directory:

$ python demos/walk_annotations.py
<Gene ENSG00000223972.4 (DDX11L1)>
   <Transcript ENST00000456328.2 of DDX11L1>
      <Exon 1/3 of ENST00000456328.2>
      <Exon 2/3 of ENST00000456328.2>
      <Exon 3/3 of ENST00000456328.2>
...

Annotations are accessible via following attributes: genes, transcripts, exons, introns, and cdss. Each of these can be thought of as an indexed table, like in a database. For example, exons is an instance of type ExonTable and each row in that table is an instance of Exon, with the fields you’d expect.

Most importantly, each annotation table is indexed for fast positional queries:

>>> # First exon of a CFTR transcript
>>> exon = genome.transcripts["ENST00000003084.6"].exons[0]
>>> genome.exons.find_overlapping(exon)
[<Exon 1/27 of ENST00000003084.6>,
 <Exon 1/26 of ENST00000454343.1>,
 <Exon 1/26 of ENST00000426809.1>]

The following methods take a single interval argument and are available for every element table:

These methods are useful for mapping positions to annotation elements. See GenomeAnnotation for the top-level object that ultimately owns all the annotation tables.

Working with Intervals

GenomeKit uses the following convention for Interval:

  1. intervals are always stranded (+ or –),

  2. positions are internally 0-based, and

  3. the span of an interval excludes its end position.

Intervals are initialized using what is called the DNA0 convention where, irrespective of strand, where start <= end. Intervals can be empty (start == end). The resulting interval spans the same positions in the genome as a Python slice operation [start:end].

For example, an interval with start=3 and end=7 spans the four over-lined positions below, irrespective of strand:

        [__________]
0  1  2  3  4  5  6  7  8  9  10

To see what we can do with intervals, let’s create a few:

>>> #  0123456789
>>> #  aaaaabbbbb
>>> #     cccc
>>> #     d
>>> a = Interval("chr1", "+", 0,  5, "hg38")
>>> b = Interval("chr1", "+", 5, 10, "hg38")
>>> c = Interval("chr1", "+", 3,  7, "hg38")
>>> d = Interval("chr1", "+", 3,  4, "hg38")

>>> len(a), len(b), len(c), len(d)
(5, 5, 4, 1)

>>> a.contains(c), c.within(a), a.contains(d), d.within(a)
(False, False, True, True)

>>> a.overlaps(b), a.overlaps(c)
(False, True)

>>> a.upstream_of(b), b.dnstream_of(a)
(True, True)
>>> c.upstream_of(b), b.dnstream_of(c)
(False, False)

>>> a == b, a == d
(False, False)
>>> a != b, a != d
(True, True)

Intervals on opposite strands effectively live in different universes:

>>> #  0123456789
>>> #  xxxxxyyyyy
>>> #     zzzz
>>> #     w
>>> x = a.as_opposite_strand()
>>> y = b.as_opposite_strand()
>>> z = c.as_opposite_strand()
>>> w = d.as_opposite_strand()

>>> x
Interval("chr1", "-", 0, 5, "hg38")
>>> y
Interval("chr1", "-", 5, 10, "hg38")
>>> z
Interval("chr1", "-", 3, 7, "hg38")
>>> w
Interval("chr1", "-", 3, 4, "hg38")

>>> x.overlaps(d)  # opposite strands
False
>>> x.contains(d)  # opposite strands
False
>>> x.contains(w)
True

>>> x.upstream_of(y), y.upstream_of(x)
(False, True)

>>> x == a   # opposite strands
False

Given an interval, you can also build new intervals centered around its 5’ end (upstream) and 3’ end (downstream), which depends on strand:

>>> interval = Interval("chr1", "-", 4,  8, "hg38")
>>> interval.end5
Interval("chr1", "-", 8, 8, "hg38")

>>> interval.end3
Interval("chr1", "-", 4, 4, "hg38")

>>> interval.end3.expand(2, 3)
Interval("chr1", "-", 1, 6, "hg38")

Notice that the end5 and end3 attributes return empty (length-0) intervals. This is a general convention in GenomeKit: an empty interval denotes the space between two consecutive positions in the genome. This convention is useful for defining alignments or defining new intervals relative to the empty one via expand(). For example, the intervals in the above code example can be visualized as follows:

           [__________]           # Interval("chr1", "-", 4,  8, "hg38")
0  1  2  3  4  5  6  7  8  9  10

                      []          # interval.end5
0  1  2  3  4  5  6  7  8  9  10

          []                      # interval.end3
0  1  2  3  4  5  6  7  8  9  10

  [_____________]                 # interval.end3.expand(2, 3)
0  1  2  3  4  5  6  7  8  9  10

Feature extraction basics

GenomeKit currently provides DNA sequence, accessible via the dna attribute of Genome. The extracted DNA is automatically reverse-complemented according to strand:

>>> a = Interval("chr7", "+", 117120016, 117120201, "hg19")
>>> b = a.as_opposite_strand()

>>> genome = Genome("hg19")
>>> genome.dna(a)
'AATTGGAAGCAAA...AACTTTTTTTCAG'
>>> genome.dna(b)
'CTGAAAAAAAGTT...TTTGCTTCCAATT'

GenomeKit makes it easy to extract features that correspond to annotated elements, and to filter those elements by their attributes. For example, we could extract DNA from all acceptor sites that meet certain criteria:

def has_coding_seq(e): return e.cds is not None   # Has CDS?
def has_good_level(e): return e.tran.level <= 2   # Is level 1 or 2?
def not_first_exon(e): return e.index > 0         # Is not first exon?

genome = Genome("gencode.v19")                    # 1196293 exons
exons = filter(has_coding_seq, genome.exons)      #  724078 remaining
exons = filter(has_good_level, exons)             #  605573 remaining
exons = filter(not_first_exon, exons)             #  558099 remaining
sites = set(exon.end5 for exon in exons)          #  198992 unique

for site in sites:
    print(genome.dna(site.expand(5, 5)))

The above outputs the following 10nt sequences surrounding acceptor sites:

TGCAGGGAAC   # Note they are all sense-strand (AG)
TTCAGCTGCT   # because exon.end5 knows the strand.
TGTAGGAAAC
TCCAGGCTAT
GCCAGAGGAC
GACAGAACCA
CCCAGATTGG
...

GenomeKit also make it easy to map positions to the nearest annotated element. For example, we could map branch sites to their nearest downstream acceptor site:

genome = Genome("gencode.v19")

# We want to map each of these to an acceptor at most 100nt downstream
coords = [ Interval.from_dna0_coord("chr1", "-",  91661, "hg19"),
           Interval.from_dna0_coord("chr1", "-", 169295, "hg19"),
           Interval.from_dna0_coord("chr1", "+", 320861, "hg19") ]

for coord in coords:
    window = coord.expand(0, 100)                # Find all exons with
    exons = genome.exons.find_5p_within(window)  # acceptor in window
    for exon in exons:
        print(coord.as_ucsc(), "-->", exon)

The above outputs the following mappings:

chr1:91662-91662 --> <Exon 4/4 of ENST00000466430.1>
chr1:169296-169296 --> <Exon 3/8 of ENST00000466557.2>
chr1:320862-320862 --> <Exon 2/3 of ENST00000432964.1>  # 1st candidate
chr1:320862-320862 --> <Exon 2/4 of ENST00000601486.1>  # 2nd candidate
chr1:320862-320862 --> <Exon 1/3 of ENST00000599771.2>  # 3rd candidate

Variants

Individual genomic variants are represented by a Variant:

from genome_kit import Variant

A variant is defined by a chromosome, 0-based position (DNA0), reference allele, alternate allele, and reference genome:

>>> variant = Variant("chr7", 117120148, "AT", "G", "hg19")
>>> variant
<Variant chr7:117120148:AT:G:hg19>

A variant is a subclass of Interval, and also has an interval attribute. The interval spans the reference allele:

>>> variant.start, variant.end, len(variant)
(117120148, 117120150, 2)

>>> variant.interval
Interval("chr7", "+", 117120148, 117120150, "hg19")

Variants may also be created from a string where the position is 1-based (DNA1), which is the convention of UCSC and Clinvar:

>>> genome = Genome("hg19")
>>> variant = genome.variant("chr7:117,120,149:AT:G")               # First way
>>> variant = Variant.from_string('chr7:117,120,149:AT:G', genome)  # Second way
>>> variant
<Variant chr7:117120148:AT:G:hg19>

A variant created from a string is always validated against the reference genome, raising an exception if the reference allele does not match.

ENSEMBL-style chromosome names (without leading chr) are also allowed and are automatically converted.

The GenomeKit variant format includes, but is more general than, the Clinvar variant convention when it comes to insertion and deletions. Clinvar does not allow an empty ref or alt allele (requiring the allele before the indel to be repeated), while empty alleles are allowed in GenomeKit. For example, the variants chr7:117,120,150:A:- and chr7:117,120,149:CA:C are interpreted identically by GenomeKit.

Empty alleles can be specified by "", "-", or ".".

Variants from VCF Files

GenomeKit provides a binary VCF format that is compact, indexed by position. The binary files can be opened by VCFTable, which returns convenient Variant-based objects:

>>> from genome_kit import VCFTable

Suppose you have the following VCF saved as test.vcf.gz:

##fileformat=VCFv4.2
##reference=GRCh37
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
#CHROM POS     ID REF ALT QUAL FILTER INFO    FORMAT sample1  sample2  sample3
1      949523  .  C   T   .    .      AF=0.00 GT:AD  0/0:0,1  0/1:0,2  0/0:0,3
1      949608  .  G   A   .    .      AF=0.01 GT:AD  0/0:0,4  0/1:0,5  0/0:0,6
1      949696  .  -   G   .    .      AF=0.02 GT:AD  0/0:0,7  0/1:0,8  0/1:0,9
1      949739  .  G   TC  .    .      AF=0.03 GT:AD  0/1:0,10 0/0:0,11 1/1:0,12
1      977028  .  G   T   .    .      AF=0.04 GT:AD  0/1:0,13 0/0:0,14 1/1:0,15
1      977330  .  T   C   .    .      AF=0.05 GT:AD  0/1:0,16 0/0:0,17 ./.:0,18
1      977516  .  -   C   .    .      AF=0.06 GT:AD  1/1:0,19 1/1:0,20 ./.:0,21
1      977570  .  G   A   .    .      AF=0.07 GT:AD  1/1:0,22 1/1:0,23 ./.:0,24
1      978604  .  CT  -   .    .      AF=0.08 GT:AD  1/1:0,25 1/1:0,26 ./.:0,27
1      978628  .  C   T   .    .      AF=0.09 GT:AD  ./.:28,0 0/0:29,0 ./.:30,0

Open the VCF, making sure to carry over the AF, GT, and AD data:

>>> vcf = VCFTable.from_vcf("test.vcf.gz", Genome("hg19"), info_ids=["AF"], fmt_ids=["GT", "AD"])
>>> vcf
<VCFTable, len() = 10>

>>> vcf[0]
<VCFVariant chr1:949522:C:T:hg19>

Get a variant’s INFO attribute:

>>> vcf[5].AF
0.0500000007451

>>> vcf.info("AF")
array([ 0.  ,  0.01,  0.02,  ..., 0.07,  0.08,  0.09], dtype=float32)

Query an interval:

>>> interval = vcf[5].expand(300, 300)   # chr1:977030-977630
>>> variants = vcf.find_within(interval)
>>> variants
[<VCFVariant chr1:977329:T:C:hg19>,
 <VCFVariant chr1:977515::C:hg19>,
 <VCFVariant chr1:977569:G:A:hg19>]

Get indices of variants returned by a query:

>>> indices = [vcf.index_of(v) for v in variants]
>>> indices
[5, 6, 7]

Get per-sample genotype and allelic depth for specific variants:

>>> gt = vcf.format('GT')
>>> gt.shape
(10L, 3L)
>>> gt[indices]
array([[1, 0, 0],
       [2, 2, 0],
       [2, 2, 0]], dtype=int8)

>>> ad = vcf.format('AD')
>>> ad.shape
(10L, 3L)
>>> ad[indices]
array([[[ 0, 16],
        [ 0, 17],
        [ 0, 18]],

       [[ 0, 19],
        [ 0, 20],
        [ 0, 21]],

       [[ 0, 22],
        [ 0, 23],
        [ 0, 24]]], dtype=int32)

The fastest way to filter variants by FORMAT columns is to use numpy on the entire array:

>>> mask = np.any(ad.sum(axis=2) >= 20, axis=1)  # Find variants with at least one
>>> variants = vcf.where(mask)                   # sample having ad >= 20
>>> variants
[<VCFVariant chr1:977515::C:hg19>,
 <VCFVariant chr1:977569:G:A:hg19>,
 <VCFVariant chr1:978603:CT::hg19>,
 <VCFVariant chr1:978627:C:T:hg19>]

You can run the above examples with $python demos/query_vcf.py from the GenomeKit directory.

Feature Extraction with Variants

Besides reference genomes (Genome), GenomeKit can also represent a genome that has variants applied to it (VariantGenome). The idea is that you can pass either a reference genome or a variant genome to your feature extraction code, and both cases will work transparently.

VariantGenome will accept either a single Variant object, a variant string, or a list which can contain a mixture of both. When a Variant object is passed it is checked that it is defined on the same reference genome as the VariantGenome. When a string is supplied, as Variant object is created on the same reference genome as the VariantGenome.

Consider a toy example, where the only feature we extract is the DNA sequence flanking the 5’ end of a CFTR transcript:

def extract_features(genome):
    tran = genome.transcripts["ENST00000426809.1"]   # CFTR transcript
    span = tran.end5.expand(2, 5)                    # 7nt span at 5' end
    return genome.dna(span)                          # extract DNA

ref = Genome("gencode.v19")
variants = [Variant.from_string("chr7:117120149:A:G", ref),     # rs397508328
            Variant.from_string("chr7:117120151:G:T", ref)]     # rs397508657
var = VariantGenome(ref, variants)
print(extract_features(ref))
print(extract_features(var))

The above outputs reference DNA, and a variant DNA:

CCATGCA
CCGTTCA

However, unless one plans to only support SNVs and not INDELs, it is important to specify how each query interval should be aligned with respect to insertions/deletions. Read on and learn about VariantGenome and about the concept of “anchors”.

Variant genomes

Variant genomes are made by applying a zero or more variants to a reference genome via the VariantGenome class. If a list of variants is given, they are all applied as if they were a single complex variant:

ref = Genome("hg19")
var1 = VariantGenome(ref, ref.variant("chr7:117120188:A:T"))    # rs397508673 (A>T)
var2 = VariantGenome(ref, ref.variant("chr7:117120190:A:-"))    # rs397508710 (delA)
var3 = VariantGenome(ref, [ref.variant(x) for x in ["chr7:117120188:A:T",
                           "chr7:117120190:A:-"]])  # both variants together

Note

Variants are currently specified using string format chromosome:position:ref:alt where the position is DNA1, just like in ClinVar or the UCSC browser (technically VCF 4.0/4.1 standard). Clinvar also has a special format to denote indels that includes the preceding base as padding, such as in 'chr7:117,231,993:TCT:T'. GenomeKit can handle this format but it is optional. Therefore 'chr7:117,231,994:CT:' would be equivalent to the previous variant. GenomeKit also allows either -, or . to denote an empty ref or alt field (e.g. 'chr7:117,231,994:CT:.'). When the padding base in supplied, GenomeKit trims it off internally. GenomeKit also allows comma separators in the variant position (e.g. '117,231,993'), as well as both UCSC ('chr7') and ENSEMBL-style ('7') chromosome names (only for nuclear chromosomes 1–23, X, and Y).

Given a query interval, the variant genome’s dna() method returns the variant sequence, rather than the reference sequence:

>>> interval = Interval("chr7", "+", 117120185, 117120192, ref)
>>> ref.dna(interval)
'CCAAACT'
>>> var1.dna(interval)   # (A>T)
'CCTAACT'
>>> var2.dna(interval)   # (delA)
'CCAACT'
>>> var3.dna(interval)   # (A>T, delA)
'CCTACT'

Notice that when a length-changing variant falls within the query interval, the length of the result changes. This is only the default behaviour. The next section explains how to control alignment for feature extraction by ‘anchoring’ an interval.

Length-changing variants

Variants that insert or delete positions (INDELs) effectively change the coordinate system of the variant genome. If an interval is specified on the reference genome, and there are variants falling within that interval, then the manner in which it should be lifted to the variant genome is ambiguous. This section explains how you can control the lifting behaviour to suit your feature extraction needs.

The mechanism that GenomeKit provides is anchored intervals. The idea is that the user indicates a position within the reference interval that remains aligned when the interval is lifted over to the variant genome.

For example, you may want to anchor your interval to its 5’ or 3’ end:

>>> interval = Interval("chr7", "+", 117120185, 117120192, ref)
>>> anchored_5p = interval.with_anchor("5p")  # Anchored to its 5' end
>>> anchored_3p = interval.with_anchor("3p")  # Anchored to its 3' end

>>> ref = Genome("hg19")
>>> var = VariantGenome(ref, ref.variant("chr7:117120190:A:-"))  # rs397508710 (delA)
>>> ref.dna(interval)
'CCAAACT'
>>> var.dna(interval)     # (shrink 3' end)
'CCAACT'
>>> var.dna(anchored_5p)  # (fill 3' end)
'CCAACTT'
>>> var.dna(anchored_3p)  # (fill 5' end)
'TCCAACT'

Besides the “5p” and “3p” modes, there are other ways to anchor your intervals. See Anchors for a more in-depth explanation.

Motif finding

GenomeKit can find motifs in both reference and mutant sequences for you. This is based on string matching, not PWMs.

The example shows how to search for motifs on both the reference and a variant genome using the genome_kit.Genome.find_motif() and genome_kit.VariantGenome.find_motif():

>>> genome = Genome('hg19')

>>> # Short sequence from CFTR
>>> interval = Interval('chr7', '+', 117231957, 117232030, genome)
>>> genome.dna(interval)
'TTGATATTTATATGTTTTTATATCTTAAAGCTGTGTCTGTAAACTGATGGCTAACAAAACTAGGATTTTGGTC'

>>> motif = 'AACAA'
>>> matches = genome.find_motif(interval, motif)
>>> matches
[Interval("chr7", "+", 117232009, 117232009, "hg19", 117232009)]

The returned interval is empty but can be expanded for feature extraction:

>>> matches[0].expand(5, 5)
Interval("chr7", "+", 117232004, 117232014, "hg19", 117232009)

The returned interval always has its anchor set equal to its position. This means that the interval will always stay aligned to the same position on the reference genome when variants are applied.

The default is to return the empty interval upstream of the first nucleotide of the motif matches, i.e. the motif is immediately downstream. But in many cases we would like to change that behaviour. For example, when we match the acceptor core splice site motif AG, we usually want the match to be aligned to the 3’ end of the AG motif. For this reason, find_motif() supports the match_position argument. The default is match_position=0 (equivalently match_position='5p'), which aligns the match to the 5’ end of the motif. In the above case of searching for acceptor motifs, we can set match_position='3p' to return potential splice sites:

>>> motif = 'AG'
>>> matches = genome.find_motif(interval, motif, match_position='3p')
>>> matches
[Interval("chr7", "+", 117231987, 117231987, "hg19", 117231987),
 Interval("chr7", "+", 117232020, 117232020, "hg19", 117232020)]

We can verify that the AG is now immediately upstream of the returned empty intervals:

>>> [genome.dna(match.expand(2, 0)) for match in matches]
['AG', 'AG']

Alternatively, match_position can also be an integer: match_position=0 is equivalent to match_position='5p', match_position=len(motif) is equivalent to match_position='3p' and integers within that range match positions within the motif.

By default find_motif() returns only non overlapping matches, but this can be configured with the find_overlapping_matches parameter:

>>> interval = Interval('chr7', '+', 117231957, 117232030, genome)
>>> motif = 'TT'
>>> genome.dna(interval)
'TTGATATTTATATGTTTTTA'
>>> genome.find_motif(interval, motif, find_overlapping_motifs=False)
[Interval("chr7", "+", 117231957, 117231957, "hg19", 117231957),
 Interval("chr7", "+", 117231963, 117231963, "hg19", 117231963),
 Interval("chr7", "+", 117231971, 117231971, "hg19", 117231971),
 Interval("chr7", "+", 117231973, 117231973, "hg19", 117231973),
 Interval("chr7", "+", 117231981, 117231981, "hg19", 117231981),
 Interval("chr7", "+", 117232022, 117232022, "hg19", 117232022),
 Interval("chr7", "+", 117232024, 117232024, "hg19", 117232024)]]
>>> genome.find_motif(interval, motif, find_overlapping_motifs=True)
[Interval("chr7", "+", 117231957, 117231957, "hg19", 117231957),
 Interval("chr7", "+", 117231963, 117231963, "hg19", 117231963),
 Interval("chr7", "+", 117231964, 117231964, "hg19", 117231964),
 Interval("chr7", "+", 117231971, 117231971, "hg19", 117231971),
 Interval("chr7", "+", 117231972, 117231972, "hg19", 117231972),
 Interval("chr7", "+", 117231973, 117231973, "hg19", 117231973),
 Interval("chr7", "+", 117231974, 117231974, "hg19", 117231974),
 Interval("chr7", "+", 117231981, 117231981, "hg19", 117231981),
 Interval("chr7", "+", 117232022, 117232022, "hg19", 117232022),
 Interval("chr7", "+", 117232023, 117232023, "hg19", 117232023),
 Interval("chr7", "+", 117232024, 117232024, "hg19", 117232024)]

The find_motif() works the same on VariantGenome as on the reference genome:

>>> variant = genome.variant("chr7:117231980::TTAGTT")  # Insertion
>>> variant_genome = VariantGenome(genome, variant)
>>> motif = 'AG'
>>> matches = variant_genome.find_motif(interval, motif,
...                                     match_position='3p')
>>> matches
[Interval("chr7", "+", 117231979, 117231979, "hg19", 117231979, 4),
 Interval("chr7", "+", 117231987, 117231987, "hg19", 117231987),
 Interval("chr7", "+", 117232020, 117232020, "hg19", 117232020)]

The result are the previous two matches plus a third one where we inserted the motif into the reference sequence. This is a special case since the position within the insertion has no alignment to the reference genome. In this case the interval has its anchor_offset attribute set to four to indicate that the motif matches at the fourth position in the insertion.

Sharing new data files

Data such as assemblies, annotations, tracks, etc is stored in custom-built binary files. APIs for building these files are provided as part of the API. GenomeKit first searches for data files in the directory specified by the environment variable GENOMEKIT_DATA_DIR, defaulting to appdirs.user_data_dir("genome_kit"). If the file is not found, by default GenomeKit attempts to download it from a public read-only Google Cloud Storage bucket.

You can use your own GCS bucket by setting the environment variable GENOMEKIT_STORAGE_BUCKET, allowing you to upload and share your data files.

If you wish to use a different mechanism to store remote data files, you can provide an implementation of genome_kit.data_manager.DataManager and register it:

class MyDataManager(DataManager):
        def __init__(self, data_dir: str):
            ...

        def get_file(self, filename: str) -> str:
            ...

        def upload_file(self, filepath: str, filename: str, metadata: Dict[str, str]=None):
            ...

gk.gk_data.data_manager = MyDataManager()

# Alternatively, you can install a plugin package that advertises a DataManager implementation
# endpoint under the "genomekit.plugins.data_manager" group. GenomeKit will automatically
# use the plugin's data manager.
# (see https://setuptools.pypa.io/en/latest/userguide/entry_point.html#entry-points-for-plugins)

Logging and troubleshooting

Set the GENOMEKIT_QUIET to any value to suppress logging output.

Set the GENOMEKIT_TRACE to any value to enable trace logging.