Developers

Developing for GenomeKit currently requires a C++ compiler that supports C++20 features. On Linux and macOS, a suitable compiler can be installed via conda (gxx_linux-64 and clangxx_osx-64, respectively). On macOS, you will also need the SDK, available via xcode-select --install or from https://developer.apple.com/download/all/.

Setting up

Clone the source tree:

git clone git@github.com:deepgenomics/GenomeKit.git

From the GenomeKit directory, install the provided conda environment which contains all dependencies:

conda env create -f genomekit_dev.yml
conda activate genomekit_dev

On Windows, you’ll need to comment out the macOS/Linux-only test dependencies in genomekit_dev.yml.

On M1 Macs, you might need to set up the environment differently:

conda create -n cxx cxx-compiler zlib
conda activate cxx
conda install -c conda-forge -c bioconda --file a-file-with-the-deps-from-genomekit_dev-yml.txt
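The file passed to --file above is a placeholder, not something shipped in the repository. One way to produce it is to copy the top-level entries from the dependencies list of genomekit_dev.yml. The sketch below assumes the file follows the usual conda environment layout (a top-level dependencies: key followed by a list); the function name is hypothetical, and nested pip entries are skipped on purpose.

```python
def conda_deps_from_env_yml(text):
    """Return the top-level conda package specs listed under 'dependencies:'.

    Assumes a conventional conda environment file; this is a line-based
    sketch, not a full YAML parser.
    """
    deps = []
    in_deps = False
    item_indent = None
    for raw in text.splitlines():
        # Drop trailing comments and skip blank lines.
        line = raw.split("#", 1)[0].rstrip()
        if not line.strip():
            continue
        stripped = line.lstrip()
        indent = len(line) - len(stripped)
        if not in_deps:
            in_deps = indent == 0 and stripped == "dependencies:"
            continue
        if not stripped.startswith("-"):
            if indent == 0:
                break  # reached the next top-level key
            continue
        if item_indent is None:
            item_indent = indent  # indentation of the first list item
        if indent != item_indent:
            continue  # nested entry, e.g. a package under "pip:"
        spec = stripped[1:].strip()
        if spec.endswith(":"):
            continue  # a mapping such as "pip:", not a package spec
        deps.append(spec.strip("'\""))
    return deps
```

Writing the returned specs one per line to a text file gives something suitable for conda install --file. Entries that are unavailable on the platform would still need to be removed by hand.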

Build the package in development mode:

pip install -e .

This builds the C++ extension and copies it into your source tree (genome_kit/_cxx.so). It also ensures that import genome_kit works from any directory by linking your source tree from Python’s site-packages.
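To confirm that the development install resolves to your source tree, you can ask Python where it would import a package from. This is a minimal sketch using only the standard library; module_location is a hypothetical helper name, not part of GenomeKit.

```python
import importlib.util
from pathlib import Path


def module_location(name):
    """Return the directory a top-level package would be imported from.

    Returns None if the package cannot be found or has no file location.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.has_location or spec.origin is None:
        return None
    return Path(spec.origin).resolve().parent
```

After pip install -e . the result of module_location("genome_kit") would be expected to point inside the cloned GenomeKit directory rather than at a copy in site-packages.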

Note

Windows Prerequisites

You will need VS 2019 or newer installed. To get a compatible shell, either locate and run vcvars64.bat, or start the x64 Native Tools Command Prompt from the Start menu. To open VS with a preconfigured project, run the following directly in that command prompt:

.vcproj\genome_kit.sln

Finally, run all the tests:

python -m unittest discover

You can also run examples from the demos directory.

JetBrains CLion setup

In the CMake settings, set the following environment variables:

IN_CLION=1;CONDA_PREFIX=$HOME/conda/envs/genomekit_dev
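The path above assumes conda lives under $HOME/conda. If your installation is elsewhere, the value can be derived from the CONDA_PREFIX variable that conda sets when an environment is activated. A small sketch, with a hypothetical helper name:

```python
import os


def clion_cmake_environment(conda_prefix):
    """Format the environment string for CLion's CMake settings."""
    return f"IN_CLION=1;CONDA_PREFIX={conda_prefix}"


if __name__ == "__main__":
    # CONDA_PREFIX is set by `conda activate`; run this inside genomekit_dev.
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix:
        print(clion_cmake_environment(prefix))
    else:
        print("No conda environment is active.")
```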

Making changes

If the C/C++ code changed, you must re-run the develop command:

pip install -e .

This applies after switching branches, merging changes, or editing the C/C++ code yourself. Forgetting this step may lead to unpredictable behaviour.
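Because this step is easy to forget, a quick staleness check can help. The sketch below compares file modification times and rests on assumptions: that the C/C++ sources live under src (where src/main.cpp is), that the built extension is genome_kit/_cxx.so, and that the listed suffixes cover the relevant files. It is a heuristic only, and the helper name is hypothetical.

```python
from pathlib import Path

# Assumed set of C/C++ source suffixes; adjust to match the source tree.
CXX_SUFFIXES = {".c", ".cc", ".cpp", ".h", ".hpp"}


def extension_is_stale(extension, source_dir, suffixes=CXX_SUFFIXES):
    """Return True if the extension is missing or older than any source file."""
    extension = Path(extension)
    if not extension.exists():
        return True
    built = extension.stat().st_mtime
    return any(
        path.suffix in suffixes and path.stat().st_mtime > built
        for path in Path(source_dir).rglob("*")
        if path.is_file()
    )
```

For example, extension_is_stale("genome_kit/_cxx.so", "src") returning True suggests that pip install -e . should be re-run. A False result is not proof that the build is current, since modification times can be misleading.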

Tip

To speed up compilation on Ubuntu or Mac, install ccache.

Before checking in any changes, run all tests locally:

python -m unittest discover

Adding tests

Tests are located in the tests directory, and any data they need is located in the tests/data directory.

While developing a test, you may want to run it repeatedly, without all other tests. For example, to run just the TestInterval.test_serialize method in tests/test_interval.py use:

python -m unittest tests.test_interval.TestInterval.test_serialize
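The same dotted name can also be resolved from Python, which is convenient when iterating from a script or an interactive session. The sketch below uses only the standard unittest loader and runner. run_single_test is a hypothetical helper, and TestExample is a stand-in test case that is not part of GenomeKit.

```python
import unittest


def run_single_test(dotted_name, module=None):
    """Load one test by dotted name, run it, and return True if it passed.

    If `module` is given, `dotted_name` is resolved relative to it;
    otherwise the leading parts of the name are imported as a module.
    """
    suite = unittest.defaultTestLoader.loadTestsFromName(dotted_name, module)
    result = unittest.TextTestRunner(verbosity=2).run(suite)
    return result.wasSuccessful()


class TestExample(unittest.TestCase):
    """Stand-in test case used only to illustrate the naming scheme."""

    def test_roundtrip(self):
        self.assertEqual(int(str(42)), 42)
```

From the GenomeKit directory, run_single_test("tests.test_interval.TestInterval.test_serialize") would be the counterpart of the command above.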

C++ tests

To test C++ code directly, you can compile and run src/main.cpp:

cmake -DCMAKE_BUILD_TYPE=Debug -B unittestbuild
cmake --build unittestbuild --parallel --verbose --target main test

Debugging tests

Define the environment variable GK_DEBUGBREAK to break on GK_CHECK failures when running under a debugger.

Building data files

GenomeKit relies on many pre-built files. For example, the binary annotation gencode.v19.annotation.dganno is built from gencode.v19.annotation.gff3.gz. Reasons to re-build these files include:

Changes to the binary file format.
Updates to the source data.
Changes to the processing of source data.

GenomeKit has two sets of data files:

Full data files are for normal use. They are stored remotely in the GenomeKit store and pulled to the user’s local file system on-demand.
Test data files are for testing. They are tiny excerpts of the full files, small enough to check into source control and fast enough to run in continuous integration testing. They are stored in the source tree under tests/data.

The genome_kit module’s build command can be used to build full Appris/MANE data files, and Appris/MANE/dganno/2bit test data files.

For a full set of options, run:

python -m genome_kit build --help

Building full data files

For instructions on how to build annotation (dganno) files and assembly (2bit) files, see Genomes.

Full-sized data files reside in a local user directory reserved for GenomeKit, downloaded from the data store on-demand.

Note

See the API Documentation for instructions on how to build data tracks, read alignments, read distributions, junction read alignments, and VCF tables.

Building test data files

Test data files reside in the source tree under tests/data. To build them, you must have registered your source tree in develop mode:

pip install -e .

Now that your source tree is the default genome_kit import, the build subcommand will be able to find your test data directory.

To build test annotation, 2bit, Appris, and MANE files, use --test-<type> flags on the build subcommand:

python -m genome_kit build --test-anno --test-2bit --test-appris --test-mane
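When only some test files need rebuilding, the command line can be assembled from the list of types. The sketch below uses only the four --test-<type> flags shown above; the helper name and the validation are illustrative, and python -m genome_kit build --help remains the authoritative list of options.

```python
import sys

# The four test data types named in the command above.
TEST_DATA_TYPES = ("anno", "2bit", "appris", "mane")


def build_test_data_command(types=TEST_DATA_TYPES):
    """Return the argument list for building the given test data types."""
    unknown = [t for t in types if t not in TEST_DATA_TYPES]
    if unknown:
        raise ValueError(f"unknown test data type(s): {unknown}")
    flags = [f"--test-{t}" for t in types]
    return [sys.executable, "-m", "genome_kit", "build", *flags]
```

The returned list can be passed to subprocess.run with check=True, for example build_test_data_command(["anno", "2bit"]) to rebuild only the annotation and 2bit test files.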

Releasing GenomeKit

The GenomeKit repo uses the Release Please bot to create GitHub releases based on PRs. When the bot creates a PR, you can merge it to create a release.

Once a GitHub release is created, regro-cf-autotick-bot automatically opens a PR in the GenomeKit conda-forge feedstock repo. Once that PR is merged, conda-forge’s CI pipeline builds the new version of GenomeKit and publishes it to conda-forge.