Genomes

GenomeKit now supports building and using assemblies and annotations. Assemblies follow the UCSC format, and annotations must be specified in GENCODE, Ensembl, or NCBI GFF3 format.

Examples

Clone the GenomeKit git repo to see scripts under data-src/ for examples of how to build annotation data files.

git clone https://github.com/deepgenomics/GenomeKit.git
pushd GenomeKit

Scripts under data-src are used to obtain and generate the data files:

data-src/<assembly>/assembly for the assembly, e.g. data-src/hg19/assembly
data-src/<assembly>/<annotation-source>/<annotation> for an annotation, e.g. data-src/hg19/GENCODE/v26lift37

Assemblies

Generate a hash file

echo $(python -c 'import genome_kit as gk; print(gk.Genome._refg_hash("hg19"))') > hg19.hash

(replace hg19 with the desired assembly name)

Copy the 2bit, chrom.sizes, and chromAlias.txt files from the UCSC download directory for your assembly (for hg38, https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/), together with the hash file you generated, into the directory printed by:
```
python -c 'import os ; import appdirs ; print(os.environ.get("GENOMEKIT_DATA_DIR", appdirs.user_data_dir("genome_kit")))'
```
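The copy step can also be scripted. The following is a minimal sketch, not part of GenomeKit: `resolve_data_dir` and `install_files` are hypothetical helper names, and the data directory lookup simply mirrors the one-liner above (`GENOMEKIT_DATA_DIR` if set, otherwise the `appdirs` user data directory).

```python
import os
import shutil
from pathlib import Path


def resolve_data_dir() -> Path:
    """Mirror the one-liner above: GENOMEKIT_DATA_DIR wins, else appdirs."""
    override = os.environ.get("GENOMEKIT_DATA_DIR")
    if override is not None:
        return Path(override)
    import appdirs  # third-party; only imported when the variable is unset
    return Path(appdirs.user_data_dir("genome_kit"))


def install_files(paths, data_dir: Path) -> list:
    """Copy every file in `paths` into `data_dir` (hypothetical helper)."""
    sources = [Path(p) for p in paths]
    # Fail before copying anything if an input is missing.
    missing = [str(p) for p in sources if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing input files: {missing}")
    data_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(p, data_dir)) for p in sources]
```

A call would pass the downloaded 2bit, chrom.sizes, and chromAlias.txt files plus the generated hash file, for example `install_files(["hg19.2bit", "hg19.chrom.sizes", "hg19.chromAlias.txt", "hg19.hash"], resolve_data_dir())`. The file names in that call are illustrative and assume the files sit in the current directory.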

If you need to generate from a fasta:
1. ```
conda create -n ucsc-tools ucsc-fatotwobit ucsc-twobitinfo
```
2. ```
conda activate ucsc-tools
```
3. follow the instructions at https://genome.ucsc.edu/goldenPath/help/twoBit.html
4. optionally create a chromAlias.txt with any contig aliases required.
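A chrom.sizes file is a tab-separated listing of sequence names and their lengths. As a cross-check on the output of the UCSC tools, the sizes can be recomputed directly from the FASTA. This is a pure-Python sketch with hypothetical function names; it is not how GenomeKit or the UCSC tools compute the file, and it assumes the sequence name is the first word of each header line.

```python
def fasta_chrom_sizes(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    sizes = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a sequence name")
            name = fields[0]  # assumption: name is the first word of the header
            if name in sizes:
                raise ValueError(f"duplicate sequence name: {name}")
            sizes[name] = 0
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            sizes[name] += len(line)
    return sizes


def format_chrom_sizes(sizes):
    """Render sizes as tab-separated 'name<TAB>length' lines."""
    return "".join(f"{name}\t{length}\n" for name, length in sizes.items())
```

For a real assembly, open the FASTA and pass the file object, e.g. `fasta_chrom_sizes(open("my_assembly.fa"))`, then compare the result with the chrom.sizes you generated.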

Annotations

python -c 'import genome_kit as gk; print(gk.GenomeAnnotation.build_gencode("MY_ANNO.gff3", "MY_ANNO", gk.Genome("MY_ASSEMBLY")))'

Copy the resulting files into the directory printed by:
```
python -c 'import appdirs; print(appdirs.user_data_dir("genome_kit"))'
```

The .dganno file contains the compiled GFF3 and the .cfg file contains metadata, such as refg=hg38.
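To inspect the metadata before copying the files, a small reader is enough. The sketch below assumes the .cfg file consists of plain `key=value` lines, which is inferred only from the `refg=hg38` example above; the function name `parse_cfg` and the handling of blank and `#` lines are assumptions, not GenomeKit behavior.

```python
def parse_cfg(text):
    """Parse 'key=value' lines into a dict (assumed format, see lead-in)."""
    meta = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comments may be skipped.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {raw!r}")
        meta[key.strip()] = value.strip()
    return meta
```

Checking that `parse_cfg(...)["refg"]` matches the assembly you built against is a quick way to catch an annotation compiled for the wrong reference genome.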

APPRIS / MANE

When adding an annotation, you can also generate APPRIS/MANE data files for it if public data is available.

APPRIS files are available at https://apprisws.bioinfo.cnio.es/pub/releases/. As noted in the GenomeKit source code, we maintain a partial archive of APPRIS releases.

GENCODE

For GENCODE annotations, first find the matching Ensembl release. For example, for GENCODE v47, the matching Ensembl release is 113. So you’ll need to find an APPRIS release that includes e113. The earliest release that includes e113 is 2024_10.v49 (e113v49).

If our archive doesn’t already include this release, you’ll need to download the release and add it to the archive.

RefSeq

You can similarly find the matching release for RefSeq annotations. For example, for RefSeq v110, the earliest APPRIS release that includes rs110 is 2023_05.v48 (rs110v48).
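The tags in parentheses combine the source release with the APPRIS version. A minimal sketch of that naming, inferred only from the two examples above (`e113v49` and `rs110v48`); the function name and the assumed `YYYY_MM.vNN` release pattern are illustrative and may not cover every APPRIS release:

```python
import re


def appris_tag(source_prefix, source_release, appris_release):
    """Build a tag such as 'e113v49' from ('e', 113, '2024_10.v49')."""
    # Assumption: APPRIS releases are named 'YYYY_MM.vNN'.
    match = re.fullmatch(r"\d{4}_\d{2}\.(v\d+)", appris_release)
    if match is None:
        raise ValueError(f"unexpected APPRIS release name: {appris_release!r}")
    return f"{source_prefix}{source_release}{match.group(1)}"


# The two examples from the text:
print(appris_tag("e", 113, "2024_10.v49"))   # Ensembl 113 (GENCODE v47)
print(appris_tag("rs", 110, "2023_05.v48"))  # RefSeq v110
```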

MANE

For MANE releases, search through the versions on https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/. Each MANE release includes a README_versions.txt that lists the related Ensembl and RefSeq releases.

For help on building APPRIS/MANE:

python -m genome_kit build --help